openzim / node-libzim

Libzim binding for Node.js: read/write ZIM files in Javascript
https://www.npmjs.com/package/@openzim/libzim
GNU General Public License v3.0
27 stars 11 forks source link

Move to libzim7 #69

Closed kelson42 closed 2 years ago

kelson42 commented 3 years ago

Libzim7 has just been published and provides a lot of improvements. But the API has slightly changed, so we need to adapt our code. Probably also a good opportunity to clean a few things in the CI and the doc.

kelson42 commented 3 years ago

@kelvinhammond Would you be interested to lead this. You work on https://github.com/openzim/node-libzim/pull/35 was great an much appreciated. A bounty could be organised.

kelvinhammond commented 3 years ago

@kelvinhammond Would you be interested to lead this. You work on https://github.com/openzim/node-libzim/pull/35 was great an much appreciated. A bounty could be organized.

I'll look at it, will make a PR if I do it or reply back to you

kelson42 commented 3 years ago

@kelvinhammond Thank you very much!

kelvinhammond commented 3 years ago

@kelson42 I'll start on this most likely around October 27th, 2021. I don't think this will take too long to do

kelson42 commented 3 years ago

@kelvinhammond Really glad to hear it. Thank you again very much. Let me know if you have questions. We have just published the Python libzim binding so we have two developers who have done something similar. You can also write to me an email at kelson at kiwix.org to get an invitation to our Slack. The last week of october is really a good time as we have a meetup at that time. Therfore people will be extra available to give help or test things.

kelvinhammond commented 3 years ago

@kelson42 Can you re-invite me to collaborate on this project?

kelson42 commented 3 years ago

@kelvinhammond done

kelvinhammond commented 2 years ago

@kelson42 I'm looking into how I can implement the bindings for the creator efficiently (least amount of code) as you can subclass the Item(s) class, based on what I saw in the python module.

What do you think of this for the new module exports / api? This is a rough draft based on what I understand so far about the new api.

Questions:


class ZimFileFormatError;
class InvalidType;
class EntryNotFound;

class Blob; // pretty much same as before / item
class UUID;
class Item;
class Entry;
class Query;
class Searcher;
class Search;
class SuggestionSearch; // same as Search implementation.

class SearchResult {
  // mapped to `SearchIterator` c++ class internally
  // stores a reference to it.
}

class SearchResultSet {
  size();
  [Symbol.iterator]: -> SearchResult // see EntryRange notes
}

const EntryOrder = Object.freeze({
  pathOrder: Symbol("pathOrder"),
  titleOrder: Symbol("titleOrder"),
  efficientOrder: Symbol("efficientOrder")
});

class EntryRange {
  size();
  offset(start, maxResults) -> EntryRange
  [Symbol.iterator]: -> Entry // for iterating, need to verify this will work because the function scope require an iterator object.
};

class Archive {
  // NOTE: file descriptor constructors skipped, will add if needed
  constructor(path: string) { }

  getFilename() -> string
  // ... rest of accessor functions

  iterByPath() -> EntryRange
  // allows `for (const entry of archive.iterByPath()) { }
  // ... rest of accessor iter functions
};

// This needs some work, and I need to grok things a bit better still
// The problem is making sure the item is still available after the archive has
class StringProvider;
class FileProvider;

class WriterItem {
  // user provided functions with sane defaults
  // for getContentProvider, internally will return a new ContentProvider each time with a unique pointer
  getContentprovider() -> String or File providers instance
  getHints() -> Object{ [Hints.FRONT_ARTICLE]: True } // hints object to be implemented similar to `EntryOrder` enum
};

class Creator {
  // on addItem etc a new item is created from the JS world Item Object
};
kelson42 commented 2 years ago

@kelvinhammond Great to see progress on this. I will let @rgaudin and @mgautierfr answering your technical questions. They have respectively built/updated the Python binding and the libzim(7). You should be in best hands :)

rgaudin commented 2 years ago

In any way, I think @mgautierfr should be of better help.

mgautierfr commented 2 years ago

Hi @kelvinhammond, glade to see you working on this.

I'm not sure to understand correctly your question about reference and searchiterater (the notion of reference may be a bit different between cpp and nodejs) But here some general consideration to have in mind :

if (entry) { // do stuff with entry pointer }

// you can even return the pointer, to store it somewhere without copying the entry itself (it is a kind of reference) return entry;

This is what is used in python wrapper.

So for the reference question, it depends of what you call reference.
If this is just a pointer pointing to a `SearchIterator` but not keeping the `SearchIterator` alive, no you cannot.
If the reference is some kind of higher nodejs construction which keep the `SearchIterator` alive (as `shared_ptr`/`unique_ptr` do in cpp), yes, it is ok.

---

```cpp
// This needs some work, and I need to grok things a bit better still
// The problem is making sure the item is still available after the archive has
class StringProvider;
class FileProvider;

On python side, we decide it would be too complicated to wrapp those classes correctly so we reimplement them. It is probably the same for nodejs. It is easier to implement a ContentProvider with a correct feed method that creating a cpp instance of StringProvider, giving it the string in a efficient/safe way.

kelson42 commented 2 years ago

@kelvinhammond Let me know if you need more things to discuss, we can invite you as well to our Slack channel to get a quicker/better way to discuss these complex topics.

kelvinhammond commented 2 years ago

@kelson42 mind if I add an in-progress tag to this ticket?

I'm a bit slow, as I'm working and life, but I am working on this.

kelson42 commented 2 years ago

@kelvinhammond Of course! Maybe we shoukd split this ticket in a few subs one?

kelvinhammond commented 2 years ago

FYI, updating dependencies as part of this

kelson42 commented 2 years ago

@kelvinhammond If you have something, feel free to open a draft PR.

kelvinhammond commented 2 years ago

@kelson42 How do you feel about this being header only? I'll make a PR once I'm to that point, I ran into some issues with error catching and had to rewrite some things. https://github.com/nodejs/node-addon-api/issues/1104

kelson42 commented 2 years ago

@kelvinhammond Nice to see you are still on it. I know this is not an easy job. Let me know if you need anything I could help with.

kelvinhammond commented 2 years ago

Progress so far:

Done plus notes:

Changes: A lot of functions use accessor methods for functions that take no params and return a value. Example:

item.index // is item.getIndex()
archive.check() // is not an accessor because it does something and returns a value so its a method

Todo:

kelson42 commented 2 years ago

@rgaudin @mgautierfr Any feedback, in particular regarding the readon,y zim::blob. If possible we shoukd avoid the copy, and I’m sure you had a similar problem with pylibzim.

kelvinhammond commented 2 years ago

Working on a ContentProvider binding implementation. I'm thinking of following the python binding's method for the Item but allowing custom ContentProvider implementations somehow by calling javascript functions but I need to wrap the ContentProvider with a custom JavascriptContentProviderWrapper or whatever class. Any input before I proceed.

FYI: modified the Writer.startZimCreation in the binding so that it returns this (Creator&). I think the libzim project should do something similar @kelson42 && @mgautierfr. This would allow a user to chain config and then start all in one line(s).

kelvinhammond commented 2 years ago

Also of note, the long delays here are because of work, life, et cetera. Working on this on and off, almost done :D

kelson42 commented 2 years ago

@kelvinhammond Glad to read your last update. Without giving you a detailed feedback about the architecure change, I'm quite sure this is good to have a similar pattern like python-libzim. Because this has been strongly discussed and improved over the past two years, whereas the node-libzim is older and has been made by a junior dev.

We just need to remember that we will have to test the new node-libzim with mwoffliner and adapt it.

Regarding your proposal for libzim, please open a dedicated ticket on openzim/libzim so we can discuss it.

Thank you again for your effort. I can not wait to see the PR and test it wit MWoffliner. Wish you a good Christmas time.

kelvinhammond commented 2 years ago

Working through some threading issues here https://github.com/nodejs/node-addon-api/issues/1113

kelvinhammond commented 2 years ago

@mgautierfr @kelson42 After looking more into threading I have found a few issues.

Context

Issues

Possible Solution

Of Note

mgautierfr commented 2 years ago

It is a pity that js is singlethreaded only. This is almost the same with python which has a global lock to protect the whole interpreter (and the python structures) from race condition. But we can explicitly release the lock and acquire it when needed. This allow use to have libzim calling back python code from different threads (even if the python code is run single threaded). Maybe there is something equivalent in nodejs ?

Don't allow custom ContentProviders, only String and File providers for now. Does mwoffliner require a custom Provider?

It mainly depends on the user of the library (mwoffliner). The advantage of a custom providers (at least theoretically) is that the provider can be associated to a http request in the scrapper. Then the provider could return blobs to libzim while the download of the content is running. With a StringProvider, the scraper has to download all the content and after create the provider/item. But I don't know how mwoffliner is working. It is maybe a early optimization not needed.

Anyway, I think it is more important to have something working (so only with string and file providers) and then extend the wrapper to have the custom provider in a second time.

Item objects will be wrapped in a way that calls to getPath, getTitle, etc will be called on addItem, stored, and returned by the wrapper for Items so it won't be changeable or randomized. The methods are supposed to be const anyway so I don't see this being too much of an issue.

I agree. What is somehow important is the contentProvider and the indexData. Other attributes are small and can be stored in the item. This is what is made in BasicItem

kelvinhammond commented 2 years ago

@mgautierfr I figured out a way to allow multi-threading, they are still bound by the single threaded event loop in nodejs but they should be io bound during multi-thread processing. These changes are in the new PR.

@kelson42 I just need to write tests, add BasicItem, and then update the mwoffliner. I'll try to make more time to finish this.

kelson42 commented 2 years ago

@kelvinhammond Somgreat to hear this! You make my day! Wish you a good WE!

kelvinhammond commented 2 years ago

@kelson42 @mgautierfr Do we need to use / support typescript for this project? If not, I can omit and and just export the bindings without defining the typescript interfaces.

kelson42 commented 2 years ago

@kelvinhammond Yes, typescript is important. We do our best to write robust code.

kelvinhammond commented 2 years ago

Working on figuring out typescript still, I no longer need the .ts files. The zim.js ends up being the below because everything is in C++ now.

import bindings from 'bindings';

const {
    Archive,
    Entry,
    IntegrityCheck,
    Compression,
    Blob,
    Searcher,
    Query,
    SuggestionSearcher,
    Creator,
    StringProvider,
    FileProvider,
    StringItem,
    FileItem,
} = bindings('zim_binding');

module.exports = {
    Archive,
    Entry,
    IntegrityCheck,
    Compression,
    Blob,
    Searcher,
    Query,
    SuggestionSearcher,
    Creator,
    StringProvider,
    FileProvider,
    StringItem,
    FileItem,
};
kelvinhammond commented 2 years ago

@kelson42 How do I get into your slack? And who can help with typescript?

kelvinhammond commented 2 years ago

The problem I'm currently running into is that the @types/bindings https://www.npmjs.com/package/@types/bindings returns a binding of any for each of the imported classes and I can't easily override this as far as I can tell.

Do I just leave it as any which pretty much defeats the purpose of typescript or what?

import bindings from 'bindings';

const {
    Archive,
    Entry,
    IntegrityCheck,
    Compression,
    Blob,
    Searcher,
    Query,
    SuggestionSearcher,
    Creator,
    StringProvider,
    FileProvider,
    StringItem,
    FileItem,
} = bindings('zim_binding');

/** the following throws an error because its already defined: src/zim.ts:6:5 - error TS2451: Cannot redeclare block-scoped variable 'IntegrityCheck'.
declare class IntegrityCheck {
    static CHECKSUM: symbol;
    static DIRENT_PTRS: symbol;
    static DIRENT_ORDER: symbol;
    static TITLE_INDEX: symbol;
    static CLUSTER_PTRS: symbol;
    static DIRENT_MIMETYPES: symbol;
    static COUNT: symbol;
}
*/

export {
    Archive,
    Entry,
    IntegrityCheck,
    Compression,
    Blob,
    Searcher,
    Query,
    SuggestionSearcher,
    Creator,
    StringProvider,
    FileProvider,
    StringItem,
    FileItem,
}
kelvinhammond commented 2 years ago

This is messy but it works, I'll continue from here unless you have a better idea.

Normally I'd just do it in js and create a zim.d.ts file but then I'd need to pull in something to bundle the js files properly to the dist folder and I'd rather not do that.

import bindings from 'bindings';
const zim = bindings('zim_binding');

declare class IntegrityCheck {
    static CHECKSUM: symbol;
    static DIRENT_PTRS: symbol;
    static DIRENT_ORDER: symbol;
    static TITLE_INDEX: symbol;
    static CLUSTER_PTRS: symbol;
    static DIRENT_MIMETYPES: symbol;
    static COUNT: symbol;
}

const IntegrityCheckCls = <IntegrityCheck> zim.IntegrityCheck;

export {
    IntegrityCheckCls as IntegrityCheck,
}
kelson42 commented 2 years ago

@kelvinhammond I understand the problem with any but I'm far too incompetent to be able to discuss the solution with you. The overall approach is to be strict as much as possible, therefore our move a few years ago to typescript and my recommendation to use the types if possible. But if this is blocking you, I guess (like you looks like) we should move forward and open a dedicated ticket to handle this problem later. I would still stick as much as possible to the typescript formalism in general.

kelvinhammond commented 2 years ago

We'll still need typescript for mwoffliner, still trying to find something that works.

kelvinhammond commented 2 years ago

I figured something out, now for the tedious work of writing tests and defining classes in typescript.

kelvinhammond commented 2 years ago

@kelson42 I'll start on this most likely around October 27th, 2021. I don't think this will take too long to do

I don't think this will take too long to do

So that was a lie. PR here https://github.com/openzim/node-libzim/pull/72/files

kelson42 commented 2 years ago

@kelvinhammond To me it looks like you are making progress and things are on track even if this is widely more complex than you assessed at start. Right?

kelvinhammond commented 2 years ago

Yep, this version is pretty much done as far as I know so far.

kelson42 commented 2 years ago

@kelvinhammond Not that I believe that I could make a quality review of your code, but let me know explicitly please when you are over with your effort.

kelvinhammond commented 2 years ago

@kelson42 Its ready