Closed kelson42 closed 2 years ago
@kelvinhammond Would you be interested in leading this? Your work on https://github.com/openzim/node-libzim/pull/35 was great and much appreciated. A bounty could be organised.
I'll look at it, will make a PR if I do it or reply back to you
@kelvinhammond Thank you very much!
@kelson42 I'll start on this most likely around October 27th, 2021. I don't think this will take too long to do
@kelvinhammond Really glad to hear it. Thank you again very much. Let me know if you have questions. We have just published the Python libzim binding so we have two developers who have done something similar. You can also write me an email at kelson at kiwix.org to get an invitation to our Slack. The last week of October is really a good time as we have a meetup then. Therefore people will be extra available to give help or test things.
@kelson42 Can you re-invite me to collaborate on this project?
@kelvinhammond done
@kelson42 I'm looking into how I can implement the bindings for the creator efficiently (least amount of code) as you can subclass the Item(s) class, based on what I saw in the python module.
What do you think of this for the new module exports / api? This is a rough draft based on what I understand so far about the new api.
Questions:
Array.from(iter)
.
class ZimFileFormatError;
class InvalidType;
class EntryNotFound;
class Blob; // pretty much same as before / item
class UUID;
class Item;
class Entry;
class Query;
class Searcher;
class Search;
class SuggestionSearch; // same as Search implementation.
class SearchResult {
// mapped to `SearchIterator` c++ class internally
// stores a reference to it.
}
class SearchResultSet {
size();
[Symbol.iterator]: -> SearchResult // see EntryRange notes
}
const EntryOrder = Object.freeze({
pathOrder: Symbol("pathOrder"),
titleOrder: Symbol("titleOrder"),
efficientOrder: Symbol("efficientOrder")
});
class EntryRange {
size();
offset(start, maxResults) -> EntryRange
[Symbol.iterator]: -> Entry // for iterating, need to verify this will work because the function scope require an iterator object.
};
class Archive {
// NOTE: file descriptor constructors skipped, will add if needed
constructor(path: string) { }
getFilename() -> string
// ... rest of accessor functions
iterByPath() -> EntryRange
// allows `for (const entry of archive.iterByPath()) { }`
// ... rest of accessor iter functions
};
// This needs some work, and I need to grok things a bit better still
// The problem is making sure the item is still available after the archive has
class StringProvider;
class FileProvider;
class WriterItem {
// user provided functions with sane defaults
// for getContentProvider, internally will return a new ContentProvider each time with a unique pointer
getContentProvider() -> String or File providers instance
getHints() -> Object{ [Hints.FRONT_ARTICLE]: true } // hints object to be implemented similar to `EntryOrder` enum
};
class Creator {
// on addItem etc a new item is created from the JS world Item Object
};
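On the open question of whether `[Symbol.iterator]` can work for `EntryRange`: the iteration protocol only needs a method that returns an iterator object, and a generator method provides one. Below is a minimal TypeScript sketch under that assumption. `SketchEntry` and the array-backed `SketchEntryRange` are hypothetical stand-ins for the native classes, not the binding's actual implementation.

```typescript
// Hypothetical stand-in for a native Entry; only the fields needed here.
interface SketchEntry {
  path: string;
  title: string;
}

// Array-backed range; a real binding would walk the archive lazily instead.
class SketchEntryRange implements Iterable<SketchEntry> {
  private readonly entries: readonly SketchEntry[];

  constructor(entries: readonly SketchEntry[]) {
    this.entries = entries;
  }

  size(): number {
    return this.entries.length;
  }

  // Returns a new range starting at `start` with at most `maxResults` entries.
  offset(start: number, maxResults: number): SketchEntryRange {
    return new SketchEntryRange(this.entries.slice(start, start + maxResults));
  }

  // A generator method returns a fresh iterator object on every call,
  // which is all that for...of and Array.from require.
  *[Symbol.iterator](): IterableIterator<SketchEntry> {
    for (const entry of this.entries) {
      yield entry;
    }
  }
}

const range = new SketchEntryRange([
  { path: "A/a", title: "a" },
  { path: "A/b", title: "b" },
  { path: "A/c", title: "c" },
]);

// Iterate a sub-range with for...of.
const paths: string[] = [];
for (const entry of range.offset(1, 2)) {
  paths.push(entry.path);
}

// Array.from consumes the same protocol.
const titles = Array.from(range, (entry) => entry.title);
```

Because each call to `[Symbol.iterator]()` creates a new generator, the same range can be iterated more than once.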
@kelvinhammond Great to see progress on this. I will let @rgaudin and @mgautierfr answer your technical questions. They have respectively built/updated the Python binding and the libzim(7). You should be in the best hands :)
The entry iterators look fine. That's exactly what's needed. That's something we haven't implemented yet on pylibzim but you got it right.
My understanding of your question is that you would expose the `SearchResultSet` in JS and users would consume it. It doesn't seem appropriate as we don't want JS users to use this specific API (`.begin()` and all). If your question was whether to construct a JS iterator and store it there (in `Search`?), I'm not sure I understand the reason/consequences. Reading the MDN docs makes it look like we can provide something similar to what's done in Python.
In any case, I think @mgautierfr should be of better help.
Hi @kelvinhammond, glad to see you working on this.
I'm not sure I correctly understand your question about references and `SearchIterator` (the notion of a reference may be a bit different between cpp and nodejs). But here are some general considerations to keep in mind:
- If you get a `Blob` by following the chain `Archive->Entry->Item->Blob`, you can safely destroy the `Archive`/`Entry`/`Item` and still use the `Blob`.
- `Blob::data` returns a pointer to some data. This data is guaranteed to be available as long as the `Blob` is not destroyed. But if you destroy the blob, the memory may be reused for something else. So you can reference the data but you must keep the blob alive somewhere. Or you simply copy the data.
- Getting an `Entry` locates the entry in the archive but does not get the data. The data is fetched (potentially decompressed) only when you call `Item::getData`.
- Objects can be kept on the heap with smart pointers (`shared_ptr` or `unique_ptr`):
std::unique_ptr<zim::Entry> entry; // default constructed to nullptr
try {
entry = std::unique_ptr<zim::Entry>(new zim::Entry(archive.getEntryByPath("foo"))); // get a entry and "copy" it on the heap.
// with c++14 (or shared_ptr, even in c++11) it is easier as you can use `make_unique`/`make_shared` :
entry = std::make_unique<zim::Entry>(archive.getEntryByPath("foo"));
} catch (zim::EntryNotFound&) {
// Nothing, keep the entry set to nullptr
}
if (entry) {
  // do stuff with entry pointer
}
// you can even return the pointer, to store it somewhere without copying the entry itself (it is a kind of reference)
return entry;
This is what is used in the Python wrapper.
So for the reference question, it depends on what you call a reference.
If this is just a pointer pointing to a `SearchIterator` but not keeping the `SearchIterator` alive, no you cannot.
If the reference is some kind of higher-level nodejs construct which keeps the `SearchIterator` alive (as `shared_ptr`/`unique_ptr` do in cpp), yes, it is ok.
---
```cpp
// This needs some work, and I need to grok things a bit better still
// The problem is making sure the item is still available after the archive has
class StringProvider;
class FileProvider;
```
On the Python side, we decided it would be too complicated to wrap those classes correctly, so we reimplemented them. It is probably the same for nodejs.
It is easier to implement a `ContentProvider` with a correct `feed` method than to create a cpp instance of `StringProvider` and give it the string in an efficient/safe way.
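A rough TypeScript sketch of that reimplementation approach: a provider that reports its size and hands out the content chunk by chunk through `feed`, signalling the end with an empty chunk. The interface shape (`getSize`, `feed`, empty chunk as terminator) is assumed here from libzim's C++ `ContentProvider`; the names `SketchContentProvider`, `BytesProvider` and `drain` are hypothetical and not part of any binding.

```typescript
// Assumed shape, modeled on libzim's C++ ContentProvider.
interface SketchContentProvider {
  getSize(): number;
  feed(): Uint8Array; // an empty chunk signals the end of the content
}

class BytesProvider implements SketchContentProvider {
  private readonly data: Uint8Array;
  private readonly chunkSize: number;
  private position = 0;

  constructor(data: Uint8Array, chunkSize = 4) {
    this.data = data;
    this.chunkSize = chunkSize;
  }

  getSize(): number {
    return this.data.length;
  }

  feed(): Uint8Array {
    const end = Math.min(this.position + this.chunkSize, this.data.length);
    // subarray shares memory with `data`, so no copy is made per chunk.
    const chunk = this.data.subarray(this.position, end);
    this.position = end;
    return chunk;
  }
}

// Stand-in for what a creator would do: call feed() until it returns an empty chunk.
function drain(provider: SketchContentProvider): number[] {
  const out: number[] = [];
  for (let chunk = provider.feed(); chunk.length > 0; chunk = provider.feed()) {
    out.push(...Array.from(chunk));
  }
  return out;
}

const provider = new BytesProvider(Uint8Array.from([1, 2, 3, 4, 5, 6, 7, 8, 9]));
const received = drain(provider);
```

The same interface could later back a custom provider whose `feed` pulls from a download in progress, which is the use case discussed for scrapers.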
@kelvinhammond Let me know if you need more things to discuss, we can invite you as well to our Slack channel to get a quicker/better way to discuss these complex topics.
@kelson42 mind if I add an `in-progress` tag to this ticket?
I'm a bit slow because of work and life, but I am working on this.
@kelvinhammond Of course! Maybe we should split this ticket into a few sub-tickets?
FYI, updating dependencies as part of this
- `BigInt` support for the 64-bit integers (`size_t`) used by `zim::Archive`.
@kelvinhammond If you have something, feel free to open a draft PR.
@kelson42 How do you feel about this being header only? I'll make a PR once I get to that point. I ran into some issues with error catching and had to rewrite some things. https://github.com/nodejs/node-addon-api/issues/1104
@kelvinhammond Nice to see you are still on it. I know this is not an easy job. Let me know if you need anything I could help with.
Progress so far:
Done plus notes:
- `getMetadataItem`: must wait for the next version; it's implemented but commented out
- `getUuid`: implemented using `std::ostringstream`; the cast to `std::string` fails with `undefined symbol: _ZNK3zim4UuidcvNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEv`
- `zim::Blob` is `const`, so for now I copy it
Changes: A lot of functions that take no params and return a value are now exposed as accessors. Example:
item.index // is item.getIndex()
archive.check() // is not an accessor because it does something and returns a value so its a method
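To make that convention concrete, here is a small TypeScript sketch: a value that is simply read is exposed as a getter, while an operation that does work stays a method. `SketchItem` and `SketchArchive` are hypothetical stand-ins, not the binding's classes, and `checkCalls` exists only to show that the method runs each time it is called.

```typescript
class SketchItem {
  private readonly idx: number;

  constructor(idx: number) {
    this.idx = idx;
  }

  // Read-only accessor: `item.index` replaces `item.getIndex()`.
  get index(): number {
    return this.idx;
  }
}

class SketchArchive {
  private readonly checksumOk: boolean;
  checkCalls = 0;

  constructor(checksumOk: boolean) {
    this.checksumOk = checksumOk;
  }

  // Stays a method: it performs work every time it is called.
  check(): boolean {
    this.checkCalls += 1;
    return this.checksumOk;
  }
}

const item = new SketchItem(7);
const archive = new SketchArchive(true);

const index = item.index; // property read, no parentheses
const ok = archive.check(); // explicit call
```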
Todo:
@rgaudin @mgautierfr Any feedback, in particular regarding the read-only `zim::Blob`? If possible we should avoid the copy, and I'm sure you had a similar problem with pylibzim.
Working on a `ContentProvider` binding implementation. I'm thinking of following the Python binding's method for the Item but allowing custom `ContentProvider` implementations somehow by calling JavaScript functions, but I need to wrap the `ContentProvider` with a custom `JavascriptContentProviderWrapper` or whatever class. Any input before I proceed?
FYI: modified the `Writer.startZimCreation` in the binding so that it returns `this` (`Creator&`). I think the `libzim` project should do something similar @kelson42 && @mgautierfr. This would allow a user to chain config and then start, all in one statement.
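A minimal TypeScript sketch of what returning `this` enables. `SketchCreator` is a hypothetical stand-in that only records calls; the `config*` method names mirror libzim's creator configuration methods, but nothing here touches the real binding.

```typescript
class SketchCreator {
  readonly calls: string[] = [];

  configVerbose(verbose: boolean): this {
    this.calls.push(`configVerbose(${verbose})`);
    return this;
  }

  configNbWorkers(count: number): this {
    this.calls.push(`configNbWorkers(${count})`);
    return this;
  }

  // Returning `this` here is the change described above:
  // configuration and start can be written as one chained expression.
  startZimCreation(path: string): this {
    this.calls.push(`startZimCreation(${path})`);
    return this;
  }
}

const creator = new SketchCreator()
  .configVerbose(false)
  .configNbWorkers(2)
  .startZimCreation("test.zim");
```

Declaring the return type as `this` rather than `SketchCreator` keeps the chain correctly typed for subclasses as well.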
Also of note, the long delays here are because of work, life, et cetera. Working on this on and off, almost done :D
@kelvinhammond Glad to read your last update. Without giving you detailed feedback about the architecture change, I'm quite sure it is good to follow a pattern similar to python-libzim, because it has been thoroughly discussed and improved over the past two years, whereas node-libzim is older and was made by a junior dev.
We just need to remember that we will have to test the new node-libzim with mwoffliner and adapt it.
Regarding your proposal for libzim, please open a dedicated ticket on openzim/libzim so we can discuss it.
Thank you again for your effort. I cannot wait to see the PR and test it with MWoffliner. Wish you a good Christmas time.
Working through some threading issues here https://github.com/nodejs/node-addon-api/issues/1113
@mgautierfr @kelson42 After looking more into threading I have found a few issues.
- `Creator.finishZimCreation` blocks the calling thread, spawns other threads, and waits for those to finish.
- `AsyncWorker` to spawn a thread to call `finishZimCreation` and return a promise which I resolve once `finishZimCreation` has returned.
- `Item` and `ContentProvider` classes as they are on the javascript stack which can't be accessed outside the main thread without using a `TypedThreadSafeFunction`.
- `TypedThreadSafeFunction` works by adding a function to the javascript queue but it does not block. This function is not called until the queue gets to it, so if `finishZimCreation` runs on the javascript thread the program is locked and will never finish because the queue never runs the `TypedThreadSafeFunction` on the queue.
- `std::future` to get the results from the `TypedThreadSafeFunction`
- `ContentProvider`
- `ContentProvider`s in javascript would help with this.
- `ContentProvider`s, only `String` and `File` providers for now. Does `mwoffliner` require a custom `Provider`?
- `Item` objects will be wrapped in a way that calls to `getPath`, `getTitle`, etc will be called on `addItem`, stored, and returned by the wrapper for `Item`s so it won't be changeable or randomized. The methods are supposed to be `const` anyway so I don't see this being too much of an issue.
- `mwoffliner` to in order to test it?
It is a pity that js is single-threaded only. It is almost the same with Python, which has a global lock to protect the whole interpreter (and the Python structures) from race conditions. But we can explicitly release the lock and acquire it when needed. This allows us to have libzim calling back Python code from different threads (even if the Python code itself runs single-threaded). Maybe there is something equivalent in nodejs?
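The deadlock described in the list above can be imitated in plain TypeScript without any native code. The sketch below is a simulation only: `MiniQueue` is a hypothetical stand-in for the JavaScript callback queue, `enqueue` plays the role of a thread-safe function call, and the spin loop is bounded by `maxSpins` so the example terminates where the real program would hang.

```typescript
// Tiny stand-in for the JavaScript callback queue (hypothetical, single-threaded).
class MiniQueue {
  private pending: Array<() => void> = [];

  // Like a thread-safe function call: the callback is queued, nothing blocks.
  enqueue(callback: () => void): void {
    this.pending.push(callback);
  }

  // The event loop only gets here once the current synchronous code has returned.
  runPending(): void {
    while (this.pending.length > 0) {
      const callback = this.pending.shift();
      if (callback) callback();
    }
  }
}

// Waits for the callback on the same thread that would have to run it.
// A real wait would never end; `maxSpins` bounds it so the sketch terminates.
function finishOnJsThread(queue: MiniQueue, maxSpins: number): boolean {
  const state = { itemRead: false };
  queue.enqueue(() => {
    state.itemRead = true;
  });
  for (let spins = 0; spins < maxSpins && !state.itemRead; spins++) {
    // busy-waiting: the queue is never drained while we are stuck here
  }
  return state.itemRead;
}

// Lets the queue run before checking, as happens when the wait is on a worker thread.
function finishOffJsThread(queue: MiniQueue): boolean {
  const state = { itemRead: false };
  queue.enqueue(() => {
    state.itemRead = true;
  });
  queue.runPending(); // stands in for the event loop staying free to run callbacks
  return state.itemRead;
}

const deliveredWhileBlocking = finishOnJsThread(new MiniQueue(), 1000);
const deliveredAfterYielding = finishOffJsThread(new MiniQueue());
```

In the first function the callback is never delivered however long the loop spins; in the second it is delivered as soon as the queue is allowed to run.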
Don't allow custom ContentProviders, only String and File providers for now. Does mwoffliner require a custom Provider?
It mainly depends on the user of the library (mwoffliner). The advantage of custom providers (at least theoretically) is that the provider can be associated with an HTTP request in the scraper. Then the provider could return blobs to libzim while the download of the content is running. With a StringProvider, the scraper has to download all the content and only then create the provider/item. But I don't know how mwoffliner works. It may be a premature optimization that is not needed.
Anyway, I think it is more important to have something working (so only with string and file providers) and then extend the wrapper with custom providers later.
Item objects will be wrapped in a way that calls to getPath, getTitle, etc will be called on addItem, stored, and returned by the wrapper for Items so it won't be changeable or randomized. The methods are supposed to be const anyway so I don't see this being too much of an issue.
I agree. What really matters is the contentProvider and the indexData. The other attributes are small and can be stored in the item. This is what is done in `BasicItem`.
@mgautierfr I figured out a way to allow multi-threading, they are still bound by the single threaded event loop in nodejs but they should be io bound during multi-thread processing. These changes are in the new PR.
@kelson42 I just need to write tests, add `BasicItem`, and then update the mwoffliner. I'll try to make more time to finish this.
@kelvinhammond So great to hear this! You made my day! Wish you a good WE!
@kelson42 @mgautierfr Do we need to use / support typescript for this project? If not, I can omit it and just export the bindings without defining the typescript interfaces.
@kelvinhammond Yes, typescript is important. We do our best to write robust code.
Working on figuring out typescript still, I no longer need the `.ts` files. The `zim.js` ends up being the below because everything is in C++ now.
import bindings from 'bindings';
const {
Archive,
Entry,
IntegrityCheck,
Compression,
Blob,
Searcher,
Query,
SuggestionSearcher,
Creator,
StringProvider,
FileProvider,
StringItem,
FileItem,
} = bindings('zim_binding');
module.exports = {
Archive,
Entry,
IntegrityCheck,
Compression,
Blob,
Searcher,
Query,
SuggestionSearcher,
Creator,
StringProvider,
FileProvider,
StringItem,
FileItem,
};
@kelson42 How do I get into your slack? And who can help with typescript?
The problem I'm currently running into is that `@types/bindings` (https://www.npmjs.com/package/@types/bindings) returns a binding of `any` for each of the imported classes and I can't easily override this as far as I can tell.
Do I just leave it as `any`, which pretty much defeats the purpose of typescript, or what?
import bindings from 'bindings';
const {
Archive,
Entry,
IntegrityCheck,
Compression,
Blob,
Searcher,
Query,
SuggestionSearcher,
Creator,
StringProvider,
FileProvider,
StringItem,
FileItem,
} = bindings('zim_binding');
/** the following throws an error because its already defined: src/zim.ts:6:5 - error TS2451: Cannot redeclare block-scoped variable 'IntegrityCheck'.
declare class IntegrityCheck {
static CHECKSUM: symbol;
static DIRENT_PTRS: symbol;
static DIRENT_ORDER: symbol;
static TITLE_INDEX: symbol;
static CLUSTER_PTRS: symbol;
static DIRENT_MIMETYPES: symbol;
static COUNT: symbol;
}
*/
export {
Archive,
Entry,
IntegrityCheck,
Compression,
Blob,
Searcher,
Query,
SuggestionSearcher,
Creator,
StringProvider,
FileProvider,
StringItem,
FileItem,
}
This is messy but it works, I'll continue from here unless you have a better idea.
Normally I'd just do it in js and create a `zim.d.ts` file but then I'd need to pull in something to bundle the js files properly to the dist folder and I'd rather not do that.
import bindings from 'bindings';
const zim = bindings('zim_binding');
declare class IntegrityCheck {
static CHECKSUM: symbol;
static DIRENT_PTRS: symbol;
static DIRENT_ORDER: symbol;
static TITLE_INDEX: symbol;
static CLUSTER_PTRS: symbol;
static DIRENT_MIMETYPES: symbol;
static COUNT: symbol;
}
const IntegrityCheckCls = <IntegrityCheck> zim.IntegrityCheck;
export {
IntegrityCheckCls as IntegrityCheck,
}
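One possible way to avoid both the redeclaration error and `any`, sketched under assumptions: describe the static side of the native class in an interface with a different name, and annotate the value once at the module boundary. `loadNativeSketch` is a hypothetical stand-in for `bindings('zim_binding')` so that the example runs without the addon; only two of the constants are reproduced.

```typescript
// Stand-in for `bindings('zim_binding')`, which is typed as `any`.
// The returned object is hypothetical; a real addon would supply the native class.
function loadNativeSketch(): any {
  class IntegrityCheck {
    static CHECKSUM = Symbol("CHECKSUM");
    static DIRENT_PTRS = Symbol("DIRENT_PTRS");
  }
  return { IntegrityCheck };
}

// Describes the static side of the class (the constructor object), not an instance.
interface IntegrityCheckStatic {
  readonly CHECKSUM: symbol;
  readonly DIRENT_PTRS: symbol;
}

const native = loadNativeSketch();

// One annotation at the module boundary; everything downstream is typed.
// In a real module this constant would be the exported name.
const IntegrityCheck: IntegrityCheckStatic = native.IntegrityCheck;

const checksumKind: symbol = IntegrityCheck.CHECKSUM;
```

Since the interface and the constant live in different declaration spaces, the constant can keep the public name `IntegrityCheck` without colliding with a `declare class` of the same name.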
@kelvinhammond I understand the problem with any
but I'm far too incompetent to be able to discuss the solution with you. The overall approach is to be as strict as possible, hence our move to typescript a few years ago and my recommendation to use the types if possible. But if this is blocking you, I guess (as you seem to suggest) we should move forward and open a dedicated ticket to handle this problem later. I would still stick as much as possible to the typescript formalism in general.
We'll still need typescript for mwoffliner, still trying to find something that works.
I figured something out, now for the tedious work of writing tests and defining classes in typescript.
@kelson42 I'll start on this most likely around October 27th, 2021. I don't think this will take too long to do
So that was a lie. PR here https://github.com/openzim/node-libzim/pull/72/files
@kelvinhammond To me it looks like you are making progress and things are on track, even if this is far more complex than you estimated at the start. Right?
Yep, this version is pretty much done as far as I know so far.
@kelvinhammond Not that I believe that I could make a quality review of your code, but please let me know explicitly when you are done with your effort.
@kelson42 It's ready
Libzim7 has just been published and provides a lot of improvements. But the API has slightly changed, so we need to adapt our code. Probably also a good opportunity to clean a few things in the CI and the doc.