openzim / node-libzim

Libzim binding for Node.js: read/write ZIM files in Javascript
https://www.npmjs.com/package/@openzim/libzim
GNU General Public License v3.0
26 stars 10 forks source link

Re-architecture the node binding #36

Open kelson42 opened 4 years ago

kelson42 commented 4 years ago

The current node binding does not respect the libzim reader/writer patterns. This should be rewritten to respect it (like the the python-libzim binding does).

After this is doing, we could probably provide high level abstraction of it in a dedicated package (like the python_scraperlib) to simplify the life of scraper devs.

kelson42 commented 4 years ago

@mgautierfr I would be happy if you could give a few guidelines/things to respect to do that work properly.

mgautierfr commented 4 years ago

First, you have to consider the reading part and the writing part as two different libraries. They are in the same repository and share few common structure name (as Article) but they are totally different (in usage and code behind).

About blob.

Blob is a proxy to a content store somewhere. It mostly contains a pointer to the data and a size. It is cheap to copy.

On reading side, the wrapper should avoid to copy the data pointed by the blob. If it is not possible, the data must be copied the later possible, and only when actually needed. The best would be to "convert" the blob to the equivalent in node world (I don't know if it is possible, we have memoryview in python for this). If the data is not copied but referenced, the wrapper must ensure that the pointed data is not deleted while node is using it. To do so, it is enought to keep a copy of the blob object somewhere as long as node is using (may use) the pointed data.

When creating zim file, the wrapper must create a blob (when needed) pointing to the data and pass it to c++. You must ensure that the data is not deleted by node as long as c++ use the blob. (You are sure that the blob will not live longer than the article. So it is enough, and easier, to ensure that the pointed data existing as long as the article exists).

kelvinhammond commented 4 years ago

@mgautierfr I have a few questions about libzim's architecture regarding this.

  1. Why aren't zim::Article and zim::writer::Article sharing a common subclass such that zim::writer::Article can be polymorphic and interfaces can simply use zim::Article *?
  2. Why is there no way to create a zim::Blob and have it manage the data stored internally instead of an external pointer to the data? Internally libzim can do this but externally one cannot.
  3. There is a way, if I recall correctly, to store data in napi and have a callback function handle its deletion if this is useful.
mgautierfr commented 4 years ago

Why aren't zim::Article and zim::writer::Article sharing a common subclass such that zim::writer::Article can be polymorphic and interfaces can simply use zim::Article *?

Because they are nothing in common (even if few methods have the same name). When you use inheritance, B inherits A, so B is a specific A. Then you can pass a B to a code waiting for a A. Here zim::writer::Article is not a a specific zim::Article (libzim cannot give you a writer article when you are asking for a specific article of a zim file). The same way, zim::Article is not a zim::writer::Article, they are missing methods (shouldIndex, shouldCompress, getRedirectUrl) and it make no sense to add theme to a (reading) article.
About an intermediate subclass, well, what would it be ? zim::writer::Article is just a collection of methods to implement, there is no code. And code of zim::Article is really specific to reading. Have a look for the getData implementation (https://github.com/openzim/libzim/blob/master/src/article.cpp#L198-L205). How this could be reuse by a writer article ?

Why is there no way to create a zim::Blob and have it manage the data stored internally instead of an external pointer to the data? Internally libzim can do this but externally one cannot.

This is a flaw in the API. I realized that when working with python wrapper. You can't create your own Blob managing its own data internally. (To be exact, you can create it (inherit from it), but you cannot pass it to the creator). This will probably be changed with https://github.com/openzim/libzim/issues/325. The signature of getData will be to return a pointer to a blob. The reading entry will return a specific blob using data in a zim file. On the writing side, user will have to create its own blob and manage the data the way specific to the usecase.

There is a way, if I recall correctly, to store data in napi and have a callback function handle its deletion if this is useful.

Yes. It is probably this system that should be use once we will fix the issue with the blob. The deletion of the data would be tied to the existance of the blob (or one of its copy). When the last copy of the blob is deleted, then napi would delete the data.

kelson42 commented 4 years ago

@kelvinhammond Would be happy to invite you to our Slack channel if you are interested.... because part of the discussion around node-libzim and MWoffliner just happen there. Just write me an email to kelson at kiwix . org to get an invitation.

kelson42 commented 2 years ago

@kelvinhammond To which extend #72 fixes this ticket?

kelvinhammond commented 2 years ago

@kelson42 You're still going to have issues with the copying, I haven't figured out a way to avoid that in nodejs. We now have the threading issue where we can't easily create custom content creators.

kelson42 commented 1 year ago

@kelvinhammond To my understanding, it looks like you have done most of the work requested here. If so, I woukd recommend to open dedicated tickets for the parts still to do and close this one. Do you agree?

kelson42 commented 1 year ago

@kelvinhammond Any feedback here please? Now that libzim 8.2.0 has been released, I will push the release of libzim-nodejs 3.0.0 (finally).

kelson42 commented 7 months ago

@kelson42 You're still going to have issues with the copying, I haven't figured out a way to avoid that in nodejs. We now have the threading issue where we can't easily create custom content creators.

@kelvinhammond We have #80 for the threading issue, but if you could open a ticket for the copying issue, I would be glad to close this old ticket which has been implemented to 99% thanks to your work.

kelson42 commented 2 weeks ago

@audiodude Issue probably worth a read