openzim / python-libzim

Libzim binding for Python: read/write ZIM files in Python
https://pypi.org/project/libzim/
GNU General Public License v3.0

When is content added to ZIM? #97

Closed rgaudin closed 3 years ago

rgaudin commented 3 years ago

@mgautierfr @kelson42 it seems that when creating an Item, the feed() method of its ContentProvider is only called within finishZimCreation() and not when calling add_item().

With such behavior, we cannot use the typical scraper flow:

with Creator(path) as c:
    for url in resources:
        fpath = download_video(url)
        post_process(fpath)
        c.add_item(StaticItem(path=fpath.name, filepath=fpath))
        os.unlink(fpath)  # would remove the resource and thus crash the finish step

Instead, we are forced to move all of our scraper's logic inside an Item/ContentProvider subclass, which comes with its own issues.

Note: these constraints arise because we want to get rid of the resource data as we go, in order to save disk space.

mgautierfr commented 3 years ago

Yes, it is a consequence of using multiple threads to get/compress the content. The item's content is added to the ZIM when a free thread handles the associated task. Items are not specifically added when we call finishZimCreation (but finishZimCreation does ensure/wait that all tasks are handled).
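The deferred behavior described here can be mimicked with a plain thread pool (a sketch, not libzim code): submitting a task only queues it, a worker thread handles it later, and the finish step is simply a wait for all queued tasks to complete.

```python
# Sketch (not libzim code): add_item() only queues work; the actual
# processing happens later on a worker thread, and the equivalent of
# finishZimCreation() waits for every queued task.
from concurrent.futures import ThreadPoolExecutor
import threading

processed = []
lock = threading.Lock()

def handle_task(name):
    # Stand-in for compressing/writing one item's content.
    with lock:
        processed.append(name)

pool = ThreadPoolExecutor(max_workers=2)
futures = [pool.submit(handle_task, f"item-{i}") for i in range(5)]

# Equivalent of finishZimCreation(): block until all tasks are handled.
pool.shutdown(wait=True)
```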

A ContentProvider is used only once. Once it has finished feeding the content (the generator ends), it can remove the content; it will never be asked to feed again. (But note that several ContentProviders may be created for the same item.)

What is done synchronously when you add an item (at least for now):

Then, later, on another thread:

When tasks are handled, references are dropped. In Python, that means the refcount may reach zero and Python's GC will delete the object. You may use a tempfile (https://docs.python.org/3/library/tempfile.html) to get a file object and store that file object in the item. When the item is garbage-collected, the file will be removed automatically.
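A minimal sketch of that tempfile trick (the StaticItem class here is a hypothetical stand-in, not the libzim one), relying on CPython's reference counting to close, and thereby delete, the temp file as soon as the last reference to the item is dropped:

```python
# Sketch of the tempfile trick: store the item's data in a
# NamedTemporaryFile and keep the file object on the item; once nothing
# references the item any more, CPython closes the file object and the
# file disappears from disk automatically (delete=True is the default).
import os
import tempfile

class StaticItem:  # hypothetical stand-in for libzim's StaticItem
    def __init__(self, path, fileobj):
        self.path = path
        self.fileobj = fileobj  # keeps the temp file alive with the item

tmp = tempfile.NamedTemporaryFile(delete=True)
tmp.write(b"downloaded video bytes")
tmp.flush()
tmp_name = tmp.name

item = StaticItem(path="videos/foo", fileobj=tmp)
del tmp  # the item still holds the file object, so the file survives
assert os.path.exists(tmp_name)

del item  # last reference gone: the file is closed and removed
assert not os.path.exists(tmp_name)
```

Note that the immediate cleanup on `del item` depends on CPython's refcounting; on other interpreters the file may linger until the GC runs.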


It is made this way to allow the libzim/scraper to be efficient by downloading the data in the worker threads. But nothing prevents you from downloading the data first, storing the data in the item (using a temp file object is one way to store data "in" the item), and then passing the item to the creator. Then only the compression and indexing would be done in separate threads; getting the data would still happen in the main thread.
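The "download first" pattern might be sketched like this (all class and function names are hypothetical stand-ins, not the libzim API): the data is fetched in the main thread and stored on the item, so only compression/indexing work is left for the worker threads, and a fresh provider can be built over the same bytes as many times as needed (e.g. once more for indexing).

```python
# Hypothetical stand-ins illustrating the "download first" pattern:
# fetch the content up front in the main thread, keep the bytes on the
# item, and let the provider simply replay them later.
class BytesProvider:
    """One-shot content provider: its feed() generator is consumed once."""
    def __init__(self, data):
        self.data = data

    def feed(self):
        yield self.data  # a real provider would yield fixed-size blobs

class PreloadedItem:
    def __init__(self, path, data):
        self.path = path
        self.data = data  # downloaded in the main thread, stored "in" the item

    def get_contentprovider(self):
        # May be called more than once (e.g. again for indexing); each
        # call builds a fresh provider over the same stored bytes.
        return BytesProvider(self.data)

def download(url):  # stand-in for the real download step
    return b"content of " + url.encode()

item = PreloadedItem("videos/foo", download("https://example.com/foo"))
provider = item.get_contentprovider()
assert b"".join(provider.feed()) == b"content of https://example.com/foo"
```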

rgaudin commented 3 years ago

OK, thanks for all the details.

I guess the disk-usage issue is non-existent then: we don't have to keep all the resources until finish(); we just can't manually control when they are released.

Thank you for mentioning the indexData behavior. I think we'll have to bring that up in the README, as it also has an impact on the implementation design. It could have made sense to retrieve the data in the ContentProvider if it were discarded after use, but for indexable content that would actually mean downloading the data twice.

kelson42 commented 3 years ago

@rgaudin @mgautierfr I guess this is pretty clear now what to do, reassigning to reg.