openzim / python-libzim

Libzim binding for Python: read/write ZIM files in Python
https://pypi.org/project/libzim/
GNU General Public License v3.0

When is content added to ZIM? #97

Closed rgaudin closed 3 years ago

rgaudin commented 3 years ago

@mgautierfr @kelson42 it seems that when creating an Item, the feed() method of its ContentProvider is only called within finishZimCreation() and not when calling add_item().

With such behavior, we cannot use the typical scraper flow:

with Creator(path) as c:
    for url in resources:
        fpath = download_video(url)
        post_process(fpath)
        c.add_item(StaticItem(path=fpath.name, filepath=fpath))
        os.unlink(fpath)  # would remove the resource and thus crash the finish step

Instead, we are forced to move all of our scraper's logic inside an Item/ContentProvider subclass, which comes with its own issues.

Note: these constraints arise because we want to get rid of the resource data as we go, in order to save disk space.

mgautierfr commented 3 years ago

Yes, it is a consequence of using multiple threads to get/compress the content. The item's content is added to the ZIM when a free thread handles the associated task. Items are not specifically added when we call finishZimCreation (but finishZimCreation does ensure/wait that all tasks are handled).
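The deferred behavior described here can be mimicked with a plain thread pool (a sketch, not libzim code): submitting a task only queues it, a worker thread handles it later, and the finish step is simply a wait for all queued tasks to complete.

```python
# Sketch (not libzim code): add_item() only queues work; the actual
# processing happens later on a worker thread, and the equivalent of
# finishZimCreation() waits for every queued task.
from concurrent.futures import ThreadPoolExecutor
import threading

processed = []
lock = threading.Lock()

def handle_task(name):
    # Stand-in for compressing/writing one item's content.
    with lock:
        processed.append(name)

pool = ThreadPoolExecutor(max_workers=2)
futures = [pool.submit(handle_task, f"item-{i}") for i in range(5)]

# Equivalent of finishZimCreation(): block until all tasks are handled.
pool.shutdown(wait=True)
```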

A ContentProvider is used only once. Once it has finished feeding the content (the generator ends), it can remove the content; it will never be asked to feed again. (But note that several ContentProviders may be created for the same item.)

What is done synchronously when you add an item (at least for now):

Then, later, on another thread:

When tasks are handled, references are dropped. In Python, that means the refcount may reach zero and Python's GC will delete the object. You may use a tempfile (https://docs.python.org/3/library/tempfile.html) to get a file object and store that file object in the item. When the item is garbage-collected, the file will be removed automatically.
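A minimal sketch of that tempfile trick (the StaticItem class here is a hypothetical stand-in, not the libzim one), relying on CPython's reference counting to close, and thereby delete, the temp file as soon as the last reference to the item is dropped:

```python
# Sketch of the tempfile trick: store the item's data in a
# NamedTemporaryFile and keep the file object on the item; once nothing
# references the item any more, CPython closes the file object and the
# file disappears from disk automatically (delete=True is the default).
import os
import tempfile

class StaticItem:  # hypothetical stand-in for libzim's StaticItem
    def __init__(self, path, fileobj):
        self.path = path
        self.fileobj = fileobj  # keeps the temp file alive with the item

tmp = tempfile.NamedTemporaryFile(delete=True)
tmp.write(b"downloaded video bytes")
tmp.flush()
tmp_name = tmp.name

item = StaticItem(path="videos/foo", fileobj=tmp)
del tmp  # the item still holds the file object, so the file survives
assert os.path.exists(tmp_name)

del item  # last reference gone: the file is closed and removed
assert not os.path.exists(tmp_name)
```

Note that the immediate cleanup on `del item` depends on CPython's refcounting; on other interpreters the file may linger until the GC runs.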


It is made this way to allow the libzim/scraper to be efficient by downloading the data in the worker threads. But nothing prevents you from downloading the data first, storing the data in the item (using a temp file object is one way to store data "in" the item), and then passing the item to the creator. Then only the compression and indexing would be done in separate threads; getting the data would still happen in the main thread.
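The "download first" pattern might be sketched like this (all class and function names are hypothetical stand-ins, not the libzim API): the data is fetched in the main thread and stored on the item, so only compression/indexing work is left for the worker threads, and a fresh provider can be built over the same bytes as many times as needed (e.g. once more for indexing).

```python
# Hypothetical stand-ins illustrating the "download first" pattern:
# fetch the content up front in the main thread, keep the bytes on the
# item, and let the provider simply replay them later.
class BytesProvider:
    """One-shot content provider: its feed() generator is consumed once."""
    def __init__(self, data):
        self.data = data

    def feed(self):
        yield self.data  # a real provider would yield fixed-size blobs

class PreloadedItem:
    def __init__(self, path, data):
        self.path = path
        self.data = data  # downloaded in the main thread, stored "in" the item

    def get_contentprovider(self):
        # May be called more than once (e.g. again for indexing); each
        # call builds a fresh provider over the same stored bytes.
        return BytesProvider(self.data)

def download(url):  # stand-in for the real download step
    return b"content of " + url.encode()

item = PreloadedItem("videos/foo", download("https://example.com/foo"))
provider = item.get_contentprovider()
assert b"".join(provider.feed()) == b"content of https://example.com/foo"
```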

rgaudin commented 3 years ago

OK, thanks for all the details.

I guess the disk-usage issue is non-existent then: we don't have to keep all the resources until finish(); we just can't manually control when they are released.

Thank you for mentioning the indexData behavior. I think we'll have to bring that up in the README, as it also has an impact on the implementation design. It could have made sense to retrieve the data in the ContentProvider if it were discarded after use, but for indexable content that would actually mean downloading the data twice.

kelson42 commented 3 years ago

@rgaudin @mgautierfr I guess this is pretty clear now what to do, reassigning to reg.