openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

Support for enabling full-text index and compression on demand #88

Closed parvit closed 1 year ago

parvit commented 1 year ago

This PR responds to issue openzim/sotoki/issues/243.

Disables full-text indexing and compression by default so that their memory cost is only paid if requested (it can be an issue with big sites).

rgaudin commented 1 year ago

As explained in https://github.com/openzim/sotoki/issues/243#issuecomment-1194314604 I believe this is not needed as it is possible to disable it. Please confirm and we'll close this PR.

rgaudin commented 1 year ago

Checked; works as expected.

parvit commented 1 year ago

Very well, just know however that if you run the program under memray you'll still see the associated memory cost even when disabling it (because the defaults of the library still have effect).

rgaudin commented 1 year ago

> Very well, just know however that if you run the program under memray you'll still see the associated memory cost even when disabling it (because the defaults of the library still have effect).

Compression and indexing are both handled by libzim and only take effect after the call to start(). The defaults only call those config_* methods, which can be called several times before start().

As stated above, I've tested that it works as expected: by calling .config_compression and .config_indexing to disable both I end up with an uncompressed ZIM that does not include the full-text index.
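To illustrate the point above, here is a minimal sketch of the "defaults go through config_* and can be overridden before start()" pattern. `BaseCreator` and `ScraperCreator` are hypothetical stand-ins written for this example, not the real `libzim.writer.Creator` or `zimscraperlib.zim.creator.Creator`; the attribute names, the `"zstd"` default value, and the error raised after `start()` are all assumptions.

```python
class BaseCreator:
    """Hypothetical stand-in for libzim.writer.Creator (not the real class)."""

    def __init__(self):
        self.compression = None
        self.indexing = False
        self.started = False

    def config_compression(self, compression):
        # Assumption: configuration is only accepted before start().
        if self.started:
            raise RuntimeError("cannot configure after start()")
        self.compression = compression

    def config_indexing(self, indexing, language=""):
        if self.started:
            raise RuntimeError("cannot configure after start()")
        self.indexing = indexing
        self.language = language

    def start(self):
        # Nothing is compressed or indexed until this point.
        self.started = True
        return self


class ScraperCreator(BaseCreator):
    """Hypothetical stand-in for zimscraperlib.zim.creator.Creator."""

    def __init__(self, language="eng"):
        super().__init__()
        # Defaults are applied through the same config_* methods.
        self.config_compression("zstd")
        self.config_indexing(True, language)


creator = ScraperCreator()
# Override the defaults before start(); the last call wins.
creator.config_compression(None)
creator.config_indexing(False)
creator.start()

print(creator.compression, creator.indexing)
```

In this sketch, a creator that is started without the two override calls keeps the defaults, which mirrors the behavior discussed in this thread: the options have to be disabled explicitly.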

parvit commented 1 year ago

Yes, I did not realize that you had tested by calling the methods directly.

The point I wanted to convey is that unless you explicitly disable both options you will still pay the cost; simply not calling the methods at all is not enough.

rgaudin commented 1 year ago

That's right. As @kelson42 said somewhere, this is the wanted behavior for 99.9% of our users. Compression and indexing should not have a significant impact on memory, and if they do it may be a bug in libzim. I'll get to testing sotoki without both in the coming days so we can have a way forward.

I think the key here is that zimscraperlib.zim.creator.Creator inherits from libzim.writer.Creator, so the API might seem smaller than it actually is.
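As a rough illustration of why an inherited API can look smaller than it is, the sketch below lists which public methods a subclass defines itself and which it only inherits. Both classes are hypothetical placeholders with empty method bodies, not the real libzim or zimscraperlib classes, and the method names are chosen only for the example.

```python
import inspect


class BaseCreator:
    """Hypothetical parent class standing in for libzim.writer.Creator."""

    def config_compression(self, compression): ...

    def config_indexing(self, indexing, language=""): ...

    def start(self): ...


class ScraperCreator(BaseCreator):
    """Hypothetical subclass standing in for zimscraperlib's Creator."""

    def add_item_for(self, path): ...


def public_methods(cls):
    # All public functions reachable on the class, inherited ones included.
    return {
        name
        for name, _ in inspect.getmembers(cls, inspect.isfunction)
        if not name.startswith("_")
    }


# Methods written in the subclass body itself.
own = {name for name in vars(ScraperCreator) if not name.startswith("_")}
# Methods that are only available through inheritance.
inherited = public_methods(ScraperCreator) - own

print(sorted(own))
print(sorted(inherited))
```

Reading only the subclass source would show one method here, while the inherited config_* methods and start() are equally callable on it.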