kiwix / operations

Kiwix Kubernetes Cluster
http://charts.k8s.kiwix.org/

Cache ZIM metadata on library-gen? #209

Open rgaudin opened 6 days ago

rgaudin commented 6 days ago

Currently, the library generator script which is used both for library and dev-library (different source folders) spends most of its time reading metadata from ZIM files on the filesystem.

On library, this is ~6,800 files. A run can complete within ~6 minutes, but if the disk is busy (reminder: the server uses mechanical drives), it can take 3 hours.

This script is run every 30 minutes on library and every 10 minutes on dev-library.

While this will all be obsolete once the CMS takes over, a quick and easy improvement would be to cache this information and only read metadata for new files. It's actually already cached (in the previously written library XML), so it's just a matter of skipping the read and reusing the data for existing entries.

The only drawback is that it won't update the metadata of a file that has been overwritten, but that's already a scenario we've excluded, and we could implement a simple file-flag that triggers a full re-read if present.
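To make the idea concrete, here is a minimal sketch of the reuse-plus-flag logic, not the actual library generator. It assumes the previous library XML holds `<book>` elements keyed by a `path` attribute relative to the ZIM folder; `read_metadata` is a hypothetical callable standing in for whatever currently reads a ZIM file, and the `FORCE_REREAD` flag name is made up for illustration.

```python
import os
import xml.etree.ElementTree as ET
from pathlib import Path


def load_cached_entries(library_xml):
    """Map each book's path to its attributes from the previous library XML."""
    if not os.path.exists(library_xml):
        return {}
    root = ET.parse(library_xml).getroot()
    return {
        book.get("path"): dict(book.attrib)
        for book in root.iter("book")
        if book.get("path")
    }


def build_entries(zim_dir, library_xml, read_metadata, flag_name="FORCE_REREAD"):
    """Return (entries, reads): entries by relative path, and ZIM reads performed.

    read_metadata is a hypothetical callable taking a Path and returning a
    dict of attributes; flag_name is an assumed name for the file-flag.
    """
    zim_dir = Path(zim_dir)
    flag = zim_dir / flag_name
    force = flag.exists()
    # An empty cache means every file gets re-read.
    cache = {} if force else load_cached_entries(library_xml)

    entries = {}
    reads = 0
    for zim in sorted(zim_dir.rglob("*.zim")):
        key = str(zim.relative_to(zim_dir))
        if key in cache:
            # Known file: reuse what the previous run already wrote.
            entries[key] = cache[key]
        else:
            # New file (or forced run): pay the cost of opening the ZIM.
            entries[key] = {"path": key, **read_metadata(zim)}
            reads += 1

    if force:
        flag.unlink()  # consume the flag so the next run uses the cache again
    return entries, reads
```

With this shape, a regular run only touches the disk for files missing from the previous XML, and dropping the flag file into the folder forces one full re-read.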

kelson42 commented 6 days ago

I don't think we should do anything in the meantime (before the CMS is published), but otherwise I would recommend saving in the library a kind of publishing date (see this comment: https://github.com/kiwix/libkiwix/issues/702#issuecomment-1030867364), which would be the same as the ZIM file's last modified date. Based on the comparison, I would use the last library.xml as a cache if the file has not been renewed.
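As a rough illustration of that comparison only: the sketch below assumes the cached entry stores the date as integer epoch seconds under a placeholder attribute name, `publish_date`. Neither the name nor the format is defined by libkiwix; both are stand-ins until the linked ticket settles them.

```python
import os


def can_reuse(cached_attrs, zim_path, date_attr="publish_date"):
    """True if the cached date matches the ZIM file's last modified time.

    date_attr is a placeholder attribute name, and integer epoch seconds
    is an assumed storage format; neither comes from libkiwix.
    """
    stored = cached_attrs.get(date_attr)
    if stored is None:
        return False  # entry predates the attribute: read the file again
    try:
        return int(stored) == int(os.stat(zim_path).st_mtime)
    except (ValueError, OSError):
        # Unparseable date or missing file: treat as not reusable.
        return False
```

A renewed file gets a new modification time, so its cached entry is rejected and its metadata is read again, while untouched files cost a single `stat` call each.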

rgaudin commented 6 days ago

Yes, the problem is that the XML file is public, so we should not come up with anything ourselves and should wait for that libkiwix ticket first…

Let's keep that ticket open as an option until the CMS arrives or something else pressures us to do it.