kimbauters / ZIMply

An easy to use offline reader for ZIM files right in your browser!

The `ZIMClient.random_article` function also returns media files with new zim files #26

Open dylanmccall opened 2 years ago

dylanmccall commented 2 years ago

With new-style zim files, both articles and their assets appear in the same "C" namespace:

https://www.openzim.org/wiki/ZIM_file_format#Namespaces

`ZIMClient.random_article` chooses a random index from the "C" namespace, assuming that every entry in that namespace is an article. With new-style zim files it will therefore often return an image, for example, instead of an article.

The issue is particularly prominent with this zim file, which contains a very large number of images: https://download.kiwix.org/zim/.hidden/endless/wikihow_en_endless_holidays-and-traditions_2021-12.zim

For reference, here is how this is implemented in libzim: https://github.com/openzim/libzim/blob/master/src/archive.cpp#L267-L284. It looks like we would need to make use of the title index.
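
To make the idea concrete, here is a rough sketch in Python of two ways a random pick could avoid media entries. It is not ZIMply's or libzim's actual code: `Entry`, `pick_random_article`, and the `article_indices` parameter are hypothetical names, and the sketch assumes that a title index can be reduced to a list of positions that point only at articles. When no such list is available, it falls back to retrying until it draws an HTML entry.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Entry:
    """Hypothetical stand-in for a directory entry (not ZIMply's real type)."""
    url: str
    mimetype: str


def pick_random_article(
    entries: Sequence[Entry],
    article_indices: Optional[Sequence[int]] = None,
    rng: Optional[random.Random] = None,
    max_attempts: int = 100,
) -> Optional[Entry]:
    """Return a random article from a namespace that mixes articles and assets.

    article_indices is an assumed, pre-parsed list of positions in `entries`
    that are known to be articles (for example, derived from a title index).
    """
    if rng is None:
        rng = random.Random()

    # Preferred path: sample only from positions known to be articles.
    if article_indices:
        return entries[rng.choice(article_indices)]

    if not entries:
        return None

    # Fallback: rejection sampling on the mimetype. This can be slow when
    # articles are rare, so give up after max_attempts draws.
    for _ in range(max_attempts):
        candidate = rng.choice(entries)
        if candidate.mimetype.startswith("text/html"):
            return candidate
    return None
```

The fallback keeps working on files without a usable index, but for an image-heavy file like the one above it may need many draws, which is why sampling from an article-only list looks like the better option.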

kimbauters commented 2 years ago

Sorry for the delay in getting back to you on this. Since this is only a convenience feature, it is probably not something I will look at in the short term, especially as it is only part of the ZIMply core and not of the ZIMply server.

Your references to libzim definitely help if anyone wants to suggest a fix. I had a (brief) look but couldn't immediately identify how to resolve the issue. My C++ knowledge is way too Rust-y so others may have better insights.