openzim / python-libzim

Libzim binding for Python: read/write ZIM files in Python
https://pypi.org/project/libzim/
GNU General Public License v3.0

[PRIO] Tags Metadata truncated #139

Closed rgaudin closed 2 years ago

rgaudin commented 2 years ago

Here's an example with this small ZIM: gutenberg_he_all_2022-04.zim

In [1]: from libzim.reader import Archive

In [2]: zim = Archive("/Users/reg/Downloads/gutenberg_he_all_2022-04.zim")

In [3]: zim.get_metadata("Tags")
Out[3]: b'_category:gutenberg;gutenberg;_ftindex:yes;_ftindex'

Now, using libkiwix:

kiwix-manage xxx.xml add gutenberg_he_all_2022-04.zim

And here's the (formatted, favicon removed) output

<library version="20110515">
  <book id="f4798cdd-47c2-b247-6493-d0e2f656fdc4"
    path="gutenberg_he_all_2022-04.zim"
    title="Project Gutenberg Library (he)"
    description="The first producer of free ebooks"
    language="heb"
    creator="gutenberg.org"
    publisher="Kiwix"
    name="gutenberg_he_all_2022-04"
    tags="_category:gutenberg;gutenberg;_ftindex:yes;_ftindex:yes;_pictures:yes;_videos:yes;_details:yes"
    faviconMimeType="image/png"
    favicon="[snip]"
    date="2022-04-21"
    articleCount="22"
    mediaCount="41"
    size="4705" />
</library>

You'll notice that _ftindex:yes is repeated, but AFAIK libzim doesn't care about the content of metadata…
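To make the mismatch concrete, here is a minimal sketch that compares the two values quoted above as multisets of tags. The strings are copied from the outputs in this comment; the variable and function names are only illustrative.

```python
from collections import Counter

# Value returned by zim.get_metadata("Tags") above (bytes, decoded here).
zim_tags = b"_category:gutenberg;gutenberg;_ftindex:yes;_ftindex".decode("utf-8")

# Value of the tags attribute in the kiwix-manage XML output above.
library_tags = (
    "_category:gutenberg;gutenberg;_ftindex:yes;_ftindex:yes;"
    "_pictures:yes;_videos:yes;_details:yes"
)


def tag_counts(value):
    """Split a ';'-separated Tags value and count each tag."""
    return Counter(tag for tag in value.split(";") if tag)


# Counter subtraction keeps only positive counts, so each result lists
# the tags that one side has in excess of the other.
only_in_library = tag_counts(library_tags) - tag_counts(zim_tags)
only_in_zim = tag_counts(zim_tags) - tag_counts(library_tags)

print("only in library XML:", sorted(only_in_library.items()))
print("only in ZIM metadata:", sorted(only_in_zim.items()))
```

Counting tags instead of using plain sets keeps the repeated _ftindex:yes visible in the comparison.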

@mgautierfr please take a look; this is accidentally blocking a lot of stuff on my side.

kelson42 commented 2 years ago

@mgautierfr It is really important to release the next version, 1.1.0, soon with a fix for this. The stability and performance of library.kiwix.org depend on it.

mgautierfr commented 2 years ago

I'm not sure what you're complaining about exactly:

How is it a blocker for you?

rgaudin commented 2 years ago

Ah! I didn't know that libkiwix added those tags. This solves this frightening mystery.

How is it a blocker? The central XML library used to be generated using kiwix-manage. It is now generated by a pylibzim-based script but we had a lot of different entries for the same content.

I imagine some readers may use those tags so I'll port that feature to the script (in scraperlib I suppose).
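A rough sketch of what that port could look like, inferred only from the two outputs earlier in this thread and not taken from libkiwix's source: a bare _ftindex becomes _ftindex:yes, and _pictures, _videos and _details default to yes when no tag with that key is present. The function name `normalize_tags` is hypothetical, and libkiwix may apply further rules that are not visible in this example.

```python
def normalize_tags(raw_tags):
    """Expand a ';'-separated Tags value the way the XML output above suggests.

    Assumed rules (inferred from this thread only):
      * a bare "_ftindex" tag is rewritten to "_ftindex:yes"
      * "_pictures", "_videos" and "_details" get a ":yes" default
        when no tag with that key is present
    """
    tags = [tag for tag in raw_tags.split(";") if tag]

    # Rewrite the old-style bare tag, keeping order and duplicates as-is.
    normalized = ["_ftindex:yes" if tag == "_ftindex" else tag for tag in tags]

    # Append defaults only for keys that are not specified at all.
    for key in ("_pictures", "_videos", "_details"):
        if not any(tag.startswith(key + ":") for tag in normalized):
            normalized.append(key + ":yes")

    return ";".join(normalized)


raw = "_category:gutenberg;gutenberg;_ftindex:yes;_ftindex"
print(normalize_tags(raw))
```

Applied to the Tags value read with python-libzim above, this reproduces the tags attribute that kiwix-manage wrote, including the repeated _ftindex:yes.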

Thanks for the answer; we knew it would be something obvious, but I didn't expect this 😉

kelson42 commented 2 years ago

Zimdump would be a better tool for inspecting a ZIM.

mgautierfr commented 2 years ago

BTW, you probably have a bug in the creator/scraper, as you don't put the right Tags in the ZIM file.

rgaudin commented 2 years ago

Yes, I believe most non-mwoffliner scrapers don't specify all of those. I'll check all of them. We usually don't have flavours/filters but the ftindex tag might be missing.
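For that check, a small helper along these lines could report which tag keys a scraper left out. It is only a sketch: `missing_tag_keys` is a hypothetical name, and the default list of expected keys is simply the set of keys seen in the kiwix-manage output earlier in this thread, not an official list.

```python
# Keys seen in the kiwix-manage output in this thread (assumed, not official).
EXPECTED_KEYS = ("_ftindex", "_pictures", "_videos", "_details")


def missing_tag_keys(tags_value, expected=EXPECTED_KEYS):
    """Return the expected keys that no tag in a ';'-separated value provides."""
    # The key is the part before the first ':'; a bare tag is its own key.
    present = {tag.split(":", 1)[0] for tag in tags_value.split(";") if tag}
    return [key for key in expected if key not in present]


# Tags value of the Gutenberg ZIM from the first comment.
print(missing_tag_keys("_category:gutenberg;gutenberg;_ftindex:yes;_ftindex"))
```

Feeding it the decoded result of zim.get_metadata("Tags") for each scraper's output would show which scrapers need updating.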