openzim / ted

Provide the best of TED.com for offline usage!
https://download.kiwix.org/zim/ted/
GNU General Public License v3.0

Multilanguage ZIMs do not seem to be handled properly #180

kelson42 closed 3 months ago

kelson42 commented 3 months ago

Recent TED file: image

It cannot be found when the English language filter is applied: image

... although it should be possible to find it.

The "MUL" at the bottom left also seems to be a hint that something is wrong.

rgaudin commented 3 months ago

This is clearly a scraper/ZIM issue:

❯ curl https://dev.library.kiwix.org/raw/ted_mul_capitalism_2024-03/meta/Language
mul

But I believe this should no longer be possible since #170, which isn't perfect (but should not write mul), and it will be correct with #171. I believe this run isn't using those changes, since there hasn't been a release since then. @benoit74, can you check that I'm correct and close?

benoit74 commented 3 months ago

Sorry for the slow feedback, I wanted to check everything before making a wrong statement.

This is indeed mostly covered already by #170 and #171, which are not yet released. Before them, the ZIM language metadata was ... crappy.

> curl https://library.kiwix.org/raw/ted_mul_capitalism_2024-03/meta/Scraper
ted2zim 2.1.0

With main, the Language metadata is now properly set to a CSV list of languages. However, it is not sorted properly: we only set eng as the first language if present. Definitely a hack, but it covers most (all?) of the ZIMs we produce.

The remaining part (sorting languages in the proper order) has to be covered by https://github.com/openzim/ted/issues/172
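
For illustration, the "eng first" behaviour described above could be sketched roughly as follows. This is not the actual ted2zim code: the function name `build_language_metadata` is hypothetical, and the deduplication step is an assumption added to keep the example self-contained. Input codes are assumed to be ISO 639-3 strings such as `eng` or `fra`.

```python
def build_language_metadata(codes):
    """Join language codes into a CSV value, moving 'eng' to the front if present.

    Hypothetical helper: a sketch of the hack described in this thread,
    not the scraper's real implementation.
    """
    # Drop duplicates while keeping first-seen order (assumption, not from the thread).
    unique = list(dict.fromkeys(codes))
    # The hack: only 'eng' is repositioned; all other codes keep their input order.
    if "eng" in unique:
        unique.remove("eng")
        unique.insert(0, "eng")
    return ",".join(unique)


print(build_language_metadata(["fra", "eng", "spa"]))
```

Under these assumptions, a ZIM containing English talks would list eng first rather than carrying a bare mul value, while the relative order of the other languages stays whatever the input order was, which is the part left to #172.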