openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Add support for multiple languages in `--lang` / ZIM metadata #300

Closed benoit74 closed 2 weeks ago

benoit74 commented 3 weeks ago

Currently, it is not possible to specify multiple languages in ZIM metadata.

We need the `--lang` parameter to accept a list of ~semi-colon ;~ comma , separated language codes.

Jaifroid commented 3 weeks ago

Do we have a way to know at ZIM-creation time exactly which languages are in the ZIM? I'm thinking it might be complicated to determine in heavily multilingual cases, and also to map whatever language code scheme the original website uses (e.g. BCP47, like de-AT = Austrian German) to the ISO-639-3 that is used for the language metadata, IIRC.

A copout, but maybe a practical one, would be to use mul in such cases, rather than going to the complication of making the language field a list.
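To illustrate the mapping concern, here is a minimal sketch, not anything warc2zim is known to do. The function name `bcp47_to_iso639_3` and the four-entry lookup table are hypothetical; a real tool would need the full ISO 639 registry, and the sketch assumes the primary subtag alone identifies the language.

```python
from typing import Optional

# Hypothetical, deliberately tiny lookup table (ISO 639-1 -> ISO 639-3).
# A real implementation would need the complete registry.
_ISO_639_1_TO_3 = {"de": "deu", "en": "eng", "fr": "fra", "es": "spa"}


def bcp47_to_iso639_3(tag: str) -> Optional[str]:
    """Map a BCP47-style tag such as 'de-AT' to an ISO-639-3 code, or None."""
    # Keep only the primary language subtag; region/script subtags are dropped.
    primary = tag.replace("_", "-").split("-")[0].lower()
    if len(primary) == 3 and primary.isalpha():
        # Assumption: a 3-letter primary subtag is already an ISO-639-3 code.
        return primary
    return _ISO_639_1_TO_3.get(primary)


print(bcp47_to_iso639_3("de-AT"))  # deu
```

Anything not in the table comes back as `None`, which is where the "complicated" part starts.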

rgaudin commented 3 weeks ago

@Jaifroid the goal of this metadata is to help the Catalog user filter/find/select content, not to exhaustively list the languages used. In that sense, mul is useless and it's not even allowed.

I don't think that automating this is a good idea. Scenarios for not specifying this are rare, and in that case I believe that the language of the seedUrl is probably a good trade-off.

Jaifroid commented 3 weeks ago

@rgaudin Ah, OK, that makes sense. We seem to have several ZIMs labelled _mul_ in the filename, which made me think it was an allowed language code here as well.

rgaudin commented 3 weeks ago

We do use mul in filenames, but the metadata is the list of Languages. We did use mul as a Language code in some ancient ZIMs. Also, we use mul in readers to refer to multi-language ZIMs, like in the language select box in kiwix-serve, but that's just UI; the metadata remains a list of actual codes.

Note that we are now discouraging the use of multiple languages per ZIM, as this proved to be rarely useful and a tempting shortcut to poor-quality ZIMs. The ability remains, but openZIM will create fewer of them for this reason, and because the rest of the metadata (title, description, tags) is always in a single language.

benoit74 commented 3 weeks ago

The goal of this issue is more to allow manually specifying the list of languages the editor expects to be present in the ZIM. It might not be totally exhaustive in some edge cases, but I don't believe much in the ability to automate this (we see that even the main language is usually not automatically retrieved, probably because detection is not that reliable).

If we agree that we prefer to avoid creating mul ZIMs as much as possible, then maybe not supporting this in the crawler is an interesting solution ^^

kelson42 commented 3 weeks ago

@benoit74 At this stage, this seems both a priority (for lowtech mag, for example) AND easy to implement.

benoit74 commented 2 weeks ago

Just edited the first comment to fix that the value is comma , separated, just like the Language metadata. It's the tags which are semi-colon ; separated.
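For reference, a minimal sketch of the two separators, assuming three-letter lowercase codes and rejecting mul as discussed above. The helper names `parse_lang` and `parse_tags` are hypothetical and are not taken from the warc2zim code base.

```python
import re

# Assumption: a valid code has the shape of an ISO-639-3 code (3 lowercase letters).
# This checks the shape only, not whether the code exists in the registry.
_CODE_SHAPE = re.compile(r"[a-z]{3}")


def parse_lang(value: str) -> list[str]:
    """Split a comma-separated --lang value into a de-duplicated list of codes."""
    codes: list[str] = []
    for raw in value.split(","):
        code = raw.strip()
        if not code:
            continue  # tolerate stray commas such as "eng,,fra"
        if not _CODE_SHAPE.fullmatch(code):
            raise ValueError(f"not a 3-letter language code: {code!r}")
        if code == "mul":
            raise ValueError("'mul' is not allowed; list the actual codes")
        if code not in codes:
            codes.append(code)  # keep first-seen order
    if not codes:
        raise ValueError("at least one language code is required")
    return codes


def parse_tags(value: str) -> list[str]:
    """Split a semi-colon-separated tags value, dropping empty entries."""
    return [tag.strip() for tag in value.split(";") if tag.strip()]


print(parse_lang("eng, fra"))       # ['eng', 'fra']
print(parse_tags("energy;lowtech"))  # ['energy', 'lowtech']
```

Keeping first-seen order is a choice made for this sketch, on the assumption that the first code is treated as the main language.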