Closed benoit74 closed 2 weeks ago
Do we have a way to know at ZIM-creation time exactly which languages are in the ZIM? I'm thinking it might be complicated to determine in heavily multilingual cases, and also to map whatever language code scheme the original website uses (e.g. BCP47, like de-AT = Austrian German) to the ISO-639-3 that is used for the language metadata, IIRC.
A copout, but maybe practical, would be to use mul as well in such cases, rather than going to the complication of making the language field a list.
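To make the mapping concern concrete, here is a minimal sketch of reducing a BCP47-style tag to a three-letter code. The function name `to_iso639_3` and the lookup table are hypothetical and cover only three languages for illustration; a real tool would rely on a complete ISO 639 dataset, and this is not how the crawler does it.

```python
# Illustrative subset only: ISO 639-1 (two-letter) to ISO 639-3 (three-letter).
ISO639_1_TO_3 = {"de": "deu", "en": "eng", "fr": "fra"}


def to_iso639_3(tag: str) -> str:
    """Reduce a BCP47-style tag such as 'de-AT' to a three-letter code.

    Hypothetical helper: region, script and variant subtags are dropped.
    """
    # The primary language subtag is the first element of the tag.
    primary = tag.replace("_", "-").split("-")[0].lower()
    if len(primary) == 3:
        # Assumption: a three-letter primary subtag is already an ISO 639-3 code.
        return primary
    # Raises KeyError for languages missing from the illustrative table.
    return ISO639_1_TO_3[primary]
```

With this sketch, `to_iso639_3("de-AT")` loses the Austrian region and yields the same code as plain German, which shows why the mapping is lossy.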
@Jaifroid the goal of this metadata is to help the Catalog user filter/find/select content, not to exhaustively list the languages used. In that sense, mul is useless and it's not even allowed.
I don't think that automating this is a good idea. Scenarios where this is not specified are rare, and in that case I believe that the language of the seedUrl is probably a good trade-off.
@rgaudin Ah, OK, that makes sense. We seem to have several ZIMs labelled _mul_ in the filename, which made me think it was an allowed language code here as well.
We do use mul in filenames but the metadata is the list of Languages. We did use mul as Language code in some ancient ZIMs.
Also, we use mul in readers to refer to multi-language ZIMs, like the language select box in kiwix-serve, but it's just UI; the metadata remains a list of actual codes.
Note that we are now discouraging the use of multiple languages per ZIM, as this proved to be rarely useful and a tempting shortcut to poor-quality ZIMs. The ability remains, but openZIM would create fewer of them for this reason and because the rest of the metadata (title, description, tags) is always in a single language.
The goal of this issue is more to allow manually specifying the list of languages the editor expects to be present in the ZIM. It might not be totally exhaustive in some edge cases, but I don't believe much in the ability to automate this (we see that even the main language is usually not automatically retrieved, probably because detection is not that reliable).
If we agree that we prefer to avoid creating mul ZIMs as much as possible, then maybe not supporting this in the crawler is an interesting solution ^^
@benoit74 At this stage, this seems a priority (for lowtech mag for example) AND easy to implement.
Just edited the first comment to fix that the value is comma (,) separated, just like the Language metadata. It's the tags which are semicolon (;) separated.
Currently, it is not possible to specify multiple languages in ZIM metadata.
We need support for the --lang parameter to be a list of ~semi-colon (;)~ comma (,) separated language codes.
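As a rough illustration of the request, the sketch below parses a comma-separated --lang value with Python's argparse. The helper name `parse_lang_list` and the three-letter check are assumptions made for this example, not the crawler's actual implementation.

```python
import argparse


def parse_lang_list(value: str) -> list[str]:
    """Split a comma-separated --lang value into a list of language codes."""
    # Tolerate spaces around commas and ignore empty entries ("eng,,fra").
    codes = [code.strip() for code in value.split(",") if code.strip()]
    if not codes:
        raise argparse.ArgumentTypeError("expected at least one language code")
    for code in codes:
        # Assumption: codes are three lowercase ASCII letters (ISO 639-3 style).
        if not (len(code) == 3 and code.isascii() and code.isalpha() and code.islower()):
            raise argparse.ArgumentTypeError(f"not a three-letter code: {code!r}")
    return codes


parser = argparse.ArgumentParser()
parser.add_argument("--lang", type=parse_lang_list, default=[])

# Parse an explicit argument list so the sketch runs without a command line.
args = parser.parse_args(["--lang", "eng,fra"])

# The Language metadata is itself comma separated, so the list joins back as-is.
language_metadata = ",".join(args.lang)
```

Because the check runs inside the argparse `type` callable, an invalid value such as `--lang en` is rejected at parse time with a usage error instead of producing bad metadata later.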