Correct Language codes in Gutenberg recipes

RavanJAltaie commented 4 months ago

For Gutenberg, we use the "one-language-one-zim" mode in Zimfarm. In this mode, the language is set automatically by the scraper. The scraper is evidently creating ZIMs with improper language codes, so this calls for an upstream issue in the Gutenberg scraper; it is not something you can solve yourself.

There are two issues:

- openZIM:gutenberg_mul_all is an improper ZIM name: mul is not a valid ISO-639-3 language code.
- openZIM:gutenberg_rmr_all is an improper ZIM name: rmr is no longer a valid ISO-639-3 language code. As of 2010-01-18, [rmr] for Caló is deprecated due to a split into Caló [rmq] and Erromintxela [emx].

Edit:

openZIM:gutenberg_mul_all:
- ZIM name is OK
- ZIM filename is OK
- ZIM language is KO because mul is not a valid ISO-639-3 language code; it must be a comma-separated list of ISO-639-3 codes sorted by importance (here, by number of entries)
openZIM:gutenberg_rmr_all:
- rmr is no longer a valid ISO-639-3 language code: as of 2010-01-18, [rmr] for Caló is deprecated due to a split into Caló [rmq] and Erromintxela [emx]
- ZIM name must be updated (to rmq probably)
- ZIM filename also
- ZIM language must be updated as well, could be rmq or rmq,emx
- might be solved upstream (Gutenberg)
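
To illustrate the "sorted by importance" requirement, here is a minimal Python sketch, not the scraper's actual code. It assumes a hypothetical input of one ISO-639-3 code per book; the function name `language_metadata` and the alphabetical tie-break are also assumptions made for the example.

```python
from collections import Counter


def language_metadata(book_languages):
    """Build a comma-separated Language value, most frequent code first.

    book_languages: iterable with one ISO-639-3 code per book
    (hypothetical input shape, not taken from the scraper).
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(book_languages)
    ordered = sorted(counts, key=lambda code: (-counts[code], code))
    return ",".join(ordered)


# Three English books, two French, one German.
print(language_metadata(["eng", "fra", "eng", "deu", "fra", "eng"]))  # eng,fra,deu
```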

eshellman commented 4 months ago

I can see about Caló (it's only one book) from upstream, but none of the others are language codes from PG, that I know of.

benoit74 commented 4 months ago

Thank you @eshellman. If you could fix rmr upstream, that would be great; otherwise we would have to add a "hack" to our scraper to transform rmr into rmq,emx (which is probably the real situation), or maybe only rmq.
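
For illustration only, such a "hack" could be as small as the Python sketch below. The names `DEPRECATED_CODES` and `normalize_language` are hypothetical, and mapping rmr to both rmq and emx is just one of the two options mentioned here; mapping to rmq alone would be a one-line change.

```python
# Hypothetical mapping from deprecated ISO-639-3 codes to their replacements.
# Whether rmr should become "rmq,emx" or only "rmq" is still undecided.
DEPRECATED_CODES = {
    "rmr": ["rmq", "emx"],
}


def normalize_language(code):
    """Return the replacement codes for a deprecated code, else the code itself."""
    return DEPRECATED_CODES.get(code, [code])


print(",".join(normalize_language("rmr")))  # rmq,emx
print(",".join(normalize_language("fra")))  # fra
```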

mul is a hack for the ZIM we create with all languages. To respect the openZIM specification, the scraper should not do that and should list all languages instead. This part is for us ^^

rgaudin commented 4 months ago

@benoit74 Languages metadata must be a list of ISO-639-3 sorted by importance (so number of entries here) but the Name metadata and the filename will keep the mul.
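
A rough Python sketch of that split, under stated assumptions: the helper `zim_identity` is hypothetical, the `gutenberg_<lang>_all` pattern is inferred from the two ZIM names in this issue, and using mul whenever there is more than one language is an assumption about the intended rule.

```python
def zim_identity(language_codes):
    """Return the Name and Language metadata for a ZIM.

    language_codes: ISO-639-3 codes already sorted by importance
    (hypothetical input; the sorting is assumed to happen elsewhere).
    """
    if not language_codes:
        raise ValueError("at least one language code is required")
    # Assumption: a multi-language ZIM keeps "mul" in its name only.
    name_lang = language_codes[0] if len(language_codes) == 1 else "mul"
    return {
        "Name": f"gutenberg_{name_lang}_all",
        "Language": ",".join(language_codes),
    }


print(zim_identity(["eng", "fra", "deu"]))
```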

benoit74 commented 4 months ago

> Languages metadata must be a list of ISO-639-3 sorted by importance (so number of entries here) but the Name metadata and the filename will keep the mul.

Yep, I had this in mind. Thank you for confirming before I even asked 😄

benoit74 commented 4 months ago

(and sorry for the wrong description in first comment, I wrote it too fast)

openzim / gutenberg

Correct Language codes in Gutenberg recipes #217