openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Better auto-detection of multilanguage content #187

Open kelson42 opened 1 year ago

kelson42 commented 1 year ago

Currently the ZIM "Language" metadata can only be filled automatically with a single language. Zimit checks the language of the welcome page and sets it, even if the other pages use other languages.

It would be better to check all the pages, gather the list of languages and then, at the end, set the "Language" metadata properly.
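As a rough illustration of the idea, here is a minimal sketch using only the Python standard library. It is not zimit's or warc2zim's actual code: the names `HtmlLangParser`, `page_language`, `main_languages` and the `min_share` threshold are all hypothetical. It reads the `lang` attribute of each page's `<html>` tag, counts the values, and keeps only languages declared by a minimum share of pages. Mapping these codes to whatever format the ZIM metadata expects is left out.

```python
from collections import Counter
from html.parser import HTMLParser


class HtmlLangParser(HTMLParser):
    """Records the lang attribute of the first <html> tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    # Keep the primary subtag only: "en-US" -> "en"
                    self.lang = value.split("-")[0].lower()


def page_language(html):
    """Return the declared language of one HTML page, or None."""
    parser = HtmlLangParser()
    parser.feed(html)
    return parser.lang


def main_languages(pages, min_share=0.1):
    """Return languages declared by at least min_share of the pages
    that declare a language, most frequent first."""
    counts = Counter(lang for lang in map(page_language, pages) if lang)
    total = sum(counts.values())
    return [lang for lang, n in counts.most_common() if n / total >= min_share]
```

The threshold is an assumed way to drop languages that appear on only a handful of pages. Pages without a `lang` attribute are simply ignored, and non-HTML entries are not covered at all.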

Follow-up to the comments on https://github.com/openzim/zimit/issues/186

rgaudin commented 1 year ago

I'm not sure about this. I think what you propose will decrease quality while we already have quality issues with zimit.

The goal of this metadata is to inform users about the main languages in use in the ZIM so they can filter it in or out. It's not a technical one like the Counter, which exhaustively lists all content types.

I'm afraid we'll often end up with several languages that are meaningless to the ZIM… It would also be time-consuming (parsing all HTML entries) and would only report the languages of HTML entries, not those of, say, PDF files.

It should be set manually because that's what's best. Even someone unfamiliar with the website can visit it and find out what the main languages are in under 30 seconds.

Now we have a shortcut that uses the main page's language because that's the most frequent use case.

I propose we make the language param mandatory and add special handling for a homepage value, which would use the homepage's language. We could even set homepage as the default value in youzim.it's form.

Independently of this, warc2zim should allow specifying multiple languages, which it doesn't at the moment.
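To make the proposal concrete, here is a minimal sketch of a mandatory language parameter with a special homepage value and support for several languages. It is an illustration only, not zimit's or warc2zim's real CLI: the `--lang` flag as defined here, the `HOMEPAGE` sentinel, the comma-separated format and the `detect_homepage_language` callback are all assumptions.

```python
import argparse

HOMEPAGE = "homepage"  # hypothetical sentinel: use the homepage's language


def build_parser():
    parser = argparse.ArgumentParser()
    # required=True makes argparse exit with an error when --lang is missing
    parser.add_argument(
        "--lang",
        required=True,
        help="comma-separated language codes, or 'homepage'",
    )
    return parser


def resolve_languages(value, detect_homepage_language):
    """Turn the --lang value into a list of language codes.

    detect_homepage_language is a caller-supplied function returning the
    homepage's language; how it detects it is outside this sketch.
    """
    if value.strip().lower() == HOMEPAGE:
        return [detect_homepage_language()]
    langs = [code.strip().lower() for code in value.split(",") if code.strip()]
    if not langs:
        raise ValueError("at least one language is required")
    return langs
```

With this shape, a form such as youzim.it's could pre-fill the field with homepage, while a user who knows the site can type the main languages explicitly.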