Open Popolechien opened 1 year ago
This is due to a ZIM with incorrect metadata in the library.
coopmaths.:
  description: "Ressources libres pour la personnalisation des apprentissages en
    math\xE9matiques"
  id: 2081f380-2241-91ea-3b9e-b17b9538cdd7
  langid: coopmaths.
  language: ''
  name: Coopmaths
  sha256sum: e63ca2d238464c7dff7f9f721614d823d2fcb7ee5803268dd8461d5b8f304a13
  size: 393975808
  sw: y
  type: zim
  url: http://download.kiwix.org/zim/zimit/coopmaths_2022-12.zim
  version: '2022-12-13'
Will check with zimit, as we should not end up with a ZIM without Language. Actually, I think the ZIM might have had an incorrect Language, and the library/ideascube gen script might have removed it because it was not correct.
It seems clear something has to be done somewhere. We should keep the ticket open as long as this is not clear.
A ticket should be opened immediately in Zimit/Warc2zim and probably fixed soon. But we should have other checks IMO. I wonder, for example, whether zimcheck detects this properly, etc.
I confirm the scraper does set the Language tag, but it would keep whatever the user provided even if it was an incorrect ISO-639-3 code. Fixed in https://github.com/openzim/warc2zim/commit/cd69c6737e85030f6a755c5479917aa134cb672c
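As a rough illustration of the kind of check discussed here (not the actual warc2zim fix), the sketch below rejects a Language value unless every comma-separated entry is a known three-letter code. `KNOWN_ISO_639_3` is a deliberately tiny, hypothetical stand-in for a full ISO 639-3 table, and `validate_language` is an illustrative name, not an existing function in any scraper.

```python
import re

# Hypothetical, deliberately tiny stand-in for a full ISO 639-3 table.
KNOWN_ISO_639_3 = {"eng", "fra", "deu", "spa", "mul"}

# ISO 639-3 codes are three lowercase ASCII letters.
_CODE_SHAPE = re.compile(r"^[a-z]{3}$")


def validate_language(value: str) -> list[str]:
    """Return the codes found in `value`, or raise ValueError.

    Assumes (as an illustration) that Language holds one or more
    comma-separated ISO 639-3 codes.
    """
    codes = [part.strip() for part in value.split(",")]
    for code in codes:
        if not _CODE_SHAPE.match(code):
            raise ValueError(f"not shaped like an ISO 639-3 code: {code!r}")
        if code not in KNOWN_ISO_639_3:
            raise ValueError(f"unknown ISO 639-3 code: {code!r}")
    return codes
```

With this approach, a value such as `coopmaths.` or an empty string fails loudly at creation time instead of being written into the ZIM.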
I am not sure whether all scrapers do this kind of check; zimwriterfs certainly does not. I am not sure how strictly we want to enforce this either. That's debatable.
One of the culprits here is library-to-offspot, which parses the code from the kiwix library (to convert it to ISO-639-1); there was an edge case that resulted in an empty string. Fixed in https://github.com/kiwix/k8s/commit/5eb83e58a56ca13ccdbfe4d18f4b591d207f4487
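The thread does not say what the edge case was, so the following is only a sketch of the general idea: a conversion that returns an explicit `None` for anything it cannot map, so callers cannot silently end up with an empty string. `ISO_639_3_TO_1` is a hypothetical partial mapping and `to_iso_639_1` is an illustrative name, not the library-to-offspot code.

```python
from typing import Optional

# Hypothetical partial mapping; a real tool would use a complete table.
ISO_639_3_TO_1 = {"eng": "en", "fra": "fr", "deu": "de", "spa": "es"}


def to_iso_639_1(catalog_language: str) -> Optional[str]:
    """Map the first ISO 639-3 code of a catalog value to ISO 639-1.

    Returns None (never an empty string) when no mapping is known,
    so the caller has to handle the failure explicitly.
    """
    first = catalog_language.split(",")[0].strip().lower()
    return ISO_639_3_TO_1.get(first)
```

A caller can then decide whether to skip the entry, fall back to a default, or abort, rather than writing `language: ''` into the generated file.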
We should definitely discuss the larger issue: how flexible should this metadata be? Can a user enter an incorrect Language metadata value, and what should our tools do about it? Keep in mind that Language codes can evolve (although this is very rare).
I don't think it's wise to invest time in hardening cardshop/hotspot given this will change with the OPDS switch. Being fed a valid catalog is a fair assumption IMO. Fixing the catalog would benefit multiple tools, so it makes more sense.
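For what a catalog-side check might look like, here is a minimal sketch that flags entries whose language is missing or blank. It assumes the catalog has already been loaded into a dict keyed like the dump above; `entries_without_language` is a hypothetical helper, and the `wikipedia.fr` entry is made up for contrast.

```python
def entries_without_language(catalog: dict) -> list[str]:
    """Return the keys of catalog entries whose language is missing or blank."""
    return [
        key
        for key, entry in catalog.items()
        if not str(entry.get("language") or "").strip()
    ]


catalog = {
    # Entry taken from the dump in this ticket (trimmed).
    "coopmaths.": {"name": "Coopmaths", "langid": "coopmaths.", "language": ""},
    # Hypothetical well-formed entry, for contrast only.
    "wikipedia.fr": {"name": "Wikipedia", "langid": "wikipedia.fr", "language": "fr"},
}

print(entries_without_language(catalog))
```

Running such a check when the catalog is generated would surface the Coopmaths entry before any downstream tool consumes it.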
Should we pause zimfarm recipe operations until we have formal, validated training for users, and maybe implement a couple of protections in the code?
Since mwoffliner is half-broken and zimit isn't really reliable, I'd suggest we limit new ZIM files to YouTube videos and the like.
Rule #1 would be, anyway, to point new recipes to dev and, only once everything has been validated as OK, move them to the actual repo. If the ZIM took a very long time to create, we can move the file from dev to prod once it is green-lighted.