openzim / cms

ZIM file Publishing Platform
https://cms.openzim.org
GNU General Public License v3.0

Don't push a zim file if the test failed / implement Quarantine #10

Open jmontleon opened 3 years ago

jmontleon commented 3 years ago

Problem

The task generated a garbage zim file. The test results in the log back this up, as does the size of ~1MB instead of the typical ~45MB. If the test failed, the file probably shouldn't be published or made available.

If I follow the link https://download.kiwix.org/zim/archlinux_en_all_maxi.zim this is what I get though. I guess kiwix is probably just looking at the latest available copy.

[INFO] Checking zim file /data/archlinux_en_all_nopic_2021-05.zim
[INFO] Verifying ZIM-file structure integrity...
[INFO] Avoiding redundant checksum test (already performed by the integrity check).
[INFO] Searching for metadata entries...
[INFO] Searching for Favicon...
[INFO] Searching for main page...
[INFO] Verifying Articles' content...
[INFO] Searching for redundant articles...
  Verifying Similar Articles for redundancies...
[ERROR] Invalid internal links found:
  The following links:
- Arch_Linux
(A/Arch_Linux) were not found in article A/Main_page
[INFO] Overall Test Status: Fail
[INFO] Total time taken by zimcheck: 0 seconds.
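A minimal sketch of the kind of gate requested here, assuming the uploader has the zimcheck log text and the file size at hand. It reads the "Overall Test Status" line shown in the log above and compares the size to a typical size for that title. The function names, the 0.5 size ratio, and the "Pass" status string are assumptions for illustration, not the receiver's actual code.

```python
import re

# Matches the summary line printed at the end of the zimcheck log above.
STATUS_RE = re.compile(r"^\[INFO\] Overall Test Status: (\w+)\s*$", re.MULTILINE)


def zimcheck_passed(log_text: str) -> bool:
    """Return True only if the log has an explicit passing status line.

    Assumption: a successful run reports "Pass"; anything else,
    including a missing status line, is treated as a failure.
    """
    match = STATUS_RE.search(log_text)
    return match is not None and match.group(1).lower() == "pass"


def size_looks_plausible(size_bytes: int, typical_bytes: int,
                         min_ratio: float = 0.5) -> bool:
    """Flag files much smaller than usual (min_ratio is a hypothetical default)."""
    return size_bytes >= typical_bytes * min_ratio


def should_publish(log_text: str, size_bytes: int, typical_bytes: int) -> bool:
    """Publish only if both the zimcheck status and the size check pass."""
    return zimcheck_passed(log_text) and size_looks_plausible(size_bytes, typical_bytes)
```

With the log above and a ~1MB file against a typical ~45MB, both checks would reject the file.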

Reproducing steps

This zim has broken occasionally before but it seems like a transient issue that usually gets resolved on the next build.

rgaudin commented 3 years ago

Thank you @jmontleon for your report. We actually have that in place already in our receiver code. See here, but as you can see here, it is currently disabled. We disabled it at some point because it was creating a bottleneck, but maybe it's OK to bring it back… @kelson42 ?

The ultimate goal is to stop relying on this. Zims will only be published by the CMS (to come) if they satisfy defined criteria (the zimcheck status from the zimfarm being one source).

kelson42 commented 3 years ago

@rgaudin @jmontleon The whole problem is indeed known and the plan is clear: we will develop the CMS (actually it should already be online if we were not late!). The CMS should check the zimcheck JSON output (available in the next release) and, based on a threshold, decide whether to let the file through the quarantine or not.
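A rough sketch of what such a threshold check could look like, for illustration only. The JSON layout used here (a top-level "logs" list whose entries carry a "level" field) is an assumption, since the zimcheck JSON output was not yet released at the time of this comment. The function name and default thresholds are hypothetical as well.

```python
import json
from collections import Counter


def quarantine_decision(report_json: str, max_errors: int = 0,
                        max_warnings: int = 10) -> str:
    """Return "publish" or "quarantine" for one zimcheck report.

    Assumed (hypothetical) report shape:
        {"logs": [{"level": "ERROR", "message": "..."}, ...]}
    """
    report = json.loads(report_json)
    # Count log entries per severity level; missing levels count as zero.
    levels = Counter(
        str(entry.get("level", "")).upper() for entry in report.get("logs", [])
    )
    if levels["ERROR"] > max_errors or levels["WARNING"] > max_warnings:
        return "quarantine"
    return "publish"
```

Keeping the thresholds as parameters would let them be tuned per title, since some errors may be tolerable for some content.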

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.