openzim / zimfarm

Farm operated by bots to grow and harvest new zim files
https://farm.openzim.org
GNU General Public License v3.0

Create sanity-check Zimfarm image #58

Closed. kelson42 closed this issue 5 years ago

kelson42 commented 5 years ago

From @Popolechien on November 4, 2018 9:20

I'm looking at http://library.kiwix.org/granbluefantasy_en_all_all_nopic_2018-10/ (the latest release of the Granblue Fantasy wiki) and it is obvious that a bunch of things are broken, rendering the file unusable and a waste of data/download time for users. It seems to be a rather recent addition to the library, so can we think of some simple confirmation/vetting process (a.k.a. quality control) before adding new ZIMs?

Copied from original issue: openzim/mwoffliner#422

kelson42 commented 5 years ago

@Popolechien I think I have a ticket for that somewhere... but we definitely need to set up a quality assurance system. The idea is to add this validation step once the files are uploaded to the warehouse.

@automactic The Docker container itself should be really simple: monitor a directory and check the new files with zim-check; if it returns no error, move the file to make it available for download. Otherwise: "to be defined".
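
A minimal sketch of the loop described above, not an actual Zimfarm implementation. The directory arguments and function names are hypothetical, and the command `zimcheck` is an assumption (the thread calls the tool zim-check), so the checker is passed in as a callable that can be swapped out.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, List


def run_zim_check(zim_path: Path) -> bool:
    """Return True when the checker exits with status 0.

    Assumption: the checker is on PATH as `zimcheck` and accepts the file
    path as its only argument; adjust the command for your installation.
    """
    result = subprocess.run(["zimcheck", str(zim_path)], capture_output=True)
    return result.returncode == 0


def process_incoming(
    incoming: Path,
    published: Path,
    check: Callable[[Path], bool] = run_zim_check,
) -> List[Path]:
    """Check every .zim file in `incoming` and move the ones that pass."""
    published.mkdir(parents=True, exist_ok=True)
    moved = []
    for zim_path in sorted(incoming.glob("*.zim")):
        if check(zim_path):
            target = published / zim_path.name
            shutil.move(str(zim_path), str(target))
            moved.append(target)
        # Failing files are left in place: the thread leaves that case
        # "to be defined".
    return moved
```

Calling `process_incoming` on a timer (or from a filesystem watcher) would give the "monitor a directory" behaviour; which of the two to use is left open here.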

automactic commented 5 years ago

@kelson42, the zimfarm warehouse is an SFTP server. It cannot do things like monitor a directory and test new files.

What might be a good idea is to introduce the concept of staging. The SFTP server moves files from the workers to staging, then a dedicated monitor kicks off testing jobs for new files in staging. Once the tests pass, the files are moved to production.

automactic commented 5 years ago

@kelson42 How are we planning to test ZIM files? For a situation like the one above, there doesn't seem to be an obvious way to automate the test.

Popolechien commented 5 years ago

@automactic We'll need to have a human step in there. For these wikis I'm also thinking of contacting the mods to ask if they'd be OK with us having a simplified landing page (like we already do for a few Wikipedias).

kelson42 commented 5 years ago

@Popolechien Because of the human-resource bottleneck, it seems quite unrealistic to me to have a human review many thousands of new ZIM files a month. On top of that, this is something that can be automated, so it is probably something we could and should do.

@automactic I was not talking about the "warehouse" container. IMO the warehouse container is fine for receiving the ZIM files from the distributed workers. Just make sure we have an easy way to tell from the filesystem whether a file is fully uploaded. We need that because, once a file is uploaded, the "sanity check" container (still to be built) will run zim-check against it and then move it to its final destination. In short, the warehouse and the sanity-check containers will share a Docker volume.
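
One possible way to tell whether a file on the shared volume is fully uploaded, sketched under assumptions: the thread does not specify a mechanism, so this uses a size-and-mtime stability heuristic. The function name and the `wait_seconds` default are hypothetical. A convention where the uploader renames the file on completion would likely be more robust, if the SFTP side can support it.

```python
import time
from pathlib import Path
from typing import Callable


def is_upload_complete(
    path: Path,
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Heuristic: the file is considered complete if its size and
    modification time are unchanged after `wait_seconds`."""
    before = path.stat()
    sleep(wait_seconds)
    after = path.stat()
    return (before.st_size, before.st_mtime_ns) == (after.st_size, after.st_mtime_ns)
```

The `sleep` parameter exists only so the wait can be replaced in tests; a slow or stalled upload can still fool this heuristic, which is why it is a sketch rather than a recommendation.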

Popolechien commented 5 years ago

@kelson42 Of course not every ZIM, just the new ones at their very first deployment. I don't know how much new content we publish yearly, but I'd be surprised if, at this stage, it's more than a handful.
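
To illustrate how "first deployment" could be detected automatically so that only those files go to a human, here is a rough sketch. It assumes file names follow the `<name>_<YYYY-MM>.zim` pattern suggested by the library URL earlier in the thread; that pattern and both function names are assumptions, not a documented convention.

```python
import re
from typing import Iterable, Optional

# Assumed naming pattern: <base>_<YYYY-MM>.zim
ZIM_NAME = re.compile(r"^(?P<base>.+)_(?P<period>\d{4}-\d{2})\.zim$")


def base_name(filename: str) -> Optional[str]:
    """Strip the trailing _YYYY-MM.zim part, or return None if it is absent."""
    match = ZIM_NAME.match(filename)
    return match.group("base") if match else None


def is_first_deployment(filename: str, published: Iterable[str]) -> bool:
    """True when no earlier release with the same base name was published."""
    base = base_name(filename)
    if base is None:
        # Unknown naming scheme: be conservative and ask for a review.
        return True
    return base not in {base_name(name) for name in published}
```

With this, a monthly refresh of an existing title would skip the manual step, while a brand-new title would be flagged for review.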

kelson42 commented 5 years ago

@Popolechien OK, then I already do that.

kelson42 commented 5 years ago

This will be handled outside the zimfarm project. See https://github.com/kiwix/maintenance/issues/30