openzim / zim-tools

Various ZIM command line tools
https://download.openzim.org/release/zim-tools/
GNU General Public License v3.0

What to test with zimcheck and how? #340

Open mgautierfr opened 1 year ago

mgautierfr commented 1 year ago

This issue was created so that the interesting discussion about what to test, started in #339, does not grow too big and cannibalize the PR review.

Initially from @rgaudin:

I have mixed feelings about this. On one hand, this mainly highlights the shortcomings of such an approach but on the other hand, simple checks are better than none.

Couple of comments (already identified):

Once again we fall short on setting clear goals for our tools. zimcheck's description is “zimcheck checks the quality of a ZIM file.” Does that mean that whenever zimcheck doesn't report an issue, the ZIM is guaranteed to be valid?

I join @kelson42 in thinking we want basic checks for now that we could extend in the future.

And that's it. The rest can be discussed and extended in separate tickets, raised by actual needs.

Although it serves a different purpose, scraperlib now (though not in use yet) enforces correct metadata with more elaborate checks (actual language code, proper PNG with correct sizes, etc.), so most of what we produce should be valid in this regard.
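To make the "proper PNG with correct sizes" kind of check concrete, here is a minimal sketch using only the Python standard library. It is not scraperlib's or zimcheck's actual code: the function names `png_dimensions` and `is_square_png` are hypothetical, and it only inspects the PNG signature and the IHDR header, not the rest of the file.

```python
import struct

# Fixed 8-byte signature that every PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes):
    """Return (width, height) if data starts like a PNG, else None.

    Hypothetical helper: reads only the signature and the IHDR chunk header.
    """
    # Signature (8) + chunk length (4) + chunk type (4) + width (4) + height (4)
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        return None
    # The first chunk of a PNG must be IHDR.
    if data[12:16] != b"IHDR":
        return None
    # Width and height are big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def is_square_png(data: bytes, size: int) -> bool:
    """True if data looks like a PNG of exactly size x size pixels."""
    return png_dimensions(data) == (size, size)
```

A caller could, for instance, pass the bytes of an illustration entry together with the expected size (say 48) and report a failure when `is_square_png` returns False.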

mgautierfr commented 1 year ago

Then by @holta:

Very thoughtful response from @rgaudin, and a big thank you to all :building_construction:

I strongly support organic and free-form metadata standards (what's needed are strong norms and strong guidelines not bureaucratic rules) that allow grassroots initiatives to collaborate & innovate efficiently.

In fact even semi-structured data sometimes has an extremely valuable place along the way — thereby empowering regional and specialized communities to build their own ZIM files, with the metadata that their region/profession/culture truly needs.

For this reason I very strongly support allowing "free-form metadata fields" that not only permit but encourage grassroots (not centralized) community innovation to truly flourish.

Then later on, as strong community norms are independently nurtured + demonstrated + proven year-by-year-by-year, the world should honor those great grassroots practices — as they become more official metadata standards.

Central authorities (Kiwix) should provide basic guardrails & guidelines of course, but that's sufficient :+1:

Thank you to everyone including @veloman-yunkan and @kelson42 and @mgautierfr working very hard on this critical question, helping it to evolve quickly in coming years, and every step of the way.

mgautierfr commented 1 year ago

This is an interesting question.

I mainly see two kinds of testing:

The first one groups together all tests that are technically mandatory. The exact definition is subject to discussion, but at first glance, I would say that a failing test in this category would make libzim raise an exception at some point. I can think of:

In the second group, I would put all other tests that may be good to have (for better quality) but not mandatory:

I would say that failures in the first group are errors, while failures in the second group are warnings. But nothing prevents us from having an option such as -Werror to treat all warnings as errors when the user wants to be pedantic (us in zimfarm, for instance).
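As a rough illustration of the error/warning split and of a -Werror-style switch, here is a small Python sketch. It is an assumption-laden model, not zimcheck's implementation: `Severity`, `Finding`, `exit_status` and the `warnings_as_errors` parameter are hypothetical names, and -Werror is only an option proposed in this comment.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"  # second group: good to have, not mandatory
    ERROR = "error"      # first group: technically mandatory


@dataclass(frozen=True)
class Finding:
    check: str           # hypothetical check name
    severity: Severity
    message: str


def exit_status(findings, warnings_as_errors=False):
    """Return 0 if the file passes, 1 otherwise.

    With warnings_as_errors=True (the proposed -Werror behaviour),
    any warning is enough to fail.
    """
    for finding in findings:
        if finding.severity is Severity.ERROR:
            return 1
        if warnings_as_errors and finding.severity is Severity.WARNING:
            return 1
    return 0
```

With this shape, a default run fails only on findings from the first group, while a pedantic caller passes `warnings_as_errors=True` and fails on anything reported.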