openzim / zim-tools

Various ZIM command line tools
https://download.openzim.org/release/zim-tools/
GNU General Public License v3.0
123 stars 34 forks source link

Centralized metadata constraints #339

Closed veloman-yunkan closed 1 year ago

veloman-yunkan commented 1 year ago

This PR addresses https://github.com/openzim/zim-tools/issues/334#issuecomment-1476307896

Fixes #334 Fixes #336

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 70.63% and project coverage change: +1.79 :tada:

Comparison is base (3dd2f59) 43.79% compared to head (d32037c) 45.58%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #339 +/- ## ========================================== + Coverage 43.79% 45.58% +1.79% ========================================== Files 23 26 +3 Lines 2393 2481 +88 Branches 1326 1380 +54 ========================================== + Hits 1048 1131 +83 - Misses 1329 1334 +5 Partials 16 16 ``` | [Impacted Files](https://codecov.io/gh/openzim/zim-tools/pull/339?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openzim) | Coverage Δ | | |---|---|---| | [src/zimcheck/checks.h](https://codecov.io/gh/openzim/zim-tools/pull/339?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openzim#diff-c3JjL3ppbWNoZWNrL2NoZWNrcy5o) | `100.00% <ø> (ø)` | | | [src/zimwriterfs/zimwriterfs.cpp](https://codecov.io/gh/openzim/zim-tools/pull/339?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openzim#diff-c3JjL3ppbXdyaXRlcmZzL3ppbXdyaXRlcmZzLmNwcA==) | `0.00% <0.00%> (ø)` | | | [src/metadata.h](https://codecov.io/gh/openzim/zim-tools/pull/339?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openzim#diff-c3JjL21ldGFkYXRhLmg=) | `50.00% <50.00%> (ø)` | | | [src/metadata.cpp](https://codecov.io/gh/openzim/zim-tools/pull/339?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openzim#diff-c3JjL21ldGFkYXRhLmNwcA==) | `98.70% <98.70%> (ø)` | | | [src/metadata\_constraints.cpp](https://codecov.io/gh/openzim/zim-tools/pull/339?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openzim#diff-c3JjL21ldGFkYXRhX2NvbnN0cmFpbnRzLmNwcA==) | `100.00% <100.00%> (ø)` | | | [src/zimcheck/checks.cpp](https://codecov.io/gh/openzim/zim-tools/pull/339?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openzim#diff-c3JjL3ppbWNoZWNrL2NoZWNrcy5jcHA=) | `96.56% <100.00%> (-0.01%)` | :arrow_down: | ... and [1 file with indirect coverage changes](https://codecov.io/gh/openzim/zim-tools/pull/339/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openzim)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

veloman-yunkan commented 1 year ago

Inviting @mgautierfr @kelson42 to review the overall approach for centralizing all metadata checks in one place. Please look at the file metadata_constraints.cpp (note that it is not compiled separately, but is rather #included in metadata.cpp.

Issues noticed so far:

kelson42 commented 1 year ago

regex may be a convenient tool as a lowest common denominator for validating various strings, however it falls short for some types of data if we need more rigorous checks. In our case, validating the date with a simple regex will allow entering invalid dates (e.g. 30th of February). This can be fixed by changing the last field in the metadata info table from a regexp to a special polymorphic validator object - a concrete implementation of thereof can be created from a regular expression. Do we want to do it now?

@veloman-yunkan Thank you very much for your PR which is, at a first sight, really much the kind of code I was expected.

Indeed regexs fall short in many cases. The case I have in mind is checking illustration entries (requires libpng).

I'm not in favour of extending much the validation scenarios because this can go endlessly and my goal is to have the basic ones.

But I would appreciate to have the appropriate code infrastructure so we can easily do that work later... without rearchitecturing effort.

kelson42 commented 1 year ago

On my local computer, it does not compile:

FAILED: test/metadata-test.p/.._src_metadata.cpp.o 
ccache c++ -Itest/metadata-test.p -Itest -I../test -I/usr/local/include -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wnon-virtual-dtor -Werror -std=c++11 -g -Werror -Wall '-DVERSION="3.1.3"' -DGTEST_HAS_PTHREAD=1 -pthread -MD -MQ test/metadata-test.p/.._src_metadata.cpp.o -MF test/metadata-test.p/.._src_metadata.cpp.o.d -o test/metadata-test.p/.._src_metadata.cpp.o -c ../src/metadata.cpp
../src/metadata.cpp: In static member function ‘static const zim::Metadata::ReservedMetadataRecord& zim::Metadata::getReservedMetadataRecord(const string&)’:
../src/metadata.cpp:100:14: error: ‘out_of_range’ is not a member of ‘std’
  100 |   throw std::out_of_range(name + " is not a reserved metadata name");
      |              ^~~~~~~~~~~~
../src/metadata.cpp:101:1: error: control reaches end of non-void function [-Werror=return-type]
  101 | }
      | ^
cc1plus: all warnings being treated as errors
[41/48] Compiling C++ object test/tools-test.p/.._src_zimwriterfs_zimcreatorfs.cpp.o
ninja: build stopped: subcommand failed.
rgaudin commented 1 year ago

I have mixed feelings about this. On one hand, this mainly highlights the shortcomings of such an approach but on the other hand, simple checks are better than none.

Couple of comments (already identified):

Once again we fall short on setting clear goals for our tools. zimcheck's description is “zimcheck checks the quality of a ZIM file.”. Does that mean that whenever zimcheck doesn't report an issue, the ZIM is guaranteed to be valid?

I join @kelson42 in thinking we want basic checks for now that we could extend in the future.

And that's it. The rest can be discussed and extended in separate tickets, raised by actual needs.

Although it serves a different purpose, scraperlib now (not being used yet) enforces correct metadata with more elaborate checks (actual language code, proper PNG with correct sizes, etc) so most of what we produce shall be valid in this regard.

holta commented 1 year ago

Very thoughtful response from @rgaudin, and a big thank you to all 🏗️

I strongly support organic and free-form metadata standards (what's needed are strong norms and strong guidelines not bureaucratic rules) that allow grassroots initiatives to collaborate & innovate efficiently.

In fact even semi-structured data sometimes has an extremely valuable place along the way — thereby empowering regional and specialized communities to build their own ZIM files, with the metadata that their region/profession/culture truly needs.

For this reason I very strongly support allowing "free-form metadata fields" that not only permit but encourage grassroots (not centralized) community innovation to truly flourish.

Then later on, as strong community norms are independently nurtured + demonstrated + proven year-by-year-by-year, the world should honor those great grassroots practices — as they become more official metadata standards.

Central authorities (Kiwix) should provide basic guardrails & guidelines of course, but that's sufficient 👍

Thank you to everyone including @veloman-yunkan and @kelson42 and @mgautierfr working very hard on this critical question, helping it to evolve quickly in coming years, and every step of the way.

mgautierfr commented 1 year ago

I have open a issue #340 to discuss what to test and how. Please continue the discussion there and keep this discussion for PR review.

kelson42 commented 1 year ago

I just want to emphasis again that we need this basic metadata feature pretty quickly. Thank you for having open side tickets to continue the interesting discussion and keep this PR "simple" without "insulting" the future.

kelson42 commented 1 year ago

@veloman-yunkan I hope you got all the feedbacks needed, let us know othwerwise.

veloman-yunkan commented 1 year ago

With illustration metadata in the mix, in what units min and max length values should be specified - bytes or (unicode) characters?

  1. Most metadata being textual, it's better to specify the metadata size limits in characters and not impose any size constraints on illustration metadata. 1.5 Like above, but handle illustration metadata as a special case in the C++ code and interpret its size limits in bytes.
  2. Specify size limits in bytes. Then languages with non-latin scripts will be penalized with lower human-comprehensible upper limits on the length of the description metadata.
rgaudin commented 1 year ago

With illustration metadata in the mix, in what units min and max length values should be specified - bytes or (unicode) characters?

  1. We'd only check the text length already in the spec for the couple that have limits. Illustration have no such constraint in the spec so we'd leave them out for now.
veloman-yunkan commented 1 year ago

A preliminary version of this PR (reflecting feedback received on the draft version) is ready. Note that the PR doesn't utilize the new Metadata utility in any of the zim tools (zimcheck, zimwriterfs).

kelson42 commented 1 year ago

@veloman-yunkan Thank you, just to ensure we understand us well, this is mandatory that this is used at least in zimcheck.

kelson42 commented 1 year ago

@mgautierfr any feedback/review?

veloman-yunkan commented 1 year ago

How do you plan to implement checkTextLanguage ? Or it is just a sample and it have to be removed before merge ?

In this PR checkTextLanguage() was introduced only to illustrate support for complex constraints. I will get rid of it before merging.

mgautierfr commented 1 year ago

In this PR checkTextLanguage() was introduced only to illustrate support for complex constraints. I will get rid of it before merging.

Ok. You can remove it right now (or comment it to not loose the illustration purpose) while you fix the issue with last pointer.

veloman-yunkan commented 1 year ago

@veloman-yunkan Thank you, just to ensure we understand us well, this is mandatory that this is used at least in zimcheck.

I added usage of this new utility in zimwriterfs. As a result a few issues (including two preexisting bugs related to the --longDescription option of zimwriterfs) were uncovered during testing. I will finalize this PR tomorrow.

veloman-yunkan commented 1 year ago

The PR is almost ready. zim::Metadata is now used both in zimwriterfs and zimcheck. However, the metadata constraints were relaxed in order to let the zimcheck unit tests pass with existing test ZIM data. Once this version is "approved", I will fix/regenerate the test data and restore the metadata constraints to the state corresponding to the current spec.

veloman-yunkan commented 1 year ago

Note that metadata constraints were relaxed in order to let the zimcheck unittests pass without updating/regenerating the unittest ZIM data.

How complex it is to regenerate the unittest data ?

The regeneration itself should be easy. The question is whether I should use the current release of zimwriterfs or the one used to create the current version of test ZIM files.

mgautierfr commented 1 year ago

The regeneration itself should be easy. The question is whether I should use the current release of zimwriterfs or the one used to create the current version of test ZIM files.

I think here this is not relevant (at least for the change we want to made). But the question is not easy to answer for generic case. The solution is probably in creating a full test suite with zim files created/tested with different version (https://github.com/openzim/zim-testing-suite)

veloman-yunkan commented 1 year ago

The PR is final (though needs at least a rebase&fixup unless other issues are spotted).

Known issues:

kelson42 commented 1 year ago

@veloman-yunkan For now, I'm happy with this kind of regex error.

mgautierfr commented 1 year ago

Please rebase-fixup and we can merge.

veloman-yunkan commented 1 year ago

Please rebase-fixup and we can merge.

Done

mgautierfr commented 1 year ago

Regexp constraints can duplicate length constraints. For example for Language metadata the following pair of errors is reported when the language is specified using a 2-char code:

  • Language must contain at least 3 characters
  • Language doesn't match regex: \w{3}(,\w{3})*

I think this is a problem of coherence between checks and it can be fixed later (and we probably have to update the checks). For now the structure is good and we can merge this long PR.