Closed veloman-yunkan closed 1 year ago
Patch coverage: 70.63
% and project coverage change: +1.79
:tada:
Comparison is base (
3dd2f59
) 43.79% compared to head (d32037c
) 45.58%.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
Inviting @mgautierfr @kelson42 to review the overall approach for centralizing all metadata checks in one place. Please look at the file metadata_constraints.cpp
(note that it is not compiled separately, but is rather #include
d in metadata.cpp
.
Issues noticed so far:
regex may be a convenient tool as a lowest common denominator for validating various strings, however it falls short for some types of data if we need more rigorous checks. In our case, validating the date with a simple regex will allow entering invalid dates (e.g. 30th of February). This can be fixed by changing the last field in the metadata info table from a regexp to a special polymorphic validator object - a concrete implementation of thereof can be created from a regular expression. Do we want to do it now?
@veloman-yunkan Thank you very much for your PR which is, at a first sight, really much the kind of code I was expected.
Indeed regexs fall short in many cases. The case I have in mind is checking illustration entries (requires libpng).
I'm not in favour of extending much the validation scenarios because this can go endlessly and my goal is to have the basic ones.
But I would appreciate to have the appropriate code infrastructure so we can easily do that work later... without rearchitecturing effort.
On my local computer, it does not compile:
FAILED: test/metadata-test.p/.._src_metadata.cpp.o
ccache c++ -Itest/metadata-test.p -Itest -I../test -I/usr/local/include -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wnon-virtual-dtor -Werror -std=c++11 -g -Werror -Wall '-DVERSION="3.1.3"' -DGTEST_HAS_PTHREAD=1 -pthread -MD -MQ test/metadata-test.p/.._src_metadata.cpp.o -MF test/metadata-test.p/.._src_metadata.cpp.o.d -o test/metadata-test.p/.._src_metadata.cpp.o -c ../src/metadata.cpp
../src/metadata.cpp: In static member function ‘static const zim::Metadata::ReservedMetadataRecord& zim::Metadata::getReservedMetadataRecord(const string&)’:
../src/metadata.cpp:100:14: error: ‘out_of_range’ is not a member of ‘std’
100 | throw std::out_of_range(name + " is not a reserved metadata name");
| ^~~~~~~~~~~~
../src/metadata.cpp:101:1: error: control reaches end of non-void function [-Werror=return-type]
101 | }
| ^
cc1plus: all warnings being treated as errors
[41/48] Compiling C++ object test/tools-test.p/.._src_zimwriterfs_zimcreatorfs.cpp.o
ninja: build stopped: subcommand failed.
I have mixed feelings about this. On one hand, this mainly highlights the shortcomings of such an approach but on the other hand, simple checks are better than none.
Couple of comments (already identified):
Once again we fall short on setting clear goals for our tools. zimcheck's description is “zimcheck checks the quality of a ZIM file.”. Does that mean that whenever zimcheck doesn't report an issue, the ZIM is guaranteed to be valid?
I join @kelson42 in thinking we want basic checks for now that we could extend in the future.
And that's it. The rest can be discussed and extended in separate tickets, raised by actual needs.
Although it serves a different purpose, scraperlib now (not being used yet) enforces correct metadata with more elaborate checks (actual language code, proper PNG with correct sizes, etc) so most of what we produce shall be valid in this regard.
Very thoughtful response from @rgaudin, and a big thank you to all 🏗️
I strongly support organic and free-form metadata standards (what's needed are strong norms and strong guidelines not bureaucratic rules) that allow grassroots initiatives to collaborate & innovate efficiently.
In fact even semi-structured data sometimes has an extremely valuable place along the way — thereby empowering regional and specialized communities to build their own ZIM files, with the metadata that their region/profession/culture truly needs.
For this reason I very strongly support allowing "free-form metadata fields" that not only permit but encourage grassroots (not centralized) community innovation to truly flourish.
Then later on, as strong community norms are independently nurtured + demonstrated + proven year-by-year-by-year, the world should honor those great grassroots practices — as they become more official metadata standards.
Central authorities (Kiwix) should provide basic guardrails & guidelines of course, but that's sufficient 👍
Thank you to everyone including @veloman-yunkan and @kelson42 and @mgautierfr working very hard on this critical question, helping it to evolve quickly in coming years, and every step of the way.
I have open a issue #340 to discuss what to test and how. Please continue the discussion there and keep this discussion for PR review.
I just want to emphasis again that we need this basic metadata feature pretty quickly. Thank you for having open side tickets to continue the interesting discussion and keep this PR "simple" without "insulting" the future.
@veloman-yunkan I hope you got all the feedbacks needed, let us know othwerwise.
With illustration metadata in the mix, in what units min and max length values should be specified - bytes or (unicode) characters?
With illustration metadata in the mix, in what units min and max length values should be specified - bytes or (unicode) characters?
A preliminary version of this PR (reflecting feedback received on the draft version) is ready. Note that the PR doesn't utilize the new Metadata
utility in any of the zim tools (zimcheck, zimwriterfs).
@veloman-yunkan Thank you, just to ensure we understand us well, this is mandatory that this is used at least in zimcheck.
@mgautierfr any feedback/review?
How do you plan to implement
checkTextLanguage
? Or it is just a sample and it have to be removed before merge ?
In this PR checkTextLanguage()
was introduced only to illustrate support for complex constraints. I will get rid of it before merging.
In this PR checkTextLanguage() was introduced only to illustrate support for complex constraints. I will get rid of it before merging.
Ok. You can remove it right now (or comment it to not loose the illustration purpose) while you fix the issue with last
pointer.
@veloman-yunkan Thank you, just to ensure we understand us well, this is mandatory that this is used at least in zimcheck.
I added usage of this new utility in zimwriterfs
. As a result a few issues (including two preexisting bugs related to the --longDescription
option of zimwriterfs
) were uncovered during testing. I will finalize this PR tomorrow.
The PR is almost ready. zim::Metadata
is now used both in zimwriterfs
and zimcheck
. However, the metadata constraints were relaxed in order to let the zimcheck
unit tests pass with existing test ZIM data. Once this version is "approved", I will fix/regenerate the test data and restore the metadata constraints to the state corresponding to the current spec.
Note that metadata constraints were relaxed in order to let the zimcheck unittests pass without updating/regenerating the unittest ZIM data.
How complex it is to regenerate the unittest data ?
The regeneration itself should be easy. The question is whether I should use the current release of zimwriterfs
or the one used to create the current version of test ZIM files.
The regeneration itself should be easy. The question is whether I should use the current release of zimwriterfs or the one used to create the current version of test ZIM files.
I think here this is not relevant (at least for the change we want to made). But the question is not easy to answer for generic case. The solution is probably in creating a full test suite with zim files created/tested with different version (https://github.com/openzim/zim-testing-suite)
The PR is final (though needs at least a rebase&fixup unless other issues are spotted).
Known issues:
Regexp constraints can duplicate length constraints. For example for Language metadata the following pair of errors is reported when the language is specified using a 2-char code:
Language must contain at least 3 characters
Language doesn't match regex: \w{3}(,\w{3})*
Regexp errors are not user-friendly. A solution to this issue is to provide a human comprehensible description of the constraint along the regexp. This approach smoothly extends to the proposed more general constraint framework with polymorphic validators (with regexps being only one of implementations).
@veloman-yunkan For now, I'm happy with this kind of regex error.
Please rebase-fixup and we can merge.
Please rebase-fixup and we can merge.
Done
Regexp constraints can duplicate length constraints. For example for Language metadata the following pair of errors is reported when the language is specified using a 2-char code:
- Language must contain at least 3 characters
- Language doesn't match regex: \w{3}(,\w{3})*
I think this is a problem of coherence between checks and it can be fixed later (and we probably have to update the checks). For now the structure is good and we can merge this long PR.
This PR addresses https://github.com/openzim/zim-tools/issues/334#issuecomment-1476307896
Fixes #334 Fixes #336