openzim / libzim

Reference implementation of the ZIM specification
https://download.openzim.org/release/libzim/
GNU General Public License v2.0
169 stars 50 forks source link

Consider moving zim::Metadata into libzim #785

Open veloman-yunkan opened 1 year ago

veloman-yunkan commented 1 year ago

openzim/zim-tools#339 introduced zim::Metadata in zim-tools source tree so that it could be immediately used as a shared utility between zimwriterfs and zimcheck. However libzim is a more logical home for zim::Metadata. Then (after an easy enhancement of zim::Metadata) simple constraints (involving a single metadata item) can be checked in zim::writer::Creator::addMetadata() and full checks can be run in zim::writer::Creator::finishZimCreation().

kelson42 commented 1 year ago

@mgautierfr Any opinion? I have nothing against.

mgautierfr commented 1 year ago

I'm against that. libzim doesn't know about metadata. All the api is agnostic to the metadata (you have a getMetadata(std::string name) but no getTitle()).

This is something already discussed, but libzim is content agnostic. I consider the metadata described in libzim format more as kiwix specification than zim ones. Simply because zim files without these metadata can correctly being read by libzim (but maybe not by kiwix readers). Checking the metadata is at same level than checking the url inside the articles. It may break how we use zim files but not how we parse them.

kelson42 commented 1 year ago

I consider the metadata described in libzim format more as kiwix specification than zim ones.

@mgautierfr I understand the way how you make the difference, but this is pretty much a personal POV. Metadata specs are part of the ZIM specification like the rest.

Beside the question of the principle, do you see any bad effect that such a move would generate?

mgautierfr commented 1 year ago

We already had this discussion in https://github.com/openzim/zim-tools/issues/336 which is exactly the root issue leading to zim::Metadata. I let you read again my comment (https://github.com/openzim/zim-tools/issues/336#issuecomment-1474875186) you were agree with (to the point you moved the issue from libzim to zim-tools)

kelson42 commented 1 year ago

@mgautierfr I understand your opinion (about the separation of duties between the libzim and the zim-tools) which is the one we have pursued since a long time. At this stage, I either fully disagree or agree with it. Actually, so far I have been supportive of your POV like you have mentioned (but I disagree - and always have AFAIK - with the assertion that metadata size limits are not part of the ZIM specification).

But I see that this metadata checking is currently under implementation at multiple levels in scrapers and binding and zim-tools. I'm concerned about this redundancy. This new reality is why I have asked "Beside the question of the principle, do you see any bad effect that such a move would generate?" (which is not a question you have already answered in the past AFAIK).

rgaudin commented 1 year ago

Just to be clear, redundancy-wise (already wrote that somewhere else), the scrapers needs to validate Metadata early to provide direct feedback as most of those important metadata are user-provided.

This means having direct access to checks (could be possible via libzim) but also complete checks.

We are quite happy with scraperlib and I think that in this case it's more practical to have redundancy in scraperlib and JS-binding rather than bringing ISO-639-3 and PNG library to libzim.

The point is sort of the same: if the spec allows setting invalid metadata, it's hard to justify such requirements just to check them. Similar to the separate writer/reader libraries topic.

kelson42 commented 1 year ago

Just to be clear, redundancy-wise (already wrote that somewhere else), the scrapers needs to validate Metadata early to provide direct feedback as most of those important metadata are user-provided.

This is a good point I have forgotten to say. We might be more interested in the definition of the Metadata constraints in the libzim than in the check themselves. This is at least one of the altnernative I consider.