openzim / libzim

Reference implementation of the ZIM specification
https://download.openzim.org/release/libzim/
GNU General Public License v2.0
164 stars 50 forks source link

Introduce addAlias method. #833

Closed mgautierfr closed 9 months ago

mgautierfr commented 10 months ago

This is the first PR for https://github.com/kiwix/overview/issues/95

Fix #824

codecov[bot] commented 10 months ago

Codecov Report

Attention: 19 lines in your changes are missing coverage. Please review.

Comparison is base (ca1bcee) 57.57% compared to head (fcc3eec) 57.60%.

:exclamation: Current head fcc3eec differs from pull request most recent head 9a59af5. Consider uploading reports for the commit 9a59af5 to get more accurate results

Files Patch % Lines
src/writer/creator.cpp 37.50% 4 Missing and 6 partials :warning:
src/writer/_dirent.h 64.28% 3 Missing and 2 partials :warning:
src/item.cpp 0.00% 2 Missing :warning:
src/writer/dirent.cpp 83.33% 0 Missing and 1 partial :warning:
src/writer/direntPool.h 90.90% 0 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #833 +/- ## ========================================== + Coverage 57.57% 57.60% +0.02% ========================================== Files 99 99 Lines 4542 4590 +48 Branches 1909 1922 +13 ========================================== + Hits 2615 2644 +29 - Misses 668 677 +9 - Partials 1259 1269 +10 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

mgautierfr commented 10 months ago

In the current implementation information about presence of clones in nowhere recorded.

It is somehow intended. In fact nothing in the specification (from the beginning) prevent two dirent to point to the same data. It is just that no implementation allow that. I don't see why we should inform about their presence. Especially, reader should always be prepared to that.

Have conscious decisions been made (and properly implemented) regarding the following questions:

  • How should clones be accounted for in media type counts?
  • How should full text indexing on cloned entries work? If FT indexing associates a document only with the entry added via the zim::Creator::addItem() method, then the cloned dirents better be named aliases because of the resulting asymmetry.

Indeed. For now, clones are handled "as redirect":

This is mainly a technical limitation as we don't have access to the content (only the cluster/blob index) when we add the clone.

Shouldn't it be possible to iterate over items while skipping cloned entries (i.e. having each data item returned only once)? Shouldn't it be possible to find out all the clone-siblings of a given dirent?

I'm not sure we need that. The only need I see for now is in zimcheck but I handle that there.


This new feature may evolve and be used for something not foreseen. But for now, the use case is to "reference" binary content (as youtube video) under different names for zimit files.

[...] then the cloned dirents better be named aliases because of the resulting asymmetry.

This a good point. I agree with you. Maybe alias is a better name.

I have named it with a generic name as it is technically a generic feature but the intended use case is pretty limited:

So I don't think we need to do more, at least for now. If other use cases appear, we may extend our API accordingly.

veloman-yunkan commented 10 months ago

Shouldn't it be possible to iterate over items while skipping cloned entries (i.e. having each data item returned only once)? Shouldn't it be possible to find out all the clone-siblings of a given dirent?

I'm not sure we need that. The only need I see for now is in zimcheck but I handle that there.

Recovering information about a dirent being an alias requires extra effort. Can't we preserve it easily? For example is Dirent::parameter used for anything? If not, why don't we record the dirent's purpose there?

mgautierfr commented 10 months ago

Dirent::parameter used for anything? If not, why don't we record the dirent's purpose there?

Dirent::parameter is indeed not used for now. But is our only way to extend a dirent without changing the zim format so if we start to use it, we have to define a way to use it for several things. If we use it "blindly" to store "isClone" we lost a way to store other values.

The idea of this change was it was "creator" only and do not change anything on reading part. Let's me think on this 🙂

kelson42 commented 9 months ago

@mgautierfr Status not clear to me, seems you are over your side but new review has not been requested.

mgautierfr commented 9 months ago

Status not clear to me, seems you are over your side but new review has not been requested.

PR updated and ready for another review.

What is left open is the question about extending the format to use Dirent::parameter. I still not convince about that. But neither have a strong opinion on the opposite side.

I would suggest that we go with this PR and discuss the extension of the format in a issue. But the issue would have to be fixed/resolved BEFORE we do another release and start creating zim file with alias.

As long as I have not finished https://github.com/kiwix/overview/issues/95, there is no need to merge it. So maybe we can simply continue the review process to agree with @veloman-yunkan.

kelson42 commented 9 months ago

@veloman-yunkan

What is left open is the question about extending the format to use Dirent::parameter. I still not convince about that. But neither have a strong opinion on the opposite side.

I'm not convince about the necessity to do that in general, in right now in particular, I would recommend to open a dedicated ticket to keep track of this idea.

But it is pretty urgent to move forward with the code review process!

That said, we should be able to identify clearly a ZIM file which clearly potentialy have such clones (even if this was allowed implicitly in the past, it will be from now explicitly) I would recommend to: