Mess with `zim::Archive::getEntryByPath(const std::string& path)`

veloman-yunkan commented 10 months ago

The implementation of zim::Archive::getEntryByPath(const std::string& path) is inconsistent with its documentation.

The declaration/documentation states https://github.com/openzim/libzim/blob/56b1d9e7480e4055eda8f28f0046c4bec9bdfbcb/include/zim/archive.h#L221-L230

Note the "The path must contains the namespace" requirement.

However, for ZIM files using the new namespace scheme, the implementation assumes that the path references an item from the C namespace. Moreover, it may rewrite the path while looking for the result in the said namespace. https://github.com/openzim/libzim/blob/56b1d9e7480e4055eda8f28f0046c4bec9bdfbcb/src/archive.cpp#L215-L230

It first tries to access the entry with the C namespace explicitly added to the path. If no such entry exists, it then replaces the namespace-like component in the path with C and tries to return the entry at that path instead. This can lead to false results. Suppose that the ZIM file (utilizing the new namespace scheme) doesn't contain entries under Q/ or C/Q/ path prefix. Yet the request for Q/something will be converted to C/something and that item will be returned if it exists.

Thus for ZIM files utilizing the old namespace scheme getEntryByPath() can be used to access any entry, whereas for ZIM files utilizing the new namespace scheme getEntryByPath() only returns entries from the C namespace and it may cheat while doing so.

mgautierfr commented 10 months ago

I think the documentation should be fixed. Something like:

The path must contains the namespace if applicable (old namespace scheme).

Thus for ZIM files utilizing the old namespace scheme getEntryByPath() can be used to access any entry, whereas for ZIM files utilizing the new namespace scheme getEntryByPath() only returns entries from the C namespace

This is known (at least by me 😄). This is why I propose in https://github.com/openzim/zim-tools/issues/315#issuecomment-1222365678 to totally drop the namespace from the api and provide other way (--metadata, other ?) to retrieve other content.

The compatibility layer should be consistent: Path returned by Entry::get_path() can be used by Archive::getEntryByPath without modification. (Whatever it contains a namespace or not). The purpose is to get ride of the notion of namespace (if there is one, it will be seen as a subdirectory, as it is/was on url)

The main problem is about path stored elsewhere than libzim (bookmarks) with may now become invalid with a update of a zim file. (If we think that this case is pretty rare or acceptable, we could remove the compatibility layer)

kelson42 commented 10 months ago

@mgautierfr We have to be able to list entries per namespace in the libzim. I strongly oppose the idea that only the namespace C entries are free to the publisher to customize. We should really fix this soon, zimdump been broken, this is a regression and we have real lack of introspection tool.

mgautierfr commented 10 months ago

I strongly oppose the idea that only the namespace C entries are free to the publisher to customize

Well, that was the whole purpose of https://github.com/openzim/libzim/issues/345 and https://github.com/openzim/libzim/pull/406, which is 3 years old.

It was explained in https://github.com/openzim/libzim/issues/325#issue-609126653 (explicitly in Removing namespace section)

In https://github.com/openzim/zim-tools/issues/315#issuecomment-1235570245 you explicitly say that "The right approach is probably to remove fully the namespace concept"

If, as you said in https://github.com/openzim/zim-tools/pull/316#pullrequestreview-1749210968 you have fail to know if a zim has a full text index or not, we should probably fix zimdump to give this information with zimdump info. (libzim itself already have the hasFulltextIndex method)

veloman-yunkan commented 10 months ago

As long as docstrings are concerned, it looks like all API functions mentioning the namespace concept have to be reviewed and their documentation corrected.

kelson42 commented 10 months ago

@mgautierfr Thank you for your documented feedback. This complete the live discussion we had two weeks ago.

Let me try to better phrase my opinion (and it might have changed since last years): 1 - There is a clear functionality problem at zimdump level 2 - There is at least a problem of feature a the libzim problem where we have implicit behaviour, misleading function/method names 3 - I believe pushing the idea that zimdump can not list any namespace except C (or maybe H at some point) to be an error caused by pushing the concept of namespace abstraction too far (ie. totally unavailable for anything outside the libzim). I might have changed of opinion on this and you have - with talent - argumented against that idea... this stays my feeling.

I might have misunderstanding something around the lizbim7 namespace concept. To me, more probably, this was something always a bit unclear ("fully hide namespaces to everything" vs "hide namespaces to the scraper and readers by providing user friendly primitives") as the main goal was to get things working. The fact that the libzim is not fully "clear" around the namespace depending functions and that zimdump (basically the only tool "presenting" the namespaces) tend the validate this theory.

Today, why we are here does not matter very much. I would be happy to be wrong on the whole line if things are clean and working fine. Getting to that level is the priority. @veloman-yunkan @mgautierfr Could you agree on a plan for both libzim/zimdump so we can move forward?

mgautierfr commented 10 months ago

I see two solutions here (which are not exclusives)

List what is expected for things which "are clean and working fine". (Ie, what features we want in zimdump). Issue openzim/zim-tools#316 is somehow already expecting a way to implement things, it is not a feature request (even if there is nothing wrong with the issue). My point is that we should remove namespace option from zimdump, but we have to list what we want to do with it to check it is ok and see what API we need. THEN, extend the libzim API to provide access to specific data. (We already have API for access user content (C), metadata (M), xapian databases (through search functions or with hasFulltextIndex))
Add a low level api in libzim to access raw/low structures in zim file. (FileImpl may be a good candidate)

Jaifroid commented 9 months ago

I see three arguments for keeping the ability of libzim to understand (at least) namespaces:

Backward compatibility with Type 0 ZIMs (lots of people archive them), since these use the full range of namespaces;
The fact that under the hood, there actually are namespaces C M W X in the ZIM, they are just generally abstracted by libzim (but not by the Kiwix JS implementation);
Future needs we can probably anticipate, e.g., storing Request or Response Headers in an H namespace, and others we can't anticipate right now.

Regarding this issue, all I can add is that from the Kiwix JS perspective, it would be useful to have consistency amongst the libzim APIs (we are in the process of integrating the javascript-libzim port into Kiwix JS more fully).

kelson42 commented 9 months ago

List what is expected for things which "are clean and working fine". (Ie, what features we want in zimdump).

It has been essential to provide namespace independant, high level, functions and helper at both writer and reader level. This has been necessary to make everything robust, simpler and offering a homogenous user experience throug all readers.

But zimdump should continue to work like it was, current and past usage() are OK. Like said, zimdump is not a normal reader, it's an inspection tool. I don't want and I don't know how I want it differently to what the zimdump usage() currently describe. I don't want a special behaviour for this or that namespace in zimdump or if that should exist, this would come on the top of basic namespace related features.

Now, we should:

Ensure high level functions to be used on a daily base for both readers and writers are really namespace indenpendant and that there is no ambiguity
Keep low-level functions available for zimdump and actually for anybody wanting to access to libzim internals. We can call them how we want be they need to be clearly available and identifiable.
Secure the documentation is clear and coherent with the reality.

mgautierfr commented 9 months ago

After discussion (and few time to sleep on it), I propose this:

Fix the documentation for zim::Archive::getEntryByPath. The purpose of this method is to access "content" entries. In new namespace scheme, it ALWAYS entries in C namespace. On older zims, it may be in A, J, I namespaces. All the code messing with namespace in getEntryByPath is to provide a compatibility layer for potential bookmarks. Not to access content in another namespace.
Implement missing high level API (if any).
Implement a private API (to be put under the existing ZIM_PRIVATE flag) getEntryByPathWithNs(char ns, const std::string& path) to access "all" entries.
On zimdump side, use the higher api when possible to give information about zim file. (Issues on zim-tools to create)
On zimdump side, use getEntryByPathWithNs to provide access to all entries (assuming that zimdump is somehow also a debug tools for zim file)

kelson42 commented 9 months ago

@veloman-yunkan Are you OK with proposal? @mgautierfr would do the libzim part and you would fix zimdump.

kelson42 commented 8 months ago

@mgautierfr No feedback from @veloman-yunkan, please carry one with implementation.

veloman-yunkan commented 8 months ago

@veloman-yunkan Are you OK with proposal?

At least I don't have any objections.

openzim / libzim

Mess with `zim::Archive::getEntryByPath(const std::string& path)` #843