Closed veloman-yunkan closed 7 months ago
I think the documentation should be fixed. Something like:
The path must contains the namespace if applicable (old namespace scheme).
Thus for ZIM files utilizing the old namespace scheme getEntryByPath() can be used to access any entry, whereas for ZIM files utilizing the new namespace scheme getEntryByPath() only returns entries from the C namespace
This is known (at least by me 😄). This is why I propose in https://github.com/openzim/zim-tools/issues/315#issuecomment-1222365678 to totally drop the namespace from the api and provide other way (--metadata
, other ?) to retrieve other content.
The compatibility layer should be consistent: Path returned by Entry::get_path()
can be used by Archive::getEntryByPath
without modification. (Whatever it contains a namespace or not). The purpose is to get ride of the notion of namespace (if there is one, it will be seen as a subdirectory, as it is/was on url)
The main problem is about path stored elsewhere than libzim (bookmarks) with may now become invalid with a update of a zim file. (If we think that this case is pretty rare or acceptable, we could remove the compatibility layer)
@mgautierfr We have to be able to list entries per namespace in the libzim. I strongly oppose the idea that only the namespace C entries are free to the publisher to customize. We should really fix this soon, zimdump been broken, this is a regression and we have real lack of introspection tool.
I strongly oppose the idea that only the namespace C entries are free to the publisher to customize
Well, that was the whole purpose of https://github.com/openzim/libzim/issues/345 and https://github.com/openzim/libzim/pull/406, which is 3 years old.
It was explained in https://github.com/openzim/libzim/issues/325#issue-609126653 (explicitly in Removing namespace
section)
In https://github.com/openzim/zim-tools/issues/315#issuecomment-1235570245 you explicitly say that "The right approach is probably to remove fully the namespace concept"
If, as you said in https://github.com/openzim/zim-tools/pull/316#pullrequestreview-1749210968 you have fail to know if a zim has a full text index or not, we should probably fix zimdump to give this information with zimdump info
. (libzim itself already have the hasFulltextIndex method)
As long as docstrings are concerned, it looks like all API functions mentioning the namespace concept have to be reviewed and their documentation corrected.
@mgautierfr Thank you for your documented feedback. This complete the live discussion we had two weeks ago.
Let me try to better phrase my opinion (and it might have changed since last years):
1 - There is a clear functionality problem at zimdump
level
2 - There is at least a problem of feature a the libzim problem where we have implicit behaviour, misleading function/method names
3 - I believe pushing the idea that zimdump can not list any namespace except C
(or maybe H
at some point) to be an error caused by pushing the concept of namespace abstraction too far (ie. totally unavailable for anything outside the libzim). I might have changed of opinion on this and you have - with talent - argumented against that idea... this stays my feeling.
I might have misunderstanding something around the lizbim7 namespace concept. To me, more probably, this was something always a bit unclear ("fully hide namespaces to everything" vs "hide namespaces to the scraper and readers by providing user friendly primitives") as the main goal was to get things working. The fact that the libzim is not fully "clear" around the namespace depending functions and that zimdump (basically the only tool "presenting" the namespaces) tend the validate this theory.
Today, why we are here does not matter very much. I would be happy to be wrong on the whole line if things are clean and working fine. Getting to that level is the priority. @veloman-yunkan @mgautierfr Could you agree on a plan for both libzim/zimdump so we can move forward?
I see two solutions here (which are not exclusives)
C
), metadata (M
), xapian databases (through search functions or with hasFulltextIndex
))FileImpl
may be a good candidate)I see three arguments for keeping the ability of libzim to understand (at least) namespaces:
Regarding this issue, all I can add is that from the Kiwix JS perspective, it would be useful to have consistency amongst the libzim APIs (we are in the process of integrating the javascript-libzim port into Kiwix JS more fully).
List what is expected for things which "are clean and working fine". (Ie, what features we want in zimdump).
It has been essential to provide namespace independant, high level, functions and helper at both writer and reader level. This has been necessary to make everything robust, simpler and offering a homogenous user experience throug all readers.
But zimdump
should continue to work like it was, current and past usage()
are OK. Like said, zimdump
is not a normal reader, it's an inspection tool. I don't want and I don't know how I want it differently to what the zimdump usage() currently describe. I don't want a special behaviour for this or that namespace in zimdump or if that should exist, this would come on the top of basic namespace related features.
Now, we should:
After discussion (and few time to sleep on it), I propose this:
zim::Archive::getEntryByPath
. The purpose of this method is to access "content" entries. In new namespace scheme, it ALWAYS entries in C
namespace. On older zims, it may be in A
, J
, I
namespaces. All the code messing with namespace in getEntryByPath
is to provide a compatibility layer for potential bookmarks. Not to access content in another namespace.ZIM_PRIVATE
flag) getEntryByPathWithNs(char ns, const std::string& path)
to access "all" entries.getEntryByPathWithNs
to provide access to all entries (assuming that zimdump is somehow also a debug tools for zim file)@veloman-yunkan Are you OK with proposal? @mgautierfr would do the libzim part and you would fix zimdump.
@mgautierfr No feedback from @veloman-yunkan, please carry one with implementation.
@veloman-yunkan Are you OK with proposal?
At least I don't have any objections.
The implementation of
zim::Archive::getEntryByPath(const std::string& path)
is inconsistent with its documentation.The declaration/documentation states https://github.com/openzim/libzim/blob/56b1d9e7480e4055eda8f28f0046c4bec9bdfbcb/include/zim/archive.h#L221-L230
Note the "The path must contains the namespace" requirement.
However, for ZIM files using the new namespace scheme, the implementation assumes that the path references an item from the
C
namespace. Moreover, it may rewrite the path while looking for the result in the said namespace. https://github.com/openzim/libzim/blob/56b1d9e7480e4055eda8f28f0046c4bec9bdfbcb/src/archive.cpp#L215-L230It first tries to access the entry with the
C
namespace explicitly added to the path. If no such entry exists, it then replaces the namespace-like component in the path withC
and tries to return the entry at that path instead. This can lead to false results. Suppose that the ZIM file (utilizing the new namespace scheme) doesn't contain entries underQ/
orC/Q/
path prefix. Yet the request forQ/something
will be converted toC/something
and that item will be returned if it exists.Thus for ZIM files utilizing the old namespace scheme
getEntryByPath()
can be used to access any entry, whereas for ZIM files utilizing the new namespace schemegetEntryByPath()
only returns entries from theC
namespace and it may cheat while doing so.