openzim / python-libzim

Libzim binding for Python: read/write ZIM files in Python
https://pypi.org/project/libzim/
GNU General Public License v3.0
62 stars 20 forks source link

Libzim next #82

Closed mgautierfr closed 3 years ago

mgautierfr commented 3 years ago

This is an adaptation of python-libzim to the new libzim api.

The new api concentrate on reading and writing part (not search).

C++ api provides some small default implementation for item and contentprovider. It is not wrapped in python api. This is open to discussion but I think we should not wrap them but reimplement them in python (still being part of the api). Wrapping the default implementation may be difficult (especially the fact that a Item may return a python or cpp implemented ContentProvider). The implementation is pretty simple, we can directly implement them in python without having to take care of that.

rgaudin commented 3 years ago

ISSUE: unexpected suggest behavior

Using libzim_next (nightly version from 2021-01-18)

Using isago ZIM, which is composed of a few articles with all the same title (I SAGO – tarifs de l'éléctricité au Mali), I get a correct behavior when searching:

All of this is OK.

When getting suggestion, given I have 4 articles with same title I should either get no result or 4 results.

Might be a libzim error or just a misinterpretation from my side.

rgaudin commented 3 years ago

ISSUE: Segfaults when setting config

import pathlib
from libzim.writer import Creator

with Creator(pathlib.Path("test.zim")) as c:
   c.configIndexing(True, "eng")

This segfaults while the same code without configIndexing created a proper (yet empty) ZIM.

rgaudin commented 3 years ago

QUESTION: Should we provide access to entry ID?

Entry's ID is not exposed in the python API and I don't think we need it but the fact that entry.get_redirect_entry() would raise RuntimeError: Entry 2 is not a redirect entry. makes me think this is not coherent. Either we should provide this ID or be path-first and print that libzim error message with the path instead of the ID.

rgaudin commented 3 years ago

QUESTION: can you clarify the difference between the Entry's path and the Item's path?

Are those those separate values in the ZIM or is the Entry's path just a shortcut to the item's one? It would seem so from a writer's perspective as we only create Item and thus provide this data only once but we can have something like

zim.main_entry.path
> 'mainPage'

zim.main_entry.get_item().path
> 'home'

Is this just because get_item() resolves the redirect?

rgaudin commented 3 years ago

QUESTION: Main Entry's behavior between new and old ZIMS

With a ZIM created using old libzim, it gets the actual home article on zim.get_main_entry. Using this new creator, I can only get a stub redirect

# using older zim
zim.main_entry
> Entry(url=A/index.html, title=index.html)

# using new creator
zim.main_entry
> Entry(url=mainPage, title=mainPage)
zim.get_entry_by_path("mainPage")
> RuntimeError: Cannot find entry

Is this the wanted behavior? I think we'll have to adapt scraper/tools' tests to this.

rgaudin commented 3 years ago

QUESTION: Why are we returning self on configXXX() calls?

rgaudin commented 3 years ago

ISSUE: Interpreter crash on missing gen_blob()

If the subclassed ContentProvider doesn't implement gen_blob, the interpreter crashes with Abort trap 6. See test_notimplementing_contentprovider_gen_blob in test_libzim.py

rgaudin commented 3 years ago

QUESTION: Should we provide start/end methods?

We've discussed that in the previous impl. I believe. I don't have a clear use case for this but It would be close to nothing to maintain and would add some flexibility to users.

codecov[bot] commented 3 years ago

Codecov Report

Merging #82 (a29dff3) into master (051a07e) will decrease coverage by 3.70%. The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##            master      #82      +/-   ##
===========================================
- Coverage   100.00%   96.29%   -3.71%     
===========================================
  Files            2        3       +1     
  Lines          103      297     +194     
===========================================
+ Hits           103      286     +183     
- Misses           0       11      +11     
Impacted Files Coverage Δ
libzim/reader.py 100.00% <100.00%> (ø)
libzim/wrapper.pyx 95.06% <100.00%> (ø)
libzim/writer.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 051a07e...a29dff3. Read the comment docs.

rgaudin commented 3 years ago

QUESTION: Am I right to assume that kiwix-serve will not be serving metadata on M/ prefix with that new libzim (unless it implements a new, special behavior for it) ?

Related: with this new impl, we don't have access to other namespaces content (except metadata through those 2 methods), right ?

rgaudin commented 3 years ago

Related: with this new impl, we don't have access to other namespaces content (except metadata through those 2 methods), right ?

Please disregard this comment, I understand we can access some namespaces by prefixing the path.

There is no notion of namespace in the API. If you want a metadata, use Archive::getMetadata(). If you want an entry, use Archive::getEntryByPath(). Either by using a "full" path (with the namespace) or a "simple" path (for compatibility libzim search in namespace 'A', 'I', 'J' and '-').

mgautierfr commented 3 years ago

Here few answers. Please open issues for the bugs.


ISSUE: unexpected suggest behavior

This is a real issue with the search but probably not related to libzim next. The suggestion is base on a xapian database feeded only with title. The results come from xapian. We probably miss creating the database or something else.

ISSUE: Segfaults when setting config

This is a issue. Does it crash directly when you call the configIndex method or later ? (With the code you've provided, "later" is just after as we are closing the contextManager)

QUESTION: Should we provide access to entry ID?

You are right. We should only use path.

QUESTION: can you clarify the difference between the Entry's path and the Item's path?

On a zim structure point-of-vue, there is no Entry or Item. There are dirents, and dirents may be a redirect one or a "classic" one (with content associated).

The new API hides this and create two new entities that are purely conceptual. The entry is the entry point to something. It could be a redirect or a classical dirent. The item is the "content". It units the classical dirent and the associated content.

The item is not a different thing than a entry. It is a "upgraded" entry. The path/title of a entry is the same than the associated item.

get_item() resolve the redirect for you, so potentially the given item is not the "same" than the entry. Cpp getItem() has a boolean parameter telling if we should resolve the redirect and give the resolved item or not (and raise a exception if the entry is a redirect) I don't remember why I've not exposed this parameter in the python version.

On the writer side, you can indeed only add a direction (using addRedirection) or add an item (addItem).

QUESTION: Main Entry's behavior between new and old ZIMS

The "new" mainPage is a redirection to the entry configured with setMainPath (before it was directly the configured entry). This is needed as on kiwix-serve we need to have a redirection to ensure internal relative path are correct. We were handle this in kiwix-lib but it is better to have it directly in libzim.

zim.get_entry_by_path(zim.main_entry.get_item().path) should always work.

QUESTION: Why are we returning self on configXXX() calls?

This is useful to chain calls :

zim_creator = Creator(tmpdir / "test.zim")
zim_creator.configIndexing(True, "eng").configFoo().configBar()

One important point slightly explained in cpp (https://github.com/openzim/libzim/blob/master/include/zim/writer/creator.h#L35-L46) and not explain in python is that the configXXX must be called before the start of the creation process (where setXXX can be call before or after the start) It may explain the segfaults with configIndex. (We should document that, and properly handle the error if user call configXXX after the creation)

ISSUE: Interpreter crash on missing gen_blob()

I will investigate.

QUESTION: Should we provide start/end methods?

Not sure of that. It is easier to add them if needed than remove them. Let's move without them for now.

QUESTION: Am I right to assume that kiwix-serve will not be serving metadata on M/ prefix with that new libzim (unless it implements a new, special behavior for it) ?

Yes, get_entry_by_path only allow to get content add in the zim using addRedirection or addItem other kind of data must use another (specific) methods. kiwix-serve already has a way to get metadata for a zim file : /meta?content=foo_zim_name&name=author

I understand we can access some namespaces by prefixing the path.

This works only for old zim file for compatibility. On new zim file, the content is always search in the "main" namespace. You cannot access M/W/X namespace with get_entry_by_path

rgaudin commented 3 years ago

Thanks for the detailed response, here are the remaining actions to take:

rgaudin commented 3 years ago

@mgautierfr, I've added support for a few methods on the reader that I believe are useful. I've added those on separate commits so we can exclude them if you think we don't want pylibzim users to access them. I think all of those make sense except maybe for Item.size which is a libzim-way around Item.content.nbytes I've voluntarily skipped getEntryByTitle and similar though.

And please take a look at 3658ab0, there's probably a better way than using a temp string.

mgautierfr commented 3 years ago

ISSUE: Interpreter crash on missing gen_blob() I will investigate.

Can you test with https://github.com/openzim/libzim/tree/thread_exception branch ?

rgaudin commented 3 years ago

Coverage is not at 100% but close:

I am ready to merge this as soon as @mgautierfr agrees

mgautierfr commented 3 years ago

I agree to merge. There is probably more to do but let's do it in other PRs.