openzim / python-libzim

Libzim binding for Python: read/write ZIM files in Python
https://pypi.org/project/libzim/
GNU General Public License v3.0
62 stars 22 forks source link

Hints are not used #125

Closed rgaudin closed 2 years ago

rgaudin commented 2 years ago

It seems hat while get_hints() is properly called by libzim, its result is either malformed or not used and we always fallback to the mimetype-based default.

import pathlib

import libzim.writer
from libzim.reader import Archive
from libzim.suggestion import SuggestionSearcher
from libzim.writer import (
    Creator,
    FileProvider,
    Hint,
    StringProvider,
)

class StaticItem(libzim.writer.Item):
    def __init__(self, **kwargs):
        super().__init__()
        for k, v in kwargs.items():
            setattr(self, k, v)

    def get_path(self) -> str:
        return getattr(self, "path", "")

    def get_title(self) -> str:
        return getattr(self, "title", "")

    def get_mimetype(self) -> str:
        return getattr(self, "mimetype", "")

    def get_contentprovider(self) -> libzim.writer.ContentProvider:
        if getattr(self, "filepath", None):
            return FileProvider(filepath=self.filepath)
        return StringProvider(content=getattr(self, "content", ""))

    def get_hints(self):
        # print(f"...get_hints() for {self.get_path()} -- {self.hints=}")
        return getattr(self, "hints", dict())

def main():
    fpath = pathlib.Path("test-hints.zim")
    yes_front = {Hint.FRONT_ARTICLE: True}
    no_front = {Hint.FRONT_ARTICLE: False}
    auto_front = {}
    with Creator(fpath) as c:
        c.add_item(
            StaticItem(
                path="f-art1",
                title="Article 1",
                mimetype="text/html",
                content=b"test",
                hints=yes_front,
            )
        )
        c.add_item(
            StaticItem(
                path="f-art2",
                title="Article 2",
                mimetype="text/html",
                content=b"test",
                hints=auto_front,
            )
        )
        c.add_item(
            StaticItem(
                path="n-art3",
                title="Article 3",
                mimetype="text/html",
                content=b"test",
                hints=no_front,
            )
        )
        c.add_item(
            StaticItem(
                path="f-art4",
                title="Article 4",
                mimetype="application/pdf",
                content=b"test",
                hints=yes_front,
            )
        )

    zim = Archive(fpath)
    print(f"{zim.has_title_index=}")
    suggestion_searcher = SuggestionSearcher(zim)
    suggestion = suggestion_searcher.suggest("Article")
    nb_res = suggestion.getEstimatedMatches()
    print(f"{nb_res=}")
    results = list(suggestion.getResults(0, nb_res))
    print(f"{results=}")

if __name__ == "__main__":
    main()

Output:

❯ python test_hints.py
Resolve redirect
set index
zim.has_title_index=True
nb_res=3
results=['f-art1', 'f-art2', 'n-art3']

We should have three results but the expected ones should be:

mgautierfr commented 2 years ago

Here what is done :

So the behavior of you test is correct with what is done. FRONT_ARTICLE is correctly use and we are indexing (and retrieving) all html content.

However, we may want to change the behavior :

rgaudin commented 2 years ago

OK, glad there is no bug but the doc is either incorrect or misleading.

FRONT_ARTICLE mark entry (item or redirection) as main article for the reader (typically a html page in opposition to a resource file as css, js, …). Random and suggestion feature will search only in entries marked as FRONT_ARTICLE. If no entry are marked as FRONT_ARTICLE, all entries will be used.

I am worried about two things:

The problem is, FRONT_ARTICLE is pretty expressive, and while not setting it defaulting to whatever implementation we want is OK, setting it should settle the behavior: True, it's a front-article, I want it to show up in suggestions. False, it's shouldn't.

Now I understand the problem with search. We are using the same FRONT_ARTICLE concept to filter entries between those exposed and those that are not. So, following this assertion, suggestion and search should follow the same criteria (assuming data is indexed).

Sketch of expected behavior:

# a creation time
if FRONT_ARTICLE is None:
  FRONT_ARTICLE = guess_from_mimetype()
if FRONT_ARTICLE:
    add_to_titleListing()
    if with_xapian:
        add_to_xapian_title_index()
        if item.has_index_data:
            add_to_xapian_content_index()

# suggest()
if has_xapian_title_index():
   find_entries_in_index()
else:
    find_entries_in_listing()

# search()
if has_xapian_index():
   find_entries_in_index_matching()

what do you think?

kelson42 commented 2 years ago

Seems indeed a bit hard to understand and hard to know how to simplify :) Should we keep this for the hackathon?

rgaudin commented 2 years ago

Yes, that's a good idea. Adding it to the Wiki and adapting scraperlib tests to current behavior.

kelson42 commented 2 years ago

We need first to implement openzim/libzim#642 to then be able to check if we could close this ticket.

rgaudin commented 2 years ago

And #92 as well

kelson42 commented 2 years ago

https://github.com/openzim/libzim/issues/642 has been implemented. What should we do with this ticket?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

kelson42 commented 2 years ago

@mgautierfr @rgaudin I think (have unterstood) that this ticket is super straight forward. Can we go forward. Who? Should reassign it to @mgautierfr

rgaudin commented 2 years ago

Tested above code with current codebase and got the result that past-me said was expected.