Hints are not used - Githubissues

rgaudin commented 2 years ago

It seems hat while get_hints() is properly called by libzim, its result is either malformed or not used and we always fallback to the mimetype-based default.

import pathlib

import libzim.writer
from libzim.reader import Archive
from libzim.suggestion import SuggestionSearcher
from libzim.writer import (
    Creator,
    FileProvider,
    Hint,
    StringProvider,
)

class StaticItem(libzim.writer.Item):
    def __init__(self, **kwargs):
        super().__init__()
        for k, v in kwargs.items():
            setattr(self, k, v)

    def get_path(self) -> str:
        return getattr(self, "path", "")

    def get_title(self) -> str:
        return getattr(self, "title", "")

    def get_mimetype(self) -> str:
        return getattr(self, "mimetype", "")

    def get_contentprovider(self) -> libzim.writer.ContentProvider:
        if getattr(self, "filepath", None):
            return FileProvider(filepath=self.filepath)
        return StringProvider(content=getattr(self, "content", ""))

    def get_hints(self):
        # print(f"...get_hints() for {self.get_path()} -- {self.hints=}")
        return getattr(self, "hints", dict())

def main():
    fpath = pathlib.Path("test-hints.zim")
    yes_front = {Hint.FRONT_ARTICLE: True}
    no_front = {Hint.FRONT_ARTICLE: False}
    auto_front = {}
    with Creator(fpath) as c:
        c.add_item(
            StaticItem(
                path="f-art1",
                title="Article 1",
                mimetype="text/html",
                content=b"test",
                hints=yes_front,
            )
        )
        c.add_item(
            StaticItem(
                path="f-art2",
                title="Article 2",
                mimetype="text/html",
                content=b"test",
                hints=auto_front,
            )
        )
        c.add_item(
            StaticItem(
                path="n-art3",
                title="Article 3",
                mimetype="text/html",
                content=b"test",
                hints=no_front,
            )
        )
        c.add_item(
            StaticItem(
                path="f-art4",
                title="Article 4",
                mimetype="application/pdf",
                content=b"test",
                hints=yes_front,
            )
        )

    zim = Archive(fpath)
    print(f"{zim.has_title_index=}")
    suggestion_searcher = SuggestionSearcher(zim)
    suggestion = suggestion_searcher.suggest("Article")
    nb_res = suggestion.getEstimatedMatches()
    print(f"{nb_res=}")
    results = list(suggestion.getResults(0, nb_res))
    print(f"{results=}")

if __name__ == "__main__":
    main()

Output:

❯ python test_hints.py
Resolve redirect
set index
zim.has_title_index=True
nb_res=3
results=['f-art1', 'f-art2', 'n-art3']

We should have three results but the expected ones should be:

f-art1 because it specifically sets FRONT_ARTICLE=True
f-art2 because it uses the default and has a text/html mimetype
f-art4 because it specifically sets FRONT_ARTICLE=True

mgautierfr commented 2 years ago

Here what is done :

FRONT ARTICLE and titleListingV1. Article are put in the title listing V1 only if the hint front article is set to 1 (true). Mimetype is not used at all.
Xapian indexing Article are "title indexed" if getIndexData return something, it has actually indexData (indexData->hasIndexData() returns true) and the title (indexData->getTitle()) is not empty. As we use the default index data, we index the article if it is html and have a title. We don't care about FRONT_ARTICLE here.
Suggestion If there is a title xapian indexing (always true for new zim create with libzim compiled with xapian), suggestion use the xapian indexing and not the titleListingV1. (Other implementation may do differently (kiwix-js never use xapian))

So the behavior of you test is correct with what is done. FRONT_ARTICLE is correctly use and we are indexing (and retrieving) all html content.

However, we may want to change the behavior :

Use the mimetype if the FRONT_ARTICLE is not passed (in this case, html is considered as a front article)
Use the FRONT_ARTICLE to xapian index or not the title. (But we have to be careful about coherence. What happens if FRONT_ARTICLE is false and we have a indexData with title ? Or the opposite ?)

rgaudin commented 2 years ago

OK, glad there is no bug but the doc is either incorrect or misleading.

FRONT_ARTICLE mark entry (item or redirection) as main article for the reader (typically a html page in opposition to a resource file as css, js, …). Random and suggestion feature will search only in entries marked as FRONT_ARTICLE. If no entry are marked as FRONT_ARTICLE, all entries will be used.

I am worried about two things:

it's a little too hard to predict
it's not coherent.

The problem is, FRONT_ARTICLE is pretty expressive, and while not setting it defaulting to whatever implementation we want is OK, setting it should settle the behavior: True, it's a front-article, I want it to show up in suggestions. False, it's shouldn't.

Now I understand the problem with search. We are using the same FRONT_ARTICLE concept to filter entries between those exposed and those that are not. So, following this assertion, suggestion and search should follow the same criteria (assuming data is indexed).

Sketch of expected behavior:

# a creation time
if FRONT_ARTICLE is None:
  FRONT_ARTICLE = guess_from_mimetype()
if FRONT_ARTICLE:
    add_to_titleListing()
    if with_xapian:
        add_to_xapian_title_index()
        if item.has_index_data:
            add_to_xapian_content_index()

# suggest()
if has_xapian_title_index():
   find_entries_in_index()
else:
    find_entries_in_listing()

# search()
if has_xapian_index():
   find_entries_in_index_matching()

what do you think?

kelson42 commented 2 years ago

Seems indeed a bit hard to understand and hard to know how to simplify :) Should we keep this for the hackathon?

rgaudin commented 2 years ago

Yes, that's a good idea. Adding it to the Wiki and adapting scraperlib tests to current behavior.

kelson42 commented 2 years ago

We need first to implement openzim/libzim#642 to then be able to check if we could close this ticket.

rgaudin commented 2 years ago

And #92 as well

kelson42 commented 2 years ago

https://github.com/openzim/libzim/issues/642 has been implemented. What should we do with this ticket?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

kelson42 commented 2 years ago

@mgautierfr @rgaudin I think (have unterstood) that this ticket is super straight forward. Can we go forward. Who? Should reassign it to @mgautierfr

rgaudin commented 2 years ago

Tested above code with current codebase and got the result that past-me said was expected.

openzim / python-libzim

Hints are not used #125