openzim / libzim

Reference implementation of the ZIM specification
https://download.openzim.org/release/libzim/
GNU General Public License v2.0

Weird suggestion behavior when indexdata is passed but item is not marked as "FRONT_ARTICLE" #902

Closed benoit74 closed 3 months ago

benoit74 commented 3 months ago

I'm creating a helper class in python-scraperlib to pass custom indexdata to libzim.

The code is pretty straightforward:

""" Special item with customized index data and helper classes """

from __future__ import annotations

import io
import pathlib
from typing import Any

import libzim.writer  # pyright: ignore

from zimscraperlib.zim.items import StaticItem

class IndexData(libzim.writer.IndexData):
    """IndexData to properly pass indexing title and content to the libzim

    Both title and content have to be customized (title can be identical to item title
    or not).
    """

    def __init__(self, title: str, content: str):
        self.title = title
        self.content = content

    def has_indexdata(self):
        return len(self.content) > 0 or len(self.title) > 0

    def get_title(self):
        return self.title

    def get_content(self):
        return self.content

    def get_keywords(self):
        return ""

    def get_wordcount(self):
        return len(self.content.split()) if self.content else 0

class IndexingItem(StaticItem):
    """A StaticItem capable to customize indexing data (title and content).

    When indexdata (with title and content) is passed, this is used for indexing.
    """

    def __init__(
        self,
        content: str | bytes | None = None,
        fileobj: io.IOBase | None = None,
        filepath: pathlib.Path | None = None,
        path: str | None = None,
        title: str | None = None,
        mimetype: str | None = None,
        hints: dict | None = None,
        indexdata: IndexData | None = None,
        **kwargs: Any,
    ):

        super().__init__(
            content, fileobj, filepath, path, title, mimetype, hints, **kwargs
        )

        # indexdata has been externally configured
        if indexdata:
            self.get_indexdata = lambda: indexdata
            return

This code works well when the hint FRONT_ARTICLE is set to True: the item title is used for suggestions, and the index data title and content are used for search. The following test passes:

def test_indexing_item_is_front(tmp_path, png_image):
    """Create a ZIM with a single item with customized title and content for indexing"""
    fpath = tmp_path / "test.zim"
    main_path = "welcome"
    with Creator(fpath, main_path).config_dev_metadata() as creator:
        creator.add_item(
            IndexingItem(
                filepath=png_image,
                path="welcome",
                title="brain food",  # title used for suggestions
                indexdata=IndexData(
                    title="screen", content="car"  # title and content used for search
                ),
                hints={libzim.writer.Hint.FRONT_ARTICLE: True},
            )
        )
    assert fpath.exists()

    reader = Archive(fpath)
    assert "welcome" in list(reader.get_suggestions("brain"))
    assert "welcome" in list(reader.get_suggestions("food"))
    assert "welcome" not in list(reader.get_suggestions("screen"))
    assert "welcome" not in list(reader.get_suggestions("car"))
    assert reader.get_search_results_count("screen") >= 1
    assert reader.get_search_results_count("car") >= 1
    assert reader.get_search_results_count("brain") == 0
    assert reader.get_search_results_count("food") == 0

However, when the item is passed with the hint FRONT_ARTICLE set to False, the results are weird. The following test passes:

def test_indexing_item_not_front(tmp_path, png_image):
    fpath = tmp_path / "test.zim"
    main_path = "welcome"
    with Creator(fpath, main_path).config_dev_metadata() as creator:
        creator.add_item(
            IndexingItem(
                filepath=png_image,
                path="welcome",
                title="brain food",  # title used for suggestions
                indexdata=IndexData(
                    title="screen", content="car"  # title and content used for search
                ),
                hints={
                    libzim.writer.Hint.FRONT_ARTICLE: False  # mark as not front (silly)
                },
            )
        )
    assert fpath.exists()

    reader = Archive(fpath)
    assert "welcome" in list(reader.get_suggestions("brain"))
    assert "welcome" not in list(reader.get_suggestions("food"))
    assert "welcome" not in list(reader.get_suggestions("screen"))
    assert "welcome" not in list(reader.get_suggestions("car"))
    assert reader.get_search_results_count("screen") >= 1
    assert reader.get_search_results_count("car") >= 1
    assert reader.get_search_results_count("brain") == 0
    assert reader.get_search_results_count("food") == 0

I don't get why "brain" returns a suggestion while "food" does not. I don't know whether the hint should be respected or simply ignored, but suggestions should at least behave consistently.

mgautierfr commented 3 months ago

If no front article is present in the zim file, libzim doesn't create a title xapian index. So, when searching for suggestions, libzim falls back to a binary search on the title and returns only articles whose title starts with the query.

If you add another front article (and keep "brain food" as a non-front article), it should behave correctly (well, it already does, but let's say as expected) and not return "brain food" for the "brain" query.
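To make the two code paths concrete, here is a minimal pure-Python sketch of the behavior described above. It is not libzim code: `prefix_suggestions`, `word_suggestions` and `suggest` are hypothetical names, and the word matching is only a toy stand-in for the xapian title index.

```python
from bisect import bisect_left


def prefix_suggestions(sorted_titles, query):
    """Binary-search a sorted title list and return titles starting with `query`."""
    start = bisect_left(sorted_titles, query)
    matches = []
    for title in sorted_titles[start:]:
        if not title.startswith(query):
            break  # titles sharing a prefix are contiguous in sorted order
        matches.append(title)
    return matches


def word_suggestions(front_titles, query):
    """Toy stand-in for a title index: match any whole word of a front article title."""
    return [title for title in front_titles if query in title.split()]


def suggest(items, query):
    """`items` is a list of (title, is_front_article) pairs (hypothetical model)."""
    front_titles = [title for title, is_front in items if is_front]
    if front_titles:
        # A title index exists: only front articles are searched, by word.
        return word_suggestions(front_titles, query)
    # No front article, so no title index: fall back to prefix search on all titles.
    return prefix_suggestions(sorted(title for title, _ in items), query)


only_item = [("brain food", False)]
print(suggest(only_item, "brain"))  # matches, because the title starts with "brain"
print(suggest(only_item, "food"))   # no match, "food" is not a prefix of the title

with_front = [("brain food", False), ("other page", True)]
print(suggest(with_front, "brain"))  # no match, "brain food" is not a front article
```

With a single non-front item, only the prefix fallback runs, which is why "brain" matches and "food" does not in the second test.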

kelson42 commented 3 months ago

@mgautierfr Please make sure, if not already done, that this is fully documented.

benoit74 commented 3 months ago

Great, thank you!

mgautierfr commented 3 months ago

The fallback to binary search is already mentioned at the end of https://libzim.readthedocs.io/en/latest/usage.html#searching-for-suggestions

The fact that no xapian database is created when no front article is present is not documented (though I remember we already discussed it in a GitHub issue). Creating a zim file in which every item is explicitly marked with FRONT_ARTICLE set to false is a pretty specific use case. By default (when no hint is given), libzim will set an item as a front article if it is HTML (as explained in https://github.com/openzim/libzim/issues/642). So, except in some testing edge cases (as here), this should not happen.
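The default rule can be sketched as follows. This is an illustration only, not libzim's implementation: `is_front_article` is a hypothetical helper, the hint is modelled as a plain string key, and testing for a `text/html` mimetype prefix is an assumption about what "if it is html" means.

```python
def is_front_article(mimetype, hints=None):
    """Decide whether an item counts as a front article (hypothetical model)."""
    hints = hints or {}
    if "FRONT_ARTICLE" in hints:
        # An explicit hint always wins over the default.
        return bool(hints["FRONT_ARTICLE"])
    # Assumed default: HTML items are front articles, everything else is not.
    return mimetype.startswith("text/html")


print(is_front_article("text/html"))                            # True
print(is_front_article("image/png"))                            # False
print(is_front_article("image/png", {"FRONT_ARTICLE": True}))   # True
print(is_front_article("text/html", {"FRONT_ARTICLE": False}))  # False
```

Under this model, a zim file ends up without any front article only if it contains no HTML item or if every item is explicitly hinted as not being one.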