openzim / python-libzim

Libzim binding for Python: read/write ZIM files in Python
https://pypi.org/project/libzim/
GNU General Public License v3.0

get_suggestions_results_count result is incorrect #72

Closed rgaudin closed 4 years ago

rgaudin commented 4 years ago

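The reader provides the suggestion feature through two methods, get_suggestions_results_count() and suggest(). The count returned by the former does not match the number of results yielded by the latter: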


import pathlib

import libzim.reader

# NOTE: the import for Creator was not shown in the original report.


def test_title(reader, title):
    # Compare the reported suggestion count with the suggestions actually returned
    nb = reader.get_suggestions_results_count(title)
    res = list(reader.suggest(title))
    print(title, "--", nb, len(res), res)


fpath = pathlib.Path("test.zim")

# Build a small ZIM file: two articles and three redirects to "A/home"
with Creator(fpath, "home", "fra") as creator:
    creator.add_article("home", title="Original", content="hello")
    creator.add_redirect("A/home2", "A/home", "Something")
    creator.add_redirect("A/home3", "A/home", "Something2")
    creator.add_redirect("A/home4", "A/home", "Else")
    creator.add_article("lalala", title="Lalala", content="hello again")

with libzim.reader.File(fpath) as reader:
    print("nb article", reader.article_count)
    test_title(reader, "Original")
    test_title(reader, "Something")
    test_title(reader, "Else")
    test_title(reader, "Lala")

Output:

nb article 10
Original -- 2 1 ['A/home']
Something -- 3 2 ['A/home2', 'A/home3']
Else -- 2 1 ['A/home4']
Lala -- 1 1 ['A/lalala']

mgautierfr commented 4 years ago

get_suggestions_results_count is a wrapper around the xapian function (even internally in libzim). It is the xapian code that estimates the number of results. (The underlying function is actually named get_matches_estimated, see https://github.com/openzim/python-libzim/blob/master/libzim/wrapper.pyx#L564, which implies that the number is not exact.)

I don't know what is happening here, or whether we can do anything about it.


Besides that, get_suggestions_results_count and suggest both start a new suggestion search (on the wrapper side). It would be better to avoid that, but that is a separate problem.
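
Since the count is only an estimate, a caller that needs an exact number could consume suggest() and count the results itself. Below is a minimal sketch of that idea. It assumes only that the reader exposes suggest(title), as in the report above. Both exact_suggestion_count and FakeReader are hypothetical names introduced for illustration. FakeReader simply replays the results from the report so that the snippet runs without a ZIM file.

```python
def exact_suggestion_count(reader, title):
    """Count suggestions by consuming suggest() instead of trusting the estimate."""
    return sum(1 for _ in reader.suggest(title))


class FakeReader:
    """Hypothetical stand-in for libzim.reader.File; replays the report's results."""

    _results = {
        "Original": ["A/home"],
        "Something": ["A/home2", "A/home3"],
        "Else": ["A/home4"],
        "Lala": ["A/lalala"],
    }

    def suggest(self, title):
        # Yield results lazily; unknown titles yield nothing
        yield from self._results.get(title, [])


reader = FakeReader()
titles = ("Original", "Something", "Else", "Lala")
counts = {t: exact_suggestion_count(reader, t) for t in titles}
print(counts)
```

The trade-off is that every suggestion has to be iterated just to obtain a number, which may be costly for queries with many matches.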

rgaudin commented 4 years ago

Well, these numbers are not completely wrong: they represent the number of entries for the query. The discrepancy with suggest() lies in the fact that suggest only offers results for direct articles and not for redirects.

Maybe we should consider it an indexing problem: as every redirect appears to count as a match for the query, maybe there's a way to not index those as matches for the target article and not the redirect?

mgautierfr commented 4 years ago

No, we don't do any filtering on redirects. For a "classical" article, we index the title and the content. For a redirect article, we index the title only (that of the redirect article, not of the target).

So we never associate the title "Original" with the redirect "A/home2". And if it were as you suggest, we would have a suggestion count of 4 (1 article + 3 redirects), but we only have 2.
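
To make the indexing rule above concrete, here is a toy model of it. This is only a sketch and not the libzim or xapian implementation. Entry, indexed_fields and title_matches are hypothetical names, and case-insensitive prefix matching on the title is a deliberate simplification standing in for the real query logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    path: str
    title: str
    content: str = ""
    redirect_to: Optional[str] = None  # set for redirects only


def indexed_fields(entry):
    """Articles index title and content; redirects index their own title only."""
    if entry.redirect_to is not None:
        return {"title": entry.title}
    return {"title": entry.title, "content": entry.content}


def title_matches(entries, query):
    """Toy matcher (assumption): case-insensitive prefix match on the indexed title."""
    q = query.lower()
    return [e.path for e in entries if indexed_fields(e)["title"].lower().startswith(q)]


# The entries created in the report above
entries = [
    Entry("A/home", "Original", content="hello"),
    Entry("A/home2", "Something", redirect_to="A/home"),
    Entry("A/home3", "Something2", redirect_to="A/home"),
    Entry("A/home4", "Else", redirect_to="A/home"),
    Entry("A/lalala", "Lalala", content="hello again"),
]

for query in ("Original", "Something", "Else", "Lala"):
    print(query, title_matches(entries, query))
```

Under this model, a redirect never matches through its target's title, so "Original" matches a single entry. That agrees with what suggest() returned in the report, and it explains neither the hypothetical 4 nor the observed count of 2.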

rgaudin commented 4 years ago

OK, I see. Then I don't know what we can do about it 🤷‍♂️