openzim / libzim

Reference implementation of the ZIM specification
https://download.openzim.org/release/libzim/
GNU General Public License v2.0

Put search result with exact title in first position. #766

Open mgautierfr opened 1 year ago

mgautierfr commented 1 year ago

When a user searches for a term, we should put the article with the exact same title first in the results (same for suggestions).

See comments in #653

danielzgtg commented 1 year ago

See https://github.com/kiwix/kiwix-android/issues/2033 and https://github.com/kiwix/kiwix-android/pull/2035 which I created 3 years ago.

kelson42 commented 1 year ago

@mgautierfr We need concrete examples and explanations about why this is not the case today. We should talk about "suggestions" as this is what this is about.

If the example is "apple", then the only way I see is to add a layer on top of Xapian, and I'm against it because it has been done at least twice (even by @mgautierf) and it brought more problems than benefits. I'm not in favour of repeating the same errors over time.

Jaifroid commented 1 year ago

By "accident", Kiwix JS does something like this. The accident is that we only very recently got full-text searching (thanks to the libzim WASM), so I grafted ft search on top of existing title search. Because ft search is considerably slower than title search, we get search results coming in a two-stage process: exact prefix matching first (pseudo-case-insensitive), and then a few seconds later, the ft results (which are pruned to remove any duplicates before displaying them).

NB We can't currently provide any "snippets", because that part of the API isn't yet bound to JavaScript. (It might be too slow, anyway.)

kelson42 commented 1 year ago

@Jaifroid I find it hard to believe that doing what you describe implements this feature, because ft search does not implement this feature either.

danielzgtg commented 1 year ago

> It might be too slow, anyway

kiwix-android search feels slower than kiwix-js anyway. It spins for noticeably longer than on desktop.

> ft search does not implement this feature either.

Perhaps something else in kiwix-js is implementing this. On kiwix-js I can actually find what I'm looking for in the first result, while on kiwix-android I have to scroll and look through a bunch of random results.

rgaudin commented 1 year ago

> @Jaifroid I find it hard to believe that doing what you describe implements this feature, because ft search does not implement this feature either.

You've read it wrong I believe. @Jaifroid said there is a Title-prefix search displayed (sort of like suggestions) while the FT search is requested in the background and once FT results are ready, those are added to the page (removing the entries that were already there from the prefix search).

rgaudin commented 1 year ago

Regardless of how practical it is to implement, I support the feature request as this is, IMO, a very common scenario: you type a request and you get the suggestions, but they are not giving you exactly what you wanted. So you type hoping for better results, expecting those entries to be present anyway.

Jaifroid commented 1 year ago

I guess we do need a proper specification of the problem. Kiwix Desktop (and Kiwix Serve) seem to do a version of prefix matching if you enter more than a single word, but we get a slightly unintuitive list of results for single words.

I compared searching for "caribbean basin" in Kiwix JS and Kiwix Desktop (see top screenshot, full English Wikipedia) -- almost exactly the same results for the title search (outlined in red). But with "apple" we get a very different search result order, with the first result matching the fruit being the one outlined in red in each case (bottom screenshot).

To be clear, Kiwix JS title search is not intelligent or weighted in any way: it merely does a binary search on as many upper-case and lower-case variants of the entered prefix as it can, and gathers anything that matches the prefix. It then fills up the rest of the space (up to the max search results requested, default 30, but user-selectable) with full-text search results (from which duplicates are removed).
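The two-stage process described above can be sketched roughly as follows. This is only an illustration in Python, not the Kiwix JS code: the function names, the particular set of case variants, and the use of plain lists of titles are all assumptions made for the example.

```python
from bisect import bisect_left


def case_variants(prefix):
    """Return a few common case variants of the prefix, typed form first.

    Assumption: the real variant set may differ; this is only a sample.
    """
    variants = [prefix, prefix.lower(), prefix.upper(),
                prefix.capitalize(), prefix.title()]
    return list(dict.fromkeys(variants))  # de-duplicate, keep order


def prefix_matches(sorted_titles, prefix):
    """Binary-search a sorted title list, then collect titles with the prefix."""
    start = bisect_left(sorted_titles, prefix)
    matches = []
    for title in sorted_titles[start:]:
        if not title.startswith(prefix):
            break  # titles are sorted, so no later title can match
        matches.append(title)
    return matches


def title_then_fulltext(sorted_titles, fulltext_results, query, limit=30):
    """Prefix matches first, then full-text results with duplicates removed."""
    results, seen = [], set()
    for variant in case_variants(query):
        for title in prefix_matches(sorted_titles, variant):
            if title not in seen:
                seen.add(title)
                results.append(title)
    for title in fulltext_results:
        if len(results) >= limit:
            break
        if title not in seen:
            seen.add(title)
            results.append(title)
    return results[:limit]
```

Note that `sorted_titles` has to be sorted in plain code-point order (what Python's `sorted()` does for strings), otherwise the binary search is not valid.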

[Screenshots: Search_comparison, apple_search]

kelson42 commented 1 year ago

@rgaudin Honestly, I have no real clue what this ticket is about, as there is no concrete example of input/output... If one is not provided, I will close the ticket, as I cannot follow what all this is about.

mgautierfr commented 1 year ago

My initial idea was about searching for a term. If you search for "Apple" on wikipedia_en_all, you have this list (https://library.kiwix.org/viewer#search?content=wikipedia_en_all_maxi_2023-02&pattern=apple):

The idea is to "move" the "Apple" result (the article whose title is equal, case-insensitively, to the search term) to the top of the list, as it is probably a really relevant result.

How the "move" is implemented is still open to discussion. It could be a specific criterion in Xapian that gives the highest score to the "Apple" article, or it could be the libzim iterator starting with "Apple" and then continuing with the classic Xapian results (skipping the "Apple" article in them), or libkiwix itself inserting the result in the HTML page (maybe in a specific section), or ...
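As a rough illustration of the second option (start with the exact-title article, then continue with the usual ranked results while skipping it), here is a minimal Python sketch. It is not libzim code: `promote_exact_title` is a hypothetical helper, it works on a plain list of titles, and it assumes the exact-title article already appears somewhere in the ranked results.

```python
def promote_exact_title(ranked_titles, query):
    """Move titles equal to the query (case-insensitively) to the front.

    The relative order of all other results is left untouched.
    """
    wanted = query.casefold()
    exact = [t for t in ranked_titles if t.casefold() == wanted]
    rest = [t for t in ranked_titles if t.casefold() != wanted]
    return exact + rest


print(promote_exact_title(
    ["Adam's apple", "Apple Inc.", "Apple", "Pineapple"], "apple"))
```

If the exact-title article is not among the ranked results, this sketch returns the list unchanged; looking the title up separately would be a further step not shown here.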

But as @Jaifroid suggests in his last comment, we could also do the same for suggestions.

This could be compared with https://github.com/kiwix/libkiwix/issues/748. We were redirecting directly to the exact title article in case of search. Now we are not redirecting, but at least we could put the exact title article first.

kelson42 commented 1 year ago

@mgautierfr To me, while the ticket seems obvious for suggestions, it sounds far less obvious for ft search. If I ft search "Verdun", I would kind of expect "Battle of Verdun" as the first result, but if I look for a suggestion, I'd kind of expect "Verdun" as the first result.

In both cases, this is the job of Xapian to deliver things properly... I see no fundamental reason it could not.

Jaifroid commented 1 year ago

> In both cases, this is the job of Xapian to deliver things properly...

What happens for ZIMs that don't have a Xapian index? Presumably fallback to binary search of Directory Entry titles.

rgaudin commented 1 year ago

> @mgautierfr To me, while the ticket seems obvious for suggestions, it sounds far less obvious for ft search. If I ft search "Verdun", I would kind of expect "Battle of Verdun" as the first result, but if I look for a suggestion, I'd kind of expect "Verdun" as the first result.
>
> In both cases, this is the job of Xapian to deliver things properly... I see no fundamental reason it could not.

I think there are two distinct discussions here: what we'd want to get and how to implement it. It's usually more efficient to define the former first and then try to reconcile with the second.

Away from all technical considerations, I believe if there is an entry matching the exact search query, it should be highlighted. It can be the first result or a different card or anything that tells the user “you've requested this, we have it”. Keep in mind that from a user's perspective the differences between suggestions and search are:

So it's reasonable to assume that a suggested Entry can be considered, but that the user would like more details before discarding it.

In terms of UX, I think that if the matching Entry is a redirect, I'd even want something like “Le great XXX (redirection from XXX)”.

I'd be careful with examples (in this ticket! Not in others related to improving search) as you seem to bring cultural background into them. We can design various scoring mechanisms to influence the sorting of search results.

In your example, on WPEN that battle article is not the first result. Verdun, the city, is. WPFR is similar but it could be different. That's a discussion about sorting and it's not what this ticket is about.

This ticket is about a UX improvement of asserting that the exact search query has a matching result and this could be highlighted.

I agree the ticket title is a bit incorrect as it suggests a technical solution.

kelson42 commented 1 year ago

> In both cases, this is the job of Xapian to deliver things properly...

> What happens for ZIMs that don't have a Xapian index? Presumably fallback to binary search of Directory Entry titles.

This topic is not a priority, considering we don't produce this kind of ZIM file. That said, considering the logic of dichotomic (binary) search, this should already be the case IMHO.

danielzgtg commented 1 year ago

I think it's important to have a concrete example. It's impossible to objectively measure whether the bug is fixed or not without a test case.

> it should be highlighted. It can be [...] or a different card

No, it shouldn't be a different card. On desktop, I want to just press the enter key without looking. On mobile, I want to tap the first search result row with my eyes closed.

Simple Wikipedia

I will be using https://library.kiwix.org/viewer#wikipedia_en_simple_all_mini_2023-03/A/Main_Page .

Example 1: apple

Expected behavior

Apple
Apple & Onion
Apple (company)
Apple (disambiguation)
Apple (tree)
Apple A10
Apple A10X
Apple A11
Apple A4
Apple A5

Actual Behavior

Apple
Adam's apple
Apple & Onion
Apple (company)
Apple (disambiguation)
Apple (tree)
Apple A10
Apple A10X
Apple A11
Apple A4

Example 2: mountain

Expected behavior

Mountain
Mountain (band)
Mountain Ash
Mountain Ash, Rhondda Cynon Taf
Mountain Avens
Mountain Brook, Alabama
Mountain Daylight Time
Mountain Dew
Mountain Gorilla
Mountain Grove, Missouri

Actual Behavior

Mountain
Baekdu Mountain
Bare Mountain
Bear Mountain
Brokeback Mountain
Daniel (mountain)
Death Mountain
Deomali (mountain)
Fold mountain
Folded mountain

Example 3: library

Expected behavior

Library
Library Network of Western Switzerland
Library Tower
Library and Archives Canada
Library classification
Library of Alexandria
Library of Birmingham
Library of Celsus
Library of Congress
Library of Congress Control Number

Actual Behavior

Library
1949 (library)
Bodleian Library
Bodleian library
British Library
Carnegie library
Library Tower
Library classification
National library
National library

Wiktionary

This bug is worse with Wiktionary, which is what I mainly use Kiwix for, though it has fewer users than Wikipedia. In Wiktionary, the exact result doesn't even appear first. I will use wiktionary_en_all_maxi_2023-02.zim, but it doesn't work in unpatched library.kiwix.org. It works on staging kiwix-js and normal kiwix-android.

Example 4: des

Expected behavior

des
des Abends
des Morgens
des Pudels Kern
des Weiteren
des avonds
des de
des doods
des families
des fois que

Actual Behavior

-des
-deş
DES
DEs
Des
dEs
des
des-
deś
deš

Example 5: que

Expected behavior

que
que Dios te bendiga
que aproveche
[redacted]
que chuta
que colsaconste
[redacted]
que demande le peuple
que descanse en paz
[redacted]

Actual Behavior

'que
-que
QUE
Que
Que.
Que(^')
que
què
qué
quê

The more intelligent suggestion behavior from https://simple.wikipedia.org/wiki/Main_Page that uses statistics is also good.

mgautierfr commented 1 year ago

> I agree the ticket title is a bit incorrect as it suggests a technical solution.

This ticket is a response to https://github.com/openzim/libzim/issues/653#issuecomment-1466377543, which states that we need other implementation ideas in order to discuss the need for a feature.

> To me, while the ticket seems obvious for suggestions, it sounds far less obvious for ft search. If I ft search "Verdun", I would kind of expect "Battle of Verdun" as the first result, but if I look for a suggestion, I'd kind of expect "Verdun" as the first result.

I don't see why we should have "Battle of Verdun" as the first result. If I search for Hiroshima or Nagasaki, I want to have information about the city, not about an (important) event that happened years ago.

Interestingly, a search on en.wikipedia.org for "Verdun", "Hiroshima" or "Nagasaki" gives the exact article title first. But a search for "Bir Hakeim" gives "Battle of Bir-Hakeim" first and "Bir-Hakeim" second.

This leads me to think that the "natural" (relevance) sorting of Wikipedia gives a lot of importance to the exactness of the title, but that this is not the only criterion for selecting the first result.


@danielzgtg Your example seems to be based on suggestions. Is that right? Your expected behavior also needs some clarification. How do you choose the order of the articles?

Your example with Wiktionary is interesting. As we stem the words, all the titles ('que, Que, Qué, ...) are reduced to the same stem and, as the title is only one word, Xapian has no clue about how to sort the results.
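One conceivable way to break such ties, sketched here purely as an illustration and not as anything Xapian or libzim does, is to re-sort equally scored titles by how closely they match what was typed: exact match first, then case-only differences, then accent-only differences, then everything else. The helper names and the three-level scale are assumptions made for this example.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, so that 'qué' and 'que' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def closeness(title, query):
    """Lower is closer: 0 exact, 1 case differs, 2 accents differ, 3 other."""
    if title == query:
        return 0
    if title.casefold() == query.casefold():
        return 1
    if strip_accents(title).casefold() == strip_accents(query).casefold():
        return 2
    return 3


def rerank_ties(titles, query):
    """Stable sort, so the incoming order is kept within each closeness level."""
    return sorted(titles, key=lambda title: closeness(title, query))
```

Because the sort is stable, titles that are equally close keep whatever order the search engine returned them in.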

Jaifroid commented 1 year ago

The expected order listed above is, in each case, the order given by binary search of the title order list of directory entries, augmented by testing for several common case variations. So, when entering library, a search is also done for Library (and LIBRARY). This is the algorithm used in Kiwix JS browser extension version (augmented by full-text search a few seconds later, if it is available and if we haven't already got 30 results from binary search). Kiwix JS has no concept of "suggestions".

This algorithm is highly effective for Wikipedia/Wiktionary, but *_almost useless_* for any ZIM where the alphabetical title order is meaningless (in a Stack Exchange ZIM, the title of many articles/questions will begin with "What...", and the key word will be buried somewhere in the title).

The reason it is highly effective for Wikipedia/Wiktionary is that editors of articles add lots of redirects from common search terms (often including common misspellings and common case variants) to the underlying article. So, we effectively have a "pre-weighted" and augmented alphabetical search index. It makes sense to leverage this, if possible.

danielzgtg commented 1 year ago

> "Battle of Bir-Hakeim" first and "Bir-Hakeim" second

I'm fine with Wikipedia doing that because pressing enter will go to the exact search result if found. However someone declined my suggestion for adding this at https://github.com/kiwix/kiwix-android/issues/2033#issuecomment-619817915 , so I need the exact search result at the top.

> How do you choose the order of the articles?

> The expected order listed above is

Exactly as Jaifroid described for kiwix-js.

> Stack Exchange ZIM

I never thought of that. But that should be done together with some kind of intelligent ranking feature. The ranking should pay less attention to stopwords and more attention to highly ranked questions/answers. Anyway, that would be more complicated to implement than the change described in this GitHub issue.

> reduced to the same stem and, as the title is only one word, Xapian has no clue about how to sort the results.

This behaviour from kiwix-android makes the app hard to use. Therefore, we should implement the original request in this GitHub issue.