kiwix / kiwix-tools

Command line Kiwix tools: kiwix-serve, kiwix-manage, ...
https://download.kiwix.org/release/kiwix-tools/
GNU General Public License v3.0

& symbol in search box still not redirecting to article #588

nijazm closed this issue 1 year ago

nijazm commented 1 year ago

Okay, it looks like a fix was attempted, but it did not fully work. I just tested yesterday's nightly builds of Kiwix Desktop and kiwix-tools on Windows 11. Entering the & symbol now shows only the full-text search autocomplete entry (the search box shows containing '&'), and when I click on it, it says No results were found for "&". The same happens in kiwix-serve (web browsers) and in the Kiwix Desktop app. Tested with English Wikipedia 2021-12. The only difference is that titles containing & now redirect properly (previously they did not), e.g. Me, Myself & Irene.

kelson42 commented 1 year ago

I don't understand this bug report. Can someone rephrase it, please? See https://github.com/kiwix/overview/blob/master/REPORT_BUG.md

veloman-yunkan commented 1 year ago

This ticket is a follow-up to #587: one bug was fixed by kiwix/libkiwix#859, which exposed another, unrelated problem.

The essence of the problem is as follows.

English Wikipedia contains an article titled & (which redirects to Ampersand).

A user exploring the wikipedia_en_all ZIM file via kiwix-serve expects that entering the & symbol in the ZIM viewer search box will produce a suggestion linking to that article. Instead, the only suggestion offered is to perform a full-text search for the text &, which doesn't produce any results either.

veloman-yunkan commented 1 year ago

As hypothesized in https://github.com/kiwix/kiwix-tools/issues/587#issuecomment-1354495921, the problem is that the ampersand symbol is treated as punctuation and is simply discarded during the creation of the title index as well as when running suggestion search on it.
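The effect can be reproduced with a toy tokenizer. This is only a minimal sketch, not the actual libzim or Xapian code: the `index_terms` helper and its rule of keeping only runs of ASCII letters and digits are assumptions made for illustration.

```python
import re

def index_terms(title: str) -> list[str]:
    """Hypothetical stand-in for the indexer's term extraction:
    lowercase the title and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", title.lower())

# A title mixing words and punctuation still yields usable terms.
print(index_terms("Me, Myself & Irene"))  # ['me', 'myself', 'irene']

# A title made only of punctuation yields no terms at all, so neither
# the indexed title nor the typed query has anything to match on.
print(index_terms("&"))  # []
```

Under this assumption, the same loss happens on both sides: the title & contributes nothing to the title index, and the query & contributes nothing to the suggestion search.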

Ideally, while building the title index, we should handle article names consisting of a single symbol or word in a special way, letting those terms go into the title index as is, regardless of any rules that drop punctuation and stopwords. We will also have to enhance the suggestion search so that it accounts for such an addition to the title index.

kelson42 commented 1 year ago

@veloman-yunkan Thank you for the explanation and analysis. Do you know exactly which part of the code removes this? Is that related to the stop words? Your proposal seems worth considering IMO. I believe this special handling might be pretty independent of any particular special character and would instead affect any really short title.

veloman-yunkan commented 1 year ago

> Do you know exactly which part of the code removes this?

@kelson42 No, I don't.

kelson42 commented 1 year ago

@mgautierfr If a title contains only stop words or punctuation, we should keep them IMO. Does that make sense?

mgautierfr commented 1 year ago

I would say that we try to clean the query (or the title to index), and if the cleaned query (or title) is empty, we use the original string instead of the cleaned one. We don't care what the original string is composed of.
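The fallback described above could look roughly like the sketch below. It is not the libzim implementation: `clean`, `normalize`, and the tiny stopword list are hypothetical names and rules chosen for illustration. The one point it demonstrates is the rule itself: if cleaning leaves nothing, keep the original string.

```python
import re

# Assumed, deliberately tiny stopword list for the example.
STOPWORDS = {"a", "an", "the", "of"}

def clean(text: str) -> str:
    """Hypothetical cleaning step: drop punctuation and stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(w for w in words if w not in STOPWORDS)

def normalize(text: str) -> str:
    """Return the cleaned text, or the original string (stripped of
    surrounding whitespace) when cleaning leaves nothing."""
    cleaned = clean(text)
    return cleaned if cleaned else text.strip()

for s in ["Me, Myself & Irene", "&", "The The"]:
    print(repr(s), "->", repr(normalize(s)))
```

Applying the same `normalize` to both the title at indexing time and the query at search time would keep the two sides consistent, so a query of & could match the title &. The check looks only at whether the cleaned result is empty, not at what the original string contained.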

kelson42 commented 1 year ago

@mgautierfr Should we move this ticket to openzim/libzim?

mgautierfr commented 1 year ago

yes