Multizimsearch : search with multiple language databases

fheslouin commented 2 years ago

Searching for multiple keywords gives empty results; see the test cases below.

Example :

http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern=Exigences+materielles+pour+utiliser+Ubuntu gives empty results
http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern=Exigences materielles pour utiliser Ubuntu gives empty results
http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern=Exigences%20materielles%20pour%20utiliser%20Ubuntu even with a percent-encoded URL, the result is empty

However, a single keyword gives results: http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern=Exigences
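For reference, the `+` and `%20` forms of the URL should decode to the same pattern under a standard query-string parser, so the encoding itself is unlikely to be the cause. A minimal check with Python's standard library (this only illustrates generic URL decoding; it makes no claim about how kiwix-serve parses requests):

```python
from urllib.parse import parse_qs, quote, quote_plus, urlsplit

BASE = "http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern="
PATTERN = "Exigences materielles pour utiliser Ubuntu"

urls = [
    BASE + quote_plus(PATTERN),  # spaces encoded as '+'
    BASE + quote(PATTERN),       # spaces encoded as '%20'
]


def decoded_pattern(url):
    """Return the decoded value of the 'pattern' query parameter."""
    query = urlsplit(url).query
    return parse_qs(query)["pattern"][0]


# Both encodings decode back to the same multi-word pattern.
decoded = [decoded_pattern(u) for u in urls]
```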

kelson42 commented 2 years ago

@fheslouin Is this a regression compared to earlier tests?

fheslouin commented 2 years ago

That was the first time I tried multiple keywords.

kelson42 commented 2 years ago

@mgautierfr If confirmed, this is obviously top priority to fix.

mgautierfr commented 2 years ago

http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern=Exigences does indeed return something, but only one result, and it is in Spanish.

A search for http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern=silently+corrected ("silently corrected", a phrase that appears in the result returned previously) now returns two results. So it seems that two-word searches are working.

On top of that, http://kiwix.kiwix-test.ideas-box.cc/search?content=ubuntudoc_fr_all_2015-12&pattern=Exigence tells us that ubuntudoc_fr_all_2015-12 has no fulltext index, which may explain why we have no results for "Exigences materielles pour utiliser Ubuntu".

However, there is an issue: http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern=Exigences&books.filter.lang=fra returns a lot of results from French ZIM files, but http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern=Exigences returns only one result, in Spanish.

I cannot reproduce it with my small test (wikinews_fr_all_maxi_2022-05.zim and gutenberg_es_all_10_2014.zim). Is it possible to get an archive with all the ZIM files you use and the kiwix-serve command line?

kelson42 commented 2 years ago

@mgautierfr Can you please open a dedicated ticket for the bug you have found? And fix it? @fheslouin Can you please provide a simplified reproduction case?

fheslouin commented 2 years ago

@mgautierfr here is the command we use: https://gitlab.com/bibliosansfrontieres/olip/apps/kiwix/blob/master/supervisor/app.conf#L6 along with this list of ZIM files: http://kiwix.kiwix-test.ideas-box.cc/

@kelson42 what do you mean by "simplified reproduction case" ?

mgautierfr commented 2 years ago

@fheslouin, the ZIM files used in your configuration are pretty "old". They are not available at https://download.kiwix.org/zim. Where can I download them?

fheslouin commented 2 years ago

@mgautierfr indeed they are quite old, you can find them in our catalog

mgautierfr commented 2 years ago

The problem is not the multiple-keyword search, but the multiple-language search. If you do not select any books, we search in all books. The first book in the list is gutenberg_es_all_2018-10.zim (simply because its uuid is smaller than the others), so the search engine is initialized for a Spanish search. If you select only French books (books.filter.lang=fra), all books are in French and the search engine is initialized for a French search.

In French, Exigence is stemmed to exigent, but in Spanish it is stemmed to exigenc. And Xapian doesn't find exigenc in the French database.
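The mismatch can be illustrated with a toy model. This is only a sketch: `TOY_STEMS`, `toy_stem`, `french_index` and the document ids are hypothetical stand-ins, not Xapian or libkiwix code, and the table holds only the two stems quoted above.

```python
# Hypothetical stem table: only the two stems quoted in this thread.
TOY_STEMS = {
    "fra": {"exigence": "exigent"},
    "spa": {"exigence": "exigenc"},
}


def toy_stem(word, lang):
    """Look up a stem in the toy table; fall back to the lowercased word."""
    word = word.lower()
    return TOY_STEMS.get(lang, {}).get(word, word)


# A French full-text index stores only terms stemmed with the French rules.
french_index = {"exigent": ["doc_fr_1", "doc_fr_2"]}


def search(index, word, query_lang):
    """Stem the query word with query_lang, then look it up in the index."""
    return index.get(toy_stem(word, query_lang), [])


# Same word, same index: only the stemmer language differs.
hits_french_stemmer = search(french_index, "Exigence", "fra")
hits_spanish_stemmer = search(french_index, "Exigence", "spa")
```

With the French stemmer the lookup finds the indexed term; with the Spanish stemmer the query term is one the French index never stored, so the lookup comes back empty.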

However, I still don't know what to do about this.

kelson42 commented 2 years ago

@mgautierfr We are talking here about stemming. Stemming is not mandatory for a search AFAIK. If the books have different languages, then we should not use stemming IMO.

mgautierfr commented 2 years ago

It is mandatory for us, because we keep only the stemmed terms in the Xapian database to save space. So we cannot search for unstemmed terms, as they don't exist in the index.

kelson42 commented 2 years ago

@mgautierfr Then I hardly see how we could fix the problem without running many searches (one per language) and merging the results.

mgautierfr commented 2 years ago

Interesting (but pretty small) discussion about that : https://lists.tartarus.org/pipermail/xapian-discuss/2009-July/006942.html

mgautierfr commented 2 years ago

Then I hardly see how we could fix the problem without running many searches (one per language) and merging the results.

The problem with merging is that we may get wrong scores for documents. A document may have a great score in the context of one database where it should have a pretty low one in a multi-database context. Think about the only matching article in the Spanish ZIM: Diccionario Ingles-Español-Tagalog. I don't know its score, but we may assume it is pretty good, as it is the only matching document in the Spanish database. But if you search for Exigence in all the databases, you probably want that score to be really low. I took an example with different languages because that is what we have here, but the issue is not language related. A search for a disease in a football ZIM can return a document with a very high score (as football is not about medicine), but the same document in a multizim search over football and medicine will have a very small score.

Xapian does this scoring with regard to the whole search context for us (at least I assume so). I don't know how to do it on our side.
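To make the scoring concern concrete, here is a rough sketch using a generic smoothed inverse document frequency (the kind of term found in BM25-style weighting). It is not Xapian's actual weighting code, and the collection sizes are made up for illustration.

```python
import math


def idf(total_docs, docs_with_term):
    """Generic smoothed inverse document frequency (BM25-style)."""
    return math.log(
        (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0
    )


# Made-up numbers: a disease name is rare in a football collection
# and common in a medicine collection.
football = {"total": 10_000, "with_term": 1}
medicine = {"total": 10_000, "with_term": 4_000}

# Term weight when the football collection is searched on its own.
idf_football_alone = idf(football["total"], football["with_term"])

# Term weight when both collections form one search context.
idf_combined = idf(
    football["total"] + medicine["total"],
    football["with_term"] + medicine["with_term"],
)
```

Under these assumptions the term weighs far more in the football-only context than in the combined one, so scores computed per database are not directly comparable once result sets are merged.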

mgautierfr commented 2 years ago

I think the correct way to fix it is to associate the language of the query with the query itself and not use the language of the database(s). They are two different things. If the user is searching in French, we must stem and apply stop words using the French language, whatever the language of the ZIM file.

kelson42 commented 2 years ago

@mgautierfr If I understand you properly, this is more or less how it works today, and it does not give the expected result in multizim+multilanguage context. It is not satisfying to me.

mgautierfr commented 2 years ago

No, for now we are using the language of the (first) database. What I propose is to NOT use the language of the database but to use "whatever the user's language is". How we define this language is still open:

Use the same language as the one used for translation?
Add an explicit UI option to let users specify in which language they want to search?
Other ?

In our case, if we know that the user is searching for Exigence in French, then we would get correct results. (We would get no Spanish results, but that would be "normal".)

kelson42 commented 2 years ago

@mgautierfr I see no user-friendly way to know this language. On top of this, there are many words which are the same in many languages. I don't see how this approach could deliver what the user wants: articles with occurrences of the given word(s), whatever the languages of the selected ZIM files.

mgautierfr commented 2 years ago

I propose that we add a query parameter queryLanguage=<iso3language>. This parameter specifies how we must parse the query and is different from books.filter.lang, which specifies how we select the books. If the parameter is not provided, it defaults as follows:

In case of a single-ZIM search, use the language of the database (so it behaves as now).
In case of a multi-ZIM search, use the user language (the same one we use for the translation), i.e. the userlang parameter or the Accept-Language header.

This should provide a reasonably sane default for queryLanguage and still allow searches in different languages.
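The defaulting rule proposed above can be sketched as follows. The function `resolve_query_language` and its argument names are hypothetical, chosen for illustration; this is not existing libkiwix code.

```python
def resolve_query_language(query_language, book_languages, user_language):
    """Pick the language used to parse (stem) the query.

    query_language: explicit queryLanguage value, or None if absent.
    book_languages: list with the language of each selected ZIM file.
    user_language:  language already used for UI translation
                    (userlang parameter or Accept-Language header).
    """
    if query_language:
        # An explicit parameter always wins.
        return query_language
    if len(book_languages) == 1:
        # Single-ZIM search: behave as today, use the database language.
        return book_languages[0]
    # Multi-ZIM search: fall back to the user's language.
    return user_language
```

For example, `resolve_query_language(None, ["fra", "spa"], "fra")` would select French for a user whose interface language is French, independently of which book happens to come first in the list.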

kelson42 commented 2 years ago

@mgautierfr Your approach dismisses the multilanguage search which is the goal I aim for. In addition your scenario seems to link the problem to a multizim search, but actually there is no problem with multizim search as long as they are in the same language. The problem here is multilanguage search: one search pattern with ZIM in different languages.

I asked the Xapian team for their opinion, and here is the response I got: https://lists.xapian.org/pipermail/xapian-discuss/2022-August/009953.html.

To me the questions are now:

Do we index the unstemmed words as well? If "yes", we could try to stop stemming in such a scenario; if "no", we could try to index them (maybe the index would not be a lot larger?).
The second proposal with OP_OR seems interesting, would we have any problems doing so?
The Lfr seems to me to be a hack
To me the 1.5 solution seems to be the good one (with a public API)... we could disallow multilanguage search for the moment and implement this once 1.5 is released.

mgautierfr commented 2 years ago

We are storing both stemmed and unstemmed words, and we stem all words in the queryParser. Indeed, we may try to not stem words (or search for both stemmed and unstemmed words). It should work; I will try.

The Lfr approach is not so much of a hack, and we will probably want it at some point. It is probably the best way to handle different languages in the same ZIM.

Merging two result sets is indeed tricky. It is nice to know that a public API will come in the next Xapian release. It will allow us some nice improvements, independently of multilanguage search. For now we need to open a Xapian database for each "context": if we search on A and then on A and B, we need to open A twice. If we could merge result sets, we could improve our codebase to open A only once (at a time).

mgautierfr commented 2 years ago

My bad, we are not storing unstemmed words in the fulltext database (we do it in the title database).

The solutions for now are:

Disallow multi-language search.
Allow search on multiple ZIM files with different languages, but do the search for only one language.
Build a query that uses the stemmed words in the different languages combined with OP_OR (but we may get false results, as a stemmed word in language A may return results for a different word in language B; I don't know how serious this is).
Wait for Xapian 1.5 and merge the different "single zim searches" into a multi-search result. (This would work for multilanguage too.)
Add a language tag to each document. Then the databases are considered multilanguage by definition. [New indexation strategy]
Store unstemmed words in the Xapian database and search for unstemmed words too. [New indexation strategy]
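
The OP_OR option in the list above (stems from several languages combined into one query) could look roughly like this toy sketch. `TOY_STEMS`, `or_terms`, `search_any` and the document ids are hypothetical; a real implementation would use per-language stemmers and a Xapian query rather than dictionaries.

```python
# Hypothetical stem table, standing in for real per-language stemmers.
TOY_STEMS = {
    "fra": {"exigence": "exigent"},
    "spa": {"exigence": "exigenc"},
}


def or_terms(word, languages):
    """Return the stems of `word`, one per language, to be OR-ed together."""
    word = word.lower()
    return {TOY_STEMS.get(lang, {}).get(word, word) for lang in languages}


def search_any(index, terms):
    """Return the documents matching at least one term (an OR query)."""
    hits = set()
    for term in terms:
        hits.update(index.get(term, ()))
    return hits


# Toy combined index over one French and one Spanish database.
combined_index = {
    "exigent": {"doc_fr_1"},
    "exigenc": {"doc_es_1"},
}

terms = or_terms("Exigence", ["fra", "spa"])
hits = search_any(combined_index, terms)
```

The false-result risk mentioned in that option shows up here as well: any document indexed under one of the OR-ed stems matches, even if that stem came from an unrelated word in its own language.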

kelson42 commented 2 years ago

@mgautierfr @veloman-yunkan I’m sorry to bring this ticket back into 12.0.0, but we noticed that it is a blocker for finishing a contracted project. That said, I don’t expect us to fully fix the problem here, but we should fix the most obvious part.

To me, either all the ZIM files have the same language and all is good: we can continue to return results as we do now without requesting a language. Otherwise, we need to require a mandatory language parameter and return results only for the ZIM files in this language. If it is missing, return an error.

Once this is done, we will have to decide whether there is a reasonable way to allow, in the future, multizim searches on ZIM files in different languages or even on ZIM files with multilanguage content.

veloman-yunkan commented 2 years ago

@mgautierfr I can work on this

mgautierfr commented 2 years ago

I agree with @kelson42's solution. If we are in a single-language search (so also a single-ZIM search by definition, until we store different languages in the same ZIM :) ), do the search in this language. Otherwise, "ask" the user for a specific language. There is still the open question of a single-language (A) search with an explicit search language (B) provided by the user. Should we do the search in A or B?

kelson42 commented 2 years ago

There is still the open question of a single language (A) search with an explicit search language (B) provided by the user. Should we do the search in A or B ?

This should always return zero results, because there is no ZIM with language B (we should run the search request only on ZIM files with the same language as the one requested).
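
The rule agreed on in this thread can be sketched as below. `LanguageRequired`, `books_to_search` and the book names in the usage note are hypothetical, introduced only for illustration; this is not the actual libkiwix implementation.

```python
class LanguageRequired(Exception):
    """Raised when the selected ZIM files mix languages and none is requested."""


def books_to_search(books, requested_lang=None):
    """Return the names of the books the search should run on.

    books: dict mapping a book name to its language code.
    requested_lang: explicit language parameter, or None if absent.
    """
    languages = set(books.values())
    if requested_lang is None:
        if len(languages) <= 1:
            # All books share one language: search them all, as today.
            return sorted(books)
        # Mixed languages and no language parameter: report an error.
        raise LanguageRequired("books have different languages")
    # Search only the books whose language matches the requested one.
    return sorted(name for name, lang in books.items() if lang == requested_lang)
```

With `{"book_fr_a": "fra", "book_es_a": "spa"}` and `requested_lang="fra"`, only `book_fr_a` is searched; with French-only books and `requested_lang="spa"`, the list is empty, which corresponds to the "zero results" case described above.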

kelson42 commented 2 years ago

@Bastien-BSF If that is good for you, we will proceed this way. The fundamental fix for multizim and multilanguage search will be done later; it's not OPDS-related and is a scenario which has never been supported properly.

Bastien-BSF commented 2 years ago

That's good for me. Thanks.

kelson42 commented 2 years ago

@veloman-yunkan Could you please implement what has been decided in https://github.com/kiwix/libkiwix/issues/785#issuecomment-1272571526? I have reassigned the ticket to you following a discussion with @mgautierfr and with his agreement. That way he can focus on finishing what is needed for the next release of libzim.

kelson42 commented 2 years ago

I will now create a new ticket in openzim/libzim regarding the complex problem around multizim search with multiple languages... and then close this one.

kelson42 commented 2 years ago

Closed in favour of https://github.com/openzim/libzim/issues/734
