Closed fheslouin closed 2 years ago
@fheslouin Is this a regression compared to earlier tests?
That was the first time I tried searching with multiple keywords.
@mgautierfr If confirmed, obviously top priority to fix.
http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern=Exigences returns indeed something, but only one result, and in Spanish.
A search for http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern=silently+corrected (silently corrected, a phrase that appears in the result returned previously) now returns two results. (So it seems two-word searches are working.)
On top of that, http://kiwix.kiwix-test.ideas-box.cc/search?content=ubuntudoc_fr_all_2015-12&pattern=Exigence tells us that ubuntudoc_fr_all_2015-12 has no fulltext index, which may explain why we get no results for Exigences materielles pour utiliser Ubuntu.
However, there is an issue:
http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern=Exigences&books.filter.lang=fra
returns a lot of results from French ZIM files. But http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern=Exigences
returns only one result, in Spanish.
I cannot reproduce it with my small test (wikinews_fr_all_maxi_2022-05.zim and gutenberg_es_all_10_2014.zim). Is it possible to have an archive with all the ZIM files you use and the kiwix-serve command line?
@mgautierfr Can you please open a dedicated ticket for the bug you have found? And fix it? @fheslouin Can you please provide a simplified reproduction case?
@mgautierfr here is the command we use: https://gitlab.com/bibliosansfrontieres/olip/apps/kiwix/blob/master/supervisor/app.conf#L6 along with this list of ZIM files: http://kiwix.kiwix-test.ideas-box.cc/
@kelson42 what do you mean by "simplified reproduction case" ?
@fheslouin, the ZIM files used in your configuration are pretty "old". They are not available at https://download.kiwix.org/zim. Where can I download them?
@mgautierfr indeed they are quite old, you can find them in our catalog
The problem is not the multiple keyword search, but the multiple language search.
If you are not selecting books, we search in all books.
The first book in the list is gutenberg_es_all_2018-10.zim (simply because its UUID is smaller than the others), so the search engine is initialized for a Spanish search.
If you select only French books (books.filter.lang=fra), all books are in French and the search engine is initialized for a French search.
In French, Exigence is stemmed to exigent, but in Spanish it is stemmed to exigenc. And Xapian doesn't find exigenc in the French database.
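The mismatch described above can be sketched with a toy model. This is not Xapian code: the two stem mappings are simply the ones quoted in this thread, and the index, document id and function names are made up for illustration.

```python
# Toy stemmers: these mappings are the two stems quoted in the thread,
# not the output of a real Snowball/Xapian stemmer.
TOY_STEMS = {
    "fra": {"exigence": "exigent"},
    "spa": {"exigence": "exigenc"},
}

def stem(word, lang):
    """Return the toy stem of `word` for `lang` (unknown words are kept as-is)."""
    word = word.lower()
    return TOY_STEMS[lang].get(word, word)

# A French full-text index stores only French-stemmed terms.
# "fr-doc-1" is a hypothetical document id.
french_index = {"exigent": ["fr-doc-1"]}

def search(index, word, query_lang):
    """Stem the query word with the query language, then look it up."""
    return index.get(stem(word, query_lang), [])

# Query stemmed as French: the term matches what the index stores.
hits_with_french_stemmer = search(french_index, "Exigence", "fra")
# Query stemmed as Spanish: "exigenc" is not in the French index.
hits_with_spanish_stemmer = search(french_index, "Exigence", "spa")
```

In this sketch the same word finds the French document only when the query is stemmed with the language the index was built with, which matches the behaviour seen with and without books.filter.lang=fra.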
I still don't know what to do with this however.
@mgautierfr We are talking here about stemming. Stemming is not mandatory for a search AFAIK. If the books have different languages, then we should not use stemming IMO.
It is, because we keep only the stemmed terms in the Xapian database to save space. So we cannot search for unstemmed terms, as they don't exist.
@mgautierfr Then I hardly see how we could fix the problem without running many searches (one per language) and merging the results.
Interesting (but pretty small) discussion about that : https://lists.tartarus.org/pipermail/xapian-discuss/2009-July/006942.html
Then I hardly see how we could fix the problem without running many searches (one per language) and merging the results.
The problem with merging is that we may get wrong scores for documents.
One document may have a great score in the context of one database where it should be pretty low in a multi-database context.
Think about the only matching article in the Spanish ZIM, Diccionario Ingles-Español-Tagalog: I don't know its score, but we may assume it is pretty good, as it is the only matching document in the Spanish database.
But if you search for Exigence in all the databases, you probably want that score to be really low.
I took an example with different languages because that is what we have here, but the issue is not language related. A search for a disease in a football ZIM can return a document with a very high score (as football is not about medicine), but the same document in a multi-ZIM search over football and medicine will have a very low score.
Xapian does this scoring with regard to the whole search context for us (at least I assume so). I don't know how to do it on our side.
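To illustrate why per-database scores are hard to merge, here is a minimal sketch using a plain inverse document frequency. This is an assumption made for illustration only: Xapian's actual weighting schemes are more involved, and all document counts below are invented.

```python
import math

def idf(total_docs, docs_with_term):
    """Plain inverse document frequency: rarer terms get a higher weight."""
    return math.log(total_docs / docs_with_term)

# Hypothetical numbers: the term appears in 1 of 1000 Spanish documents,
# and in 800 of 5000 French documents.
spanish_only = idf(total_docs=1000, docs_with_term=1)

# Same term, but statistics computed over both databases together.
combined = idf(total_docs=1000 + 5000, docs_with_term=1 + 800)
```

With these made-up numbers the term weight is much higher when computed on the Spanish database alone than on the combined collection, so scores taken from separate searches are not directly comparable.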
I think the correct way to fix it is to associate the language of the query with the query itself and not use the language of the database(s). They are two different things. If the user is searching in French, we must stem and apply stop words using French, whatever the language of the ZIM file.
@mgautierfr If I understand you properly, this is more or less how it works today, and it does not give the expected result in multizim+multilanguage context. It is not satisfying to me.
No, for now, we are using the language of the (first) database. What I propose is to NOT use the language of the database but the language of the user, whatever it is. How we define this language is still open.
In our case, if we know that the user is searching for Exigence in French, then we would get correct results (we would have no Spanish results, but that would be "normal").
@mgautierfr I see no user-friendly way to know this language. On top of this, there are many words which are the same in many languages. I don't see how this approach could deliver what the user wants: articles with occurrences of the given word(s), whatever the languages of the selected ZIMs.
I propose that we add a query parameter queryLanguage=<iso3language>.
This parameter specifies how we must parse the query and is different from books.filter.lang, which specifies how we select the books.
If the parameter is not provided, it is defaulted this way:
- userlang parameter or Accept-Language header
It should provide a kind of sane default queryLanguage and still allow searches in different languages.
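The fallback described above could look roughly like the sketch below. It is only an illustration of the proposal, not kiwix-serve code: the function name, the dict-based request model and the 2-letter to 3-letter mapping are assumptions, and it simply takes the first Accept-Language entry without looking at q-values.

```python
# Hypothetical mapping from 2-letter to 3-letter codes, limited to
# the languages mentioned in this thread.
ISO2_TO_ISO3 = {"fr": "fra", "es": "spa", "en": "eng"}

def resolve_query_language(params, headers):
    """Pick the language used to parse the query.

    Order: explicit queryLanguage, then userlang, then Accept-Language.
    Returns None when nothing usable is provided.
    """
    if params.get("queryLanguage"):
        return params["queryLanguage"]
    if params.get("userlang"):
        tag = params["userlang"]
    else:
        # e.g. "fr-FR,fr;q=0.9,en;q=0.8" -> "fr-FR"
        accept = headers.get("Accept-Language", "")
        tag = accept.split(",")[0].split(";")[0].strip()
    primary = tag.split("-")[0].lower()
    return ISO2_TO_ISO3.get(primary, primary) or None
```

For example, a request without queryLanguage but with the header `Accept-Language: fr-FR,fr;q=0.9` would be parsed as a French query, independently of which books are selected.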
@mgautierfr Your approach dismisses the multilanguage search which is the goal I aim for. In addition your scenario seems to link the problem to a multizim search, but actually there is no problem with multizim search as long as they are in the same language. The problem here is multilanguage search: one search pattern with ZIM in different languages.
I have asked the Xapian team for their opinion, and here is the response I got: https://lists.xapian.org/pipermail/xapian-discuss/2022-August/009953.html.
To me the questions are now:
- OP_OR seems interesting, would we have any problems doing so?
- Lfr seems to me to be a hack
We are storing both stemmed and unstemmed words. And we stem all words in the queryParser. Indeed, we may try not to stem words (or search for both stemmed and unstemmed words). It should work, I will try.
The Lfr is not so much a hack. And we will probably want it at some point. It is probably the best way to handle different languages in the same ZIM.
Merging two result sets is indeed tricky. It is nice to know that a public API will come in the next Xapian release. It will allow us some nice improvements, independently of multiple languages. For now we need to open a Xapian database for each "context". If we search on A and then on A and B, we need to open A twice. If we could merge result sets, we could improve our codebase to open A only once (at a time).
My bad, we are not storing unstemmed words in the fulltext database (we do it in the title database).
The solutions for now are:
- OP_OR (but we may get false results, as a stemmed word in language A may match a different word in language B; I don't know how serious this is)
@mgautierfr @veloman-yunkan I’m sorry to bring back this ticket in 12.0.0 but we noticed that it was a blocker to finish a contracted project. That said, I don’t expect us to fully fix the problem here, but we should fix the part which is the most obvious.
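The OP_OR option mentioned among the solutions can be sketched with a toy model: stem the query word once per language and accept a match on any of the resulting terms. This is not Xapian code; the stems are the two quoted earlier in the thread, and the index and document ids are hypothetical.

```python
# Toy stems quoted in the thread; not produced by a real stemmer.
TOY_STEMS = {
    "fra": {"exigence": "exigent"},
    "spa": {"exigence": "exigenc"},
}

def stem(word, lang):
    word = word.lower()
    return TOY_STEMS[lang].get(word, word)

def or_terms(word, langs):
    """One stemmed term per language; the query matches if ANY term matches."""
    return {stem(word, lang) for lang in langs}

# Hypothetical combined index over a French and a Spanish database.
combined_index = {"exigent": ["fr-doc-1"], "exigenc": ["es-doc-1"]}

def search_any(index, terms):
    """Union of the postings of all terms (an OR query)."""
    results = []
    for term in sorted(terms):
        results.extend(index.get(term, []))
    return results

terms = or_terms("Exigence", ["fra", "spa"])
hits = search_any(combined_index, terms)
```

The drawback raised in the thread still applies to this sketch: a term produced by the stemmer of one language may happen to match an unrelated word indexed in another language.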
To me, either all the ZIM files have the same language and all is good: we can continue to return results as we do now, without requesting a language. Otherwise, we need to require a mandatory language parameter and return results only for the ZIM files in this language. If it is missing, then return an error.
Once this is done, we will have to decide whether there is a reasonable way to allow, in the future, multi-ZIM searches on ZIM files in different languages or even ZIM files with multilanguage content.
@mgautierfr I can work on this
I agree with @kelson42's solution. If we are in a single-language search (so also a single-ZIM search by definition, until we store different languages in the same ZIM :) ), do the search in this language. Else "ask" the user for a specific language.
There is still the open question of a single-language (A) search with an explicit search language (B) provided by the user. Should we do the search in A or B?
There is still the open question of a single language (A) search with an explicit search language (B) provided by the user. Should we do the search in A or B ?
This should always return zero results, because there is no ZIM with language B (we should run the search request only on ZIM files with the same language as the one requested).
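The rule agreed on above can be summarised in a short sketch. The `Book` type, the function name and the error class are hypothetical and only model the decision, not the libkiwix implementation.

```python
from dataclasses import dataclass

@dataclass
class Book:
    name: str
    lang: str  # 3-letter language code, e.g. "fra"

class MissingLanguageError(ValueError):
    """Raised when books in several languages are searched without a language."""

def books_to_search(books, query_lang=None):
    """Return the books the search should actually run on."""
    langs = {book.lang for book in books}
    if query_lang is None:
        if len(langs) <= 1:
            # Single-language selection: search as today.
            return list(books)
        raise MissingLanguageError("a language is required for multi-language searches")
    # Explicit language: keep only the books in that language.
    return [book for book in books if book.lang == query_lang]

fr = Book("wikinews_fr", "fra")
es = Book("gutenberg_es", "spa")
```

With this rule, `books_to_search([fr], "spa")` is empty, which corresponds to the "zero results" answer given for the A/B question.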
@Bastien-BSF If that is good for you, we will proceed this way. The fundamental fix regarding multi-ZIM and multilanguage search would be done later; it's not OPDS related and is a scenario which has never been supported properly.
That's good for me. Thanks.
@veloman-yunkan Could you please implement what has been decided in https://github.com/kiwix/libkiwix/issues/785#issuecomment-1272571526? I have reassigned the ticket to you following a discussion with @mgautierfr and with his agreement. That way he can focus on finishing what is needed for the next release of libzim.
I will now create a new ticket in openzim/libzim regarding the complex problem around multi-ZIM search with multiple languages, and then close this one.
Closed in favour of https://github.com/openzim/libzim/issues/734
Searching for multiple keywords gives empty results, see test cases below.
Example :
However, a single keyword gives results: http://kiwix.kiwix-test.ideas-box.cc/search?format=html&pattern=Exigences