openzim / libzim

Reference implementation of the ZIM specification
https://download.openzim.org/release/libzim/
GNU General Public License v2.0

Fulltext and suggestion handling of multiple languages #734

Open kelson42 opened 1 year ago

kelson42 commented 1 year ago

This ticket is a follow-up to https://github.com/kiwix/libkiwix/issues/785

The current libzim search features do not work well with content in different languages, whether or not that content is in the same ZIM. A search can apply only one language strategy at a time (so only one stemmer and only one stopword list).

As a consequence, the multizim search/suggestion feature is limited to one language, which is annoying under certain circumstances.

We made a first short list of possible approaches at https://github.com/kiwix/libkiwix/issues/785#issuecomment-1231837992.

kelson42 commented 1 year ago

Add a language tag to each document. Then the databases are considered multilanguage by definition. [New indexation strategy]

@mgautierfr I’m quite interested in this proposal because it might also solve the problem of a ZIM file with articles in different languages. Could you please elaborate on how this would work, for both indexation and search?

mgautierfr commented 1 year ago

For now, the language is a property of the whole database. We could instead add a property to each article giving the language of that article. For most databases, this would simply move the language information (e.g. `fra`) from the database to every article. For multilanguage ZIM files (not really handled for now), we would need a way to tell libzim the language of each article (probably by extending the IndexData API).
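To make the per-article idea concrete, here is a minimal sketch of how a language tag with a database-wide fallback could look. Every name in it (`Article`, `Database`, `effectiveLanguage`, `groupByLanguage`) is hypothetical and chosen for illustration; none of it is existing libzim or IndexData API, and it does no real indexing or stemming.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration only; these are NOT libzim API.
struct Article {
    std::string path;
    std::string language;  // per-article tag such as "fra"; empty means "not set"
};

struct Database {
    std::string defaultLanguage;  // today's database-wide language
    std::vector<Article> articles;
};

// Return the article's own language, falling back to the database-wide
// value when the article carries no tag (the common single-language case).
std::string effectiveLanguage(const Database& db, const Article& article) {
    return article.language.empty() ? db.defaultLanguage : article.language;
}

// Group article paths by effective language, so that each group could later
// be handled with its own stemmer and stopword list.
std::map<std::string, std::vector<std::string>> groupByLanguage(const Database& db) {
    std::map<std::string, std::vector<std::string>> groups;
    for (const auto& article : db.articles) {
        groups[effectiveLanguage(db, article)].push_back(article.path);
    }
    return groups;
}
```

With a database whose default is `fra` and a single article tagged `eng`, `groupByLanguage` would return two groups: the untagged articles under `fra` and the tagged one under `eng`. How such a tag would actually be stored in the index is left open here.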

For searching, I see different strategies:

kelson42 commented 10 months ago

@mgautierfr What you propose seems appropriate to me. But: