Open kelson42 opened 1 year ago
Add a language tag to each document. The database is then considered multilingual by definition. [New indexation strategy]
@mgautierfr I'm quite interested in this proposal because it might also solve the problem of a ZIM file with articles in different languages. Could you please elaborate on how this could work, for both indexing and search?
For now, the language is a property of the whole database.
We could instead add a property to each article telling the language of the article. For most databases, it would simply move the language information (for example fra) from the database to all articles.
For multilanguage ZIM files (not really handled for now), we should have a way to tell libzim what the language of each article is (probably through an extended IndexData API).
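To make the idea of a per-article language tag concrete, here is a minimal Python sketch. It is only a toy in-memory model: `IndexEntry`, `ToyIndex`, and their methods are hypothetical names invented for illustration and are not part of libzim or its IndexData API.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    """One indexed article; `lang` is the per-article language tag (hypothetical)."""
    title: str
    text: str
    lang: str  # language code such as "fra" or "eng"


@dataclass
class ToyIndex:
    """Toy stand-in for a search database; not libzim's actual API."""
    default_lang: str
    entries: list = field(default_factory=list)

    def add(self, title, text, lang=None):
        # Monolingual archives never pass `lang`, so every article simply
        # inherits what used to be the database-wide language.
        self.entries.append(IndexEntry(title, text, lang or self.default_lang))

    def languages(self):
        # The set of languages is now derived from the articles themselves.
        return sorted({entry.lang for entry in self.entries})


index = ToyIndex(default_lang="fra")
index.add("Paris", "Paris est la capitale de la France.")
index.add("London", "London is the capital of England.", lang="eng")
print(index.languages())  # ['eng', 'fra']
```

With this shape, a monolingual archive behaves as before, while a multilanguage archive only needs to pass `lang` for the articles that differ from the default.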
At search time, I see different strategies:
<query> AND lang=fra
So it returns only French articles and we don't search for articles in other languages.
(<parsed_query_fra> AND lang=fra) OR (<parsed_query_eng> AND lang=eng) OR (<parsed_query_esp> AND lang=esp)
The list of languages can come from the user (select box and correct API) or from the languages we have in the (multi)database(s).
@mgautierfr What you propose seems appropriate to me. But:
This ticket is a follow-up of https://github.com/kiwix/libkiwix/issues/785
Current libzim search features do not work well with content in different languages, whether it is in the same ZIM or not. A search can basically apply only one language strategy (so only one stemmer and only one stopword list).
As a consequence, the multizim search/suggestion feature is limited to one language, which is annoying under certain circumstances.
We had a first short list of approaches to go forward on this at https://github.com/kiwix/libkiwix/issues/785#issuecomment-1231837992.
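As a rough illustration of the two search strategies discussed in the comments above (a single-language filter, and an OR over per-language sub-queries), here is a small Python sketch that only builds the boolean query strings. The stopword lists, function names, and the `lang=` syntax are assumptions made for the example. A real implementation would rely on the search backend's own stemmers, stopword lists, and query objects.

```python
# Toy per-language stopword lists (illustrative assumption, not real data
# from any search backend).
STOPWORDS = {
    "fra": {"le", "la", "les", "de", "des"},
    "eng": {"the", "of", "a", "an"},
}


def parse_query(query, lang):
    """Parse `query` with one language's rules (here: stopword removal only)."""
    stop = STOPWORDS.get(lang, set())
    terms = [word for word in query.lower().split() if word not in stop]
    return "(" + " AND ".join(terms) + ")"


def single_language_query(query, lang):
    """Strategy 1: parse for one language and restrict results to it."""
    return f"{parse_query(query, lang)} AND lang={lang}"


def multi_language_query(query, langs):
    """Strategy 2: one sub-query per language, combined with OR."""
    return " OR ".join(f"({single_language_query(query, lang)})" for lang in langs)


print(single_language_query("the tower of paris", "eng"))
# (tower AND paris) AND lang=eng
print(multi_language_query("the tower of paris", ["fra", "eng"]))
# ((the AND tower AND of AND paris) AND lang=fra) OR ((tower AND paris) AND lang=eng)
```

The point of the second form is that each sub-query is parsed with its own language's rules, so the same user input can yield different terms per language.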