Multi language support?

howesteve commented 1 year ago

Hi. Thanks for this project.

I need to support both English and Portuguese documents in my project. Is there a way to achieve this? I see that language-specific search support is limited by design, and I'm OK with that. I am willing to add tokenization/stemming/stopwords/diacritics/etc. manually; that is no problem. However, it seems I can only specify a tokenizer globally on the MiniSearch instance, and not in the MiniSearch.add(doc) call, so I cannot have a per-document language and have to stick to the one language that was assigned at instance creation. Is that correct? It also seems that MiniSearch.searchOptions.tokenize(term) only receives the term being analyzed, with no reference to the current document. So I cannot inspect the current document to find out what language it is written in, in either the doc() or the tokenize() functions.

Thanks.

lucaong commented 1 year ago

Hi @howesteve, you are right: the tokenizer currently does not receive a reference to the document.

Consider, though, that the tokenizer and term processor are also used upon search. If several different (possibly incompatible) tokenizers and/or term processors were used upon indexing, search would not know which one to use.
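To make the incompatibility concrete, here is a minimal sketch using a toy inverted index (not MiniSearch itself). The `buildIndex` and `lookup` helpers and the two per-language processors are hypothetical and exist only for illustration: the Portuguese processor strips diacritics, the English one does not.

```javascript
// Toy inverted index (NOT MiniSearch): term -> set of document ids.
// Each document is processed with the term processor of its own language.
function buildIndex(docs, processors) {
  const index = new Map();
  for (const doc of docs) {
    const processTerm = processors[doc.lang];
    for (const token of doc.text.split(/\s+/)) {
      const term = processTerm(token);
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(doc.id);
    }
  }
  return index;
}

// Hypothetical per-language term processors.
const processors = {
  en: (term) => term.toLowerCase(),
  pt: (term) => term.toLowerCase().normalize('NFD').replace(/\p{M}/gu, ''),
};

const toyIndex = buildIndex(
  [
    { id: 1, lang: 'en', text: 'résumé tips' },
    { id: 2, lang: 'pt', text: 'ação rápida' },
  ],
  processors
);

// A query carries no document, so a single processor has to be picked for it.
function lookup(query, lang) {
  const term = processors[lang](query);
  return [...(toyIndex.get(term) ?? [])];
}

// lookup('ação', 'pt') finds document 2, but lookup('ação', 'en') finds nothing.
// lookup('résumé', 'en') finds document 1, but lookup('résumé', 'pt') finds nothing.
```

Neither processor alone retrieves both documents, which is the problem search would face if indexing used per-document processing.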

This is why it’s usually better to do one of the following:

Either use a single tokenization/processing that works for both languages. I find, for example, that stemming usually hurts more than it helps, and I rely on fuzzy matching instead on most projects.
Alternatively, use different indexes for each language, and issue each search to all of the applicable ones. You might have to merge and re-sort the results though.
Finally, in some cases you could use different fields for different languages, like title_en vs. title_pt. This allows you to use different tokenizers, but can get cumbersome in some cases.
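As a rough sketch of the second and third options: `mergeResults` combines result lists coming from several per-language indexes, and `perFieldOptions` picks a tokenizer from the field name that MiniSearch passes to `tokenize` as its second argument. The helper names, the two tokenizers, and the field names are hypothetical. Results are assumed to have the `{ id, score }` shape that MiniSearch returns.

```javascript
// Option 2: one index per language, then merge the per-index result lists.
// Each result is assumed to look like { id, score, ... }.
function mergeResults(resultLists) {
  const best = new Map();
  for (const results of resultLists) {
    for (const result of results) {
      const seen = best.get(result.id);
      // If a document shows up in several indexes, keep its highest-scoring hit.
      if (seen === undefined || result.score > seen.score) {
        best.set(result.id, result);
      }
    }
  }
  // Re-sort the merged list by descending score.
  return [...best.values()].sort((a, b) => b.score - a.score);
}

// Option 3: one field per language, with the tokenizer chosen by field name.
// Both tokenizers are illustrative placeholders, not recommended rules.
const tokenizeEn = (text) => text.split(/[^\p{L}\p{N}]+/u).filter(Boolean);
// Assumption for illustration: keep hyphenated Portuguese compounds together.
const tokenizePt = (text) => text.split(/[^\p{L}\p{N}-]+/u).filter(Boolean);

// Options that could be passed to `new MiniSearch(...)`.
const perFieldOptions = {
  fields: ['title_en', 'title_pt'],
  tokenize: (text, fieldName) =>
    fieldName === 'title_pt' ? tokenizePt(text) : tokenizeEn(text),
};
```

Two caveats: scores from separate indexes are computed against different collections, so comparing them directly is only an approximation. Also, a query is not tied to a field, so when `tokenize` is called without a field name this sketch falls back to the English tokenizer.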

What kind of language-specific processing are you applying, and how would you implement it if the tokenize and processTerm callbacks had a reference to the document? I am happy to consider whether this can be made easier, but I am interested in how you would solve incompatibilities of the tokenizer or processor upon search.

lucaong commented 1 year ago

Just for clarity, the first approach that I suggest in my comment above consists of the following:

Do not apply any stemming: if needed, use fuzzy search instead
Do not remove stop words: the BM25 scoring will automatically “discount” overly frequent non-descriptive words
Normalize diacritics, casing, etc. in the same way for all languages (this is usually fine, as long as there is a 1-to-1 mapping for each diacritic)

If you do so, you will be able to use the same options for all languages. This is the language-agnostic strategy that I apply on several projects, usually with very good results. Of course, it might depend on the specifics of your project, but I would encourage you to try this out first, and only add language-specific settings such as stemming and stop words if really necessary (MiniSearch is designed to generally handle things well without stop words and stemming).
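A minimal sketch of that language-agnostic setup, assuming hypothetical field names and an arbitrary fuzziness value: `normalizeTerm` lowercases each term and strips diacritics via Unicode decomposition, with no stemming and no stop-word list.

```javascript
// Language-agnostic term processing: same rules for English and Portuguese.
function normalizeTerm(term) {
  return term
    .toLowerCase()
    .normalize('NFD')          // split base letters from combining marks
    .replace(/\p{M}/gu, '');   // drop the combining marks (diacritics)
}

// Options that could be passed to `new MiniSearch(...)`.
// Field names and the fuzzy value are assumptions for illustration.
const agnosticOptions = {
  fields: ['title', 'text'],
  processTerm: normalizeTerm,  // no stemming, no stop-word removal
  searchOptions: { fuzzy: 0.2, prefix: true },
};

// normalizeTerm('Ação') returns 'acao'
// normalizeTerm('Café') returns 'cafe'
```

Because the same `processTerm` applies to every document and, unless overridden in the search options, to queries as well, indexing and search stay consistent. Note that decomposition only handles characters that split into a base letter plus marks; letters such as `ø` or `ß` would need an explicit mapping.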

lucaong commented 1 year ago

@howesteve I will go ahead and close the issue, as I think the question was answered and there has been no further activity, but do feel free to comment further and I will reopen it if necessary.

lucaong / minisearch

Multi language support? #211