lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License

Multi language support? #211

Closed howesteve closed 1 year ago

howesteve commented 1 year ago

Hi. Thanks for this project.

I need to support both English and Portuguese documents in my project. Is there a way to achieve this? I see that language-specific search support is limited by design, and I'm OK with that. I am willing to add tokenization/stemming/stopwords/diacritics/etc. manually; that is no problem. However, it seems I can only specify a tokenizer globally on the MiniSearch instance, and not in the MiniSearch.add(doc) call, so I cannot have a per-document language and have to stick to the one language that was assigned at instance creation. Is that correct? It also seems that MiniSearch.searchOptions.tokenize(term) only receives the term being analyzed, with no reference to the current document. So I cannot inspect the current document to find out what language it is written in, in either the doc() or the tokenize() functions.

Thanks.

lucaong commented 1 year ago

Hi @howesteve, you are right: the tokenizer currently does not receive a reference to the document.

Consider, though, that the tokenizer and term processor are also used upon search. If several different (possibly incompatible) tokenizers and/or term processors were used upon indexing, search would not know which one to use.

This is why it’s usually better to do one of the following:

What kind of language-specific processing are you applying, and how would you implement it if the tokenize and processTerm callbacks had a reference to the document? I am happy to consider whether this can be made easier, but I am interested in how you would solve incompatibilities of the tokenizer or processor upon search.
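To make the incompatibility concern concrete, here is a minimal sketch using a toy inverted index rather than MiniSearch itself. Everything in it is hypothetical: the `processors` table, the crude suffix-stripping "stemmers", and the `add`/`search` helpers are invented for illustration and are not part of the MiniSearch API.

```javascript
// Hypothetical per-language term processors (toy suffix strippers, not real stemmers).
const processors = {
  en: (term) => term.toLowerCase().replace(/(ing|s)$/, ''),
  pt: (term) => term.toLowerCase().replace(/(ando|endo|indo|s)$/, ''),
}

// Toy inverted index: processed term -> set of document ids.
const index = new Map()

// Index a document with the processor that matches its language.
function add(doc) {
  const processTerm = processors[doc.lang]
  for (const token of doc.text.split(/\s+/)) {
    const term = processTerm(token)
    if (!index.has(term)) index.set(term, new Set())
    index.get(term).add(doc.id)
  }
}

// At search time there is no document to inspect, so the caller
// has to pick a processor for the query somehow.
function search(query, lang) {
  const term = processors[lang](query)
  return [...(index.get(term) || [])]
}

add({ id: 1, lang: 'en', text: 'running shoes' })
add({ id: 2, lang: 'pt', text: 'correndo rápido' })

console.log(search('running', 'en'))  // query processed like the English document
console.log(search('running', 'pt'))  // same query, Portuguese processor: no match
```

With this toy setup, the English document is only found when the query happens to be processed with the English processor. The index alone gives no hint about which processor a query term should go through.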

lucaong commented 1 year ago

Just for clarity, the first approach that I suggest in my comment above would consist of:

If you do so, you will be able to use the same options for all languages. This is the language-agnostic strategy that I apply on several projects, usually with very good results. Of course, it might depend on the specifics of your project, but I would encourage you to try this out first, and only add language-specific settings such as stemming and stop words if really necessary (MiniSearch is designed to generally handle things well without stop words and stemming).
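As a rough sketch of what a language-agnostic setup could look like, the term processor below lowercases terms and strips diacritics, with no stemming and no stop words. This is an assumption made for illustration, not necessarily the exact steps meant above. The field names and the `documents` variable in the commented wiring are hypothetical.

```javascript
// Language-agnostic term processing: lowercase and strip diacritics,
// without stemming or stop words.
function processTerm(term) {
  return term
    .normalize('NFD')                 // split letters from their combining marks
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining marks
    .toLowerCase()
}

console.log(processTerm('Informação'))
console.log(processTerm('RÁPIDO'))

// Possible wiring, assuming the `minisearch` package is installed.
// Field names and `documents` are placeholders:
//
// import MiniSearch from 'minisearch'
// const miniSearch = new MiniSearch({ fields: ['title', 'text'], processTerm })
// miniSearch.addAll(documents)
// miniSearch.search('informacao', { prefix: true, fuzzy: 0.2 })
```

Because the same `processTerm` is applied to every document and, by default, to search queries as well, English and Portuguese terms end up in one consistent form. Prefix and fuzzy matching can then absorb some of the variation that stemming would otherwise handle.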

lucaong commented 1 year ago

@howesteve I will go on and close the issue, as I think the question was answered, and there is no further activity, but do feel free to comment further and I will reopen it if necessary.