Hi @howesteve, you are right, the tokenizer currently does not receive a reference to the document.
Consider though that the tokenizer and term processor are also used upon search. If several different (possibly incompatible) tokenizers and/or term processors were used upon indexing, search would not know which one to use.
This is why it’s usually better to do one of the following:
title_en vs. title_pt. This allows you to use different tokenizers, but can get cumbersome in some cases.
What kind of language-specific processing are you applying, and how would you implement it if the tokenize and processTerm callbacks had a reference to the document? I am happy to consider if this can be made easier, but I am interested in how you would solve incompatibilities of the tokenizer or processor upon search.
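To make the per-language fields idea (title_en vs. title_pt) more concrete, here is a minimal sketch. It assumes each document carries a hypothetical `lang` property, and the helper names (`toIndexable`, `STOP_WORDS`, `processTermByField`) and the tiny stop word lists are made up for illustration. It relies on the `processTerm` callback receiving the field name as its second argument and on a falsy return value discarding the term; whether a field name is available at search time is worth verifying for your version.

```javascript
// Hypothetical: documents look like { id, lang: 'en' | 'pt', title }.
// Map each document to a language-suffixed field before indexing.
const toIndexable = (doc) => ({
  id: doc.id,
  [`title_${doc.lang}`]: doc.title,
});

// Tiny illustrative stop word lists, not meant to be complete.
const STOP_WORDS = {
  en: new Set(['the', 'a', 'of']),
  pt: new Set(['o', 'a', 'de']),
};

// processTerm-style callback: derive the language from the field suffix,
// lowercase the term, and drop it if it is a stop word for that language.
// When no field name is given, the term is only lowercased.
const processTermByField = (term, fieldName) => {
  const lower = term.toLowerCase();
  const lang = fieldName ? fieldName.split('_').pop() : undefined;
  const stopWords = STOP_WORDS[lang];
  return stopWords && stopWords.has(lower) ? null : lower;
};
```

With something like this, the index would be created with fields such as title_en and title_pt, and each query would probably need to be restricted to the fields of one language, which is where this approach can get cumbersome.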
Just for clarity, the first approach that I suggest in my comment above would consist in:
If you do so, you will be able to use the same options for all languages. This is the language-agnostic strategy that I apply on several projects, usually with very good results. Of course, it might depend on the specifics of your project, but I would encourage you to try this out first, and only add language-specific settings such as stemming and stop words if really necessary (MiniSearch is designed to generally handle things well without stop words and stemming).
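As a rough illustration of what a language-agnostic setup could look like, here is a minimal sketch. The specific choices (splitting on non-letter, non-digit characters, lowercasing, and stripping diacritics) are assumptions for the example rather than a list of recommended steps; only the `tokenize` and `processTerm` option names are MiniSearch's.

```javascript
// Split on any run of characters that are not Unicode letters, combining
// marks, or numbers, so the same tokenizer works for English and Portuguese.
const tokenize = (text) =>
  text.split(/[^\p{L}\p{M}\p{N}]+/u).filter((token) => token.length > 0);

// Lowercase and strip diacritics: decompose to NFD, then remove the
// combining marks, so that "Ação" and "acao" map to the same term.
const processTerm = (term) =>
  term.toLowerCase().normalize('NFD').replace(/\p{M}/gu, '');

// Possible usage (field names are hypothetical):
// const miniSearch = new MiniSearch({ fields: ['title', 'text'], tokenize, processTerm });
```

Because the same two functions are applied regardless of language, documents and queries are processed consistently and no per-document language information is needed.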
@howesteve I will go on and close the issue, as I think the question was answered, and there is no further activity, but do feel free to comment further and I will reopen it if necessary.
Hi. Thanks for this project.
I need to support both English and Portuguese documents in my project. Is there a way to achieve this? I see that language support is limited by design, and I'm OK with that. I am willing to add tokenization/stemming/stopwords/diacritics/etc. manually; that is no problem. However, it seems I can only specify a tokenizer globally on the MiniSearch instance, and not in the MiniSearch.add(doc) call, so I cannot have a per-document language and have to stick to the single language that was assigned at instance creation. Is that correct? It also seems that MiniSearch.searchOptions.tokenize(term) only receives the term being analyzed, with no reference to the current document, so I cannot inspect the current document to find out what language it is written in, in either the doc() or the tokenize() functions.
Thanks.