Closed: nikolay-mihaylov closed this issue 5 years ago.
Hi @nikolay-mihaylov, thanks for taking the time to report this. I suspect the problem is in the default tokenizer, which attempts to split on non-word characters using a Unicode range that probably only includes Latin characters and their variations. I will try to make it work out of the box for most alphabets, and I will definitely document how to support different languages. In the meantime, you can specify a custom tokenizer (e.g. splitting on whitespace) with:
```javascript
new MiniSearch({
  fields: [...],
  tokenize: (str) => str.split(/\s+/)
})
```
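To illustrate, here is how that whitespace tokenizer behaves on its own (a minimal sketch in plain Node.js, no MiniSearch involved; the sample string is made up for illustration):

```javascript
// Whitespace-based tokenizer, as suggested above.
const tokenize = (str) => str.split(/\s+/);

// Cyrillic words survive intact, since we only split on whitespace runs.
console.log(tokenize('бързо кафяво лисиче'));
// → [ 'бързо', 'кафяво', 'лисиче' ]
```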
If my hypothesis is correct, that should be enough to get it to work with Cyrillic too.
I will test myself soon (writing from my mobile now).
Thanks again!
Hi @nikolay-mihaylov, I can confirm that the problem is with the default tokenizer. You can fix it, while keeping the behavior fully backwards compatible, by setting it to:
```javascript
new MiniSearch({
  fields: [...],
  tokenize: (string, _fieldName) => string.split(/[^a-zA-Z0-9\u00C0-\u017F\u0400-\u04FF\u0500-\u052F]+/)
})
```
The additional `\u0400-\u04FF` and `\u0500-\u052F` ranges cover the Cyrillic and Cyrillic Supplement Unicode blocks.
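Run standalone, that tokenizer splits mixed-script text as expected (a sketch in plain Node.js; empty tokens from leading/trailing punctuation are filtered out here for clarity, and the sample string is made up):

```javascript
// Tokenizer covering Latin (including accented variants via \u00C0-\u017F)
// plus the Cyrillic and Cyrillic Supplement blocks.
const tokenize = (string) =>
  string
    .split(/[^a-zA-Z0-9\u00C0-\u017F\u0400-\u04FF\u0500-\u052F]+/)
    .filter(Boolean); // drop empty strings produced by edge punctuation

console.log(tokenize('Привет, world! café'));
// → [ 'Привет', 'world', 'café' ]
```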
I am considering making this the default, and will definitely do so if there is no performance penalty.
Hi @nikolay-mihaylov, I just released a new version, v1.3.0, that supports Cyrillic (and any non-Latin script in Unicode) by default. It's tested, but let me know if you experience any problems with it.
Thanks a lot for reporting this issue!
Hey, I just played a bit with your library. Is it intentional that it does not work with Cyrillic/Unicode strings? Perhaps you left it out for performance reasons?
Simple snippet to reproduce:
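(The original snippet is not shown above. A hypothetical stand-in demonstrating the root cause, in plain Node.js without MiniSearch, might look like the following; the regex is only an approximation of the pre-v1.3.0 default tokenizer as described earlier in this thread.)

```javascript
// Latin-only tokenizer regex, approximating the old default behavior:
// a purely Cyrillic string is one long run of "separator" characters,
// so tokenization produces nothing to index and searches find no results.
const latinOnlyTokenize = (str) =>
  str.split(/[^a-zA-Z0-9\u00C0-\u017F]+/).filter(Boolean);

console.log(latinOnlyTokenize('котка')); // → [] — nothing gets indexed
console.log(latinOnlyTokenize('cat'));   // → [ 'cat' ]
```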