cyrillic/unicode search not working. Is it intended?

lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node

https://lucaong.github.io/minisearch/

MIT License

cyrillic/unicode search not working. Is it intended? #9

Closed: nikolay-mihaylov closed this issue 5 years ago

nikolay-mihaylov commented 5 years ago

Hey, I just played with your library a bit. Is it intentional that it does not work with Cyrillic/Unicode strings? Perhaps you left that out for performance reasons?

Simple snippet to reproduce:

import MiniSearch from 'minisearch';

const documents = [
    {id: 1, title: 'София'},
    {id: 2, title: 'Пловдив'},
    {id: 3, title: 'Sofia'},
    {id: 4, title: 'Plovdiv'},
];
const miniSearch = new MiniSearch({ fields: ['title']});
miniSearch.addAll(documents);

console.log(miniSearch.autoSuggest('so')); // Works!
console.log(miniSearch.autoSuggest('со')); // Nope :(
console.log(miniSearch.autoSuggest('пло')); // Nope :(
console.log(miniSearch.autoSuggest('Plo')); // Works!

lucaong commented 5 years ago

Hi @nikolay-mihaylov, thanks for taking the time to report this. I suspect the problem is in the default tokenizer, which splits on non-word characters using a Unicode range that probably includes only Latin characters and their variants. I will try to make it work out of the box for most alphabets, and will definitely document how to support different languages. In the meantime, you can specify a custom tokenizer (e.g. one that splits on whitespace) with:

new MiniSearch({
  fields: [...],
  tokenize: (str) => str.split(/\s+/)
})

If my hypothesis is correct, that should be enough to get it to work with Cyrillic too.

I will test it myself soon (writing from my mobile now).

Thanks again!
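To illustrate the hypothesis, here is a minimal sketch in plain JavaScript, with no MiniSearch involved. The `latinOnlySplit` helper and its pattern are assumptions made for illustration, not the library's confirmed default; `whitespaceSplit` is the workaround suggested above.

```javascript
// Hypothetical Latin-only splitter (an assumption, used only to show the failure mode):
// every character outside these ranges is treated as a separator.
const latinOnlySplit = (str) => str.split(/[^a-zA-Z0-9\u00C0-\u017F]+/);

// Workaround from the comment above: split on whitespace only.
const whitespaceSplit = (str) => str.split(/\s+/);

const text = 'София Пловдив';

// The whole string consists of "separator" characters, so no real term survives.
console.log(latinOnlySplit(text)); // [ '', '' ]

// Splitting on whitespace keeps the Cyrillic words intact.
console.log(whitespaceSplit(text)); // [ 'София', 'Пловдив' ]
```

If the default tokenizer behaves like the first helper, Cyrillic titles would never produce indexable terms, which would match the empty results in the report.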

lucaong commented 5 years ago

Hi @nikolay-mihaylov, I can confirm that the problem is with the default tokenizer. You can fix it, while keeping the behavior completely backwards compatible, by setting the tokenizer to:

new MiniSearch({
  fields: [...],
  tokenize: (string, _fieldName) => string.split(/[^a-zA-Z0-9\u00C0-\u017F\u0400-\u04FF\u0500-\u052F]+/)
})

The additional \u0400-\u04FF and \u0500-\u052F ranges cover the Cyrillic and Cyrillic Supplement blocks.

I am considering making this the default, and will definitely do so if there is no performance penalty.
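As a quick check of what the extended pattern does and does not cover, the sketch below applies the same regular expression outside of MiniSearch. The sample strings and the `SEPARATORS` name are chosen for illustration only.

```javascript
// Pattern copied from the tokenizer above; SEPARATORS is just an illustrative name.
const SEPARATORS = /[^a-zA-Z0-9\u00C0-\u017F\u0400-\u04FF\u0500-\u052F]+/;
const tokenize = (string, _fieldName) => string.split(SEPARATORS);

// Latin and Cyrillic words are both kept; punctuation and spaces act as separators.
console.log(tokenize('София, Пловдив and Sofia'));
// [ 'София', 'Пловдив', 'and', 'Sofia' ]

// Scripts outside the listed ranges (Greek here) are still treated as separators.
console.log(tokenize('Αθήνα'));
// [ '', '' ]
```

The second call suggests why adding one range per script does not scale, and why a script-agnostic default is the more general fix.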

lucaong commented 5 years ago

Hi @nikolay-mihaylov, I just released a new version, v1.3.0, that supports Cyrillic (and any non-Latin script in Unicode) by default. It's tested, but let me know if you experience any problems with it.

Thanks a lot for reporting this issue!
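For readers who need a script-agnostic tokenizer in their own code, one possible approach is to use Unicode property escapes. This is only a sketch under that assumption and not necessarily how v1.3.0 implements its default; `tokenizeAnyScript` is a hypothetical helper name.

```javascript
// Hypothetical helper: treat anything that is not a letter (\p{L}),
// combining mark (\p{M}) or number (\p{N}) as a separator.
// Requires a runtime with Unicode property escape support (the `u` flag).
const tokenizeAnyScript = (str) =>
  str.split(/[^\p{L}\p{M}\p{N}]+/u).filter((term) => term.length > 0);

console.log(tokenizeAnyScript('София, Αθήνα, 東京 and Sofia'));
// [ 'София', 'Αθήνα', '東京', 'and', 'Sofia' ]
```

A function like this could be passed as the `tokenize` option shown earlier in the thread. Note that scripts written without spaces between words, such as Japanese, come out as one term per run of characters, so they may need a dedicated segmenter.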