nextapps-de / flexsearch

Next-Generation full text search library for Browser and Node.js
Apache License 2.0

Mixed English and CJK (multi-lang) #202

Closed favoyang closed 3 years ago

favoyang commented 3 years ago

How does FlexSearch handle mixed languages, such as English mixed with CJK?

English:

FlexSearch.create({
    encode: "icase",
    tokenize: "reverse"
});

CJK:

FlexSearch.create({
    encode: false,
    tokenize: function(str){
        return str.replace(/[\x00-\x7F]/g, "").split("");
    }
});

By mixing these two I get:

FlexSearch.create({
    encode: "icase",
    tokenize: function(str){
        const cjkItems = str.replace(/[\x00-\x7F]/g, "").split("");
        const asciiItems = str.replace(/[^\x00-\x7F]/g, "").split(/\W+/);
        return cjkItems.concat(asciiItems);
    }
});

But I want to keep the CJK tokens while also applying the "reverse" behavior to the English letters. Is that possible?

i.e.

"Flexsearch是个轻量级的搜索引擎"

search => matched
flexsearch => matched
搜索引擎 => matched

ts-thomas commented 3 years ago

Related to #207: you need to provide your own "encoder" that applies these transformations depending on the matched language.
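A minimal sketch of such a language-dependent function, for illustration only. `tokenizeMixed` is a hypothetical helper name, not part of FlexSearch. It assumes that emitting every suffix of each lowercased ASCII word is an acceptable approximation of the "reverse" behavior; it is not FlexSearch's own implementation of that tokenizer. Non-ASCII characters are emitted one by one, as in the CJK config above.

```javascript
// Hypothetical helper (not a FlexSearch API): split mixed text into tokens.
function tokenizeMixed(str) {
    const tokens = new Set();

    // Every non-ASCII character (e.g. a CJK ideograph) becomes its own token.
    // for...of iterates by code point, so surrogate pairs stay intact.
    for (const ch of str.replace(/[\x00-\x7F]/g, "")) {
        tokens.add(ch);
    }

    // Replace non-ASCII characters with spaces so that ASCII words on either
    // side of a CJK run are not glued together, then split into words.
    const words = str
        .toLowerCase()
        .replace(/[^\x00-\x7F]/g, " ")
        .split(/\W+/)
        .filter(Boolean); // drop empty strings produced by split

    // Assumption: all suffixes of a word approximate "reverse" matching.
    for (const word of words) {
        for (let i = 0; i < word.length; i++) {
            tokens.add(word.slice(i));
        }
    }

    return Array.from(tokens);
}

const tokens = tokenizeMixed("Flexsearch是个轻量级的搜索引擎");
console.log(tokens.includes("search"), tokens.includes("搜"));
```

Such a function could be passed as the `tokenize` option in the same way as the custom functions in the question. Whether the same function is also applied to queries depends on the FlexSearch version, so that is worth checking before relying on it.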

ts-thomas commented 3 years ago

Probably the best solution is to use two indexes, one per language, and run your queries against one or both of them.
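The two-index idea could be sketched roughly as follows. `searchMixed` is a hypothetical helper, and the sketch assumes each index object exposes a synchronous `search(query)` that returns an array of ids; the two indexes themselves would be created with the English and CJK configs from the question. Results are merged as a union, which is only one possible choice; a mixed query might instead call for an intersection.

```javascript
// Hypothetical helper (not a FlexSearch API): route a query to one or both
// indexes depending on which kinds of characters it contains.
// Assumption: index.search(query) synchronously returns an array of ids.
function searchMixed(query, asciiIndex, cjkIndex) {
    // ASCII part: non-ASCII characters are replaced by spaces.
    const asciiPart = query.replace(/[^\x00-\x7F]/g, " ").trim();
    // CJK part: all ASCII characters are removed.
    const cjkPart = query.replace(/[\x00-\x7F]/g, "");

    const ids = new Set();
    if (/\w/.test(asciiPart)) {
        for (const id of asciiIndex.search(asciiPart)) ids.add(id);
    }
    if (cjkPart.length > 0) {
        for (const id of cjkIndex.search(cjkPart)) ids.add(id);
    }
    // Union of both result lists, without duplicates.
    return Array.from(ids);
}
```

Each document would be added to both indexes under the same id, so that ids returned by either index refer to the same document.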

mmm8955405 commented 2 years ago

encode: function(str){
    const cjkItems = str.replace(/[\x00-\x7F]/g, "").split("");
    const asciiItems = str.split(/\W+/);
    return cjkItems.concat(asciiItems);
}

It does work! But I don't know what impact it will have on performance.