nextapps-de / flexsearch

Next-Generation full text search library for Browser and Node.js
Apache License 2.0

Cyrillic & latin symbols same time search #182

Closed: japanes closed this issue 3 years ago

japanes commented 4 years ago

According to https://github.com/nextapps-de/flexsearch/issues/51, we can use the options below to search by Cyrillic symbols:

{
    encode: false,
    split: /\s+/,
    tokenize: "reverse"
}

But these options break searching by Latin symbols.

For example, searching for бренд Microsoft ("brand Microsoft" in English) doesn't work.

How can I fix it?

gmfmi commented 4 years ago

Hi @japanes,

I had the same issue and finally decided to look into the source code to find a solution that does not require recompiling the library.

Basically, we would like to use the simple encoder (instead of setting encode to false), but it removes every character other than Latin letters, numbers and spaces. I tweaked the regexp patterns a little and now everything works. You can search in both Latin and Cyrillic!

There is probably a prettier way to do it but this should work for you:

{
    split: /\s+/,
    tokenize: "reverse",
    encode: function(str) {
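        // Ordered [replacement, pattern] pairs. An array is used because an
        // object literal cannot hold the " " key twice: the later entry would
        // silently overwrite the earlier one.
        var regexp_replacements = [
            ["a", /[àáâãäå]/g],
            ["e", /[èéêë]/g],
            ["i", /[ìíîï]/g],
            ["o", /[òóôõöő]/g],
            ["u", /[ùúûüű]/g],
            ["y", /[ýŷÿ]/g],
            ["n", /ñ/g],
            ["c", /[ç]/g],
            ["s", /ß/g],
            [" ", /[-/]/g],  // hyphens and slashes become spaces
            ["", /['!"#$%&\\()\*+,-./:;<=>?@[\]^_`{|}~]/g],  // strip remaining punctuation
            [" ", /\s+/g]  // collapse runs of whitespace
        ];
        str = str.toLowerCase();
        for (var pair of regexp_replacements) {
            str = str.replace(pair[1], pair[0]);
        }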
        return str === " " ? "" : str;
    }
}

Hope that helps 😉
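As for the "prettier way" mentioned above, one possibility is to lean on Unicode normalization and regex property escapes instead of hand-written character lists. The sketch below is an assumption, not something taken from FlexSearch: `encodeUnicode` is a hypothetical name, and it needs an engine that supports `\p{...}` escapes (ES2018 or later). It strips accents only from Latin letters, so Cyrillic letters such as й are left intact.

```javascript
// Hypothetical encoder sketch; not part of FlexSearch.
function encodeUnicode(str) {
    return str
        .toLowerCase()
        .normalize("NFD")                              // split letters from their combining accents
        .replace(/(\p{Script=Latin})\p{M}+/gu, "$1")   // drop accents that follow a Latin letter
        .normalize("NFC")                              // recompose everything else (e.g. Cyrillic й)
        .replace(/ß/g, "s")                            // same mapping as the regexp list above
        .replace(/[-/]+/g, " ")                        // hyphens and slashes become spaces
        .replace(/[^\p{L}\p{M}\p{N}\s]+/gu, "")        // remove anything that is not a letter, mark, digit or space
        .replace(/\s+/g, " ")                          // collapse runs of whitespace
        .trim();
}

console.log(encodeUnicode("Бренд Microsoft!"));
console.log(encodeUnicode("Crème-brûlée"));
```

It could be passed as the `encode` option in place of the function above. Combining marks (`\p{M}`) are kept for non-Latin scripts because some scripts, Sinhala for example, write vowels with them.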

mryodo commented 4 years ago

Hi @gmfmi!

Thanks for the code! Weirdly though, I can't get the Cyrillic part working on version 0.6.32 (Latin is OK). Is it me or the version, what do you think?

gmfmi commented 4 years ago

Hello @mryodo, I created it on v0.6.32, so it is not a version issue. I tested with Japanese/Latin and Sinhala/Latin. I did not try it with tokenize: "reverse"; maybe try the "forward" setting, which is what I used. I am currently on vacation for the week, so I cannot retest it myself... :/
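To make the difference between the two settings concrete, here is a simplified illustration. It assumes the behaviour described in the 0.6.x README ("forward" indexes each word incrementally in forward direction, "reverse" in both directions) and is not FlexSearch's actual implementation; `forwardTokens` and `reverseTokens` are hypothetical names.

```javascript
// Simplified illustration only; FlexSearch's real tokenizer is more involved.
function forwardTokens(word) {
    var tokens = [];
    for (var i = 1; i <= word.length; i++) {
        tokens.push(word.slice(0, i)); // every prefix: "б", "бр", "бре", ...
    }
    return tokens;
}

function reverseTokens(word) {
    var tokens = forwardTokens(word); // prefixes, as with "forward"
    for (var i = 1; i < word.length; i++) {
        tokens.push(word.slice(i)); // plus every proper suffix: "ренд", "енд", ...
    }
    return tokens;
}

console.log(forwardTokens("бренд"));
console.log(reverseTokens("бренд"));
```

Under that assumption, a query such as "енд" can only match when the index was built with "reverse", while a prefix such as "бре" matches with either setting.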

mryodo commented 4 years ago

@gmfmi thanks! No rush, vacation always comes first ;-) The forward option did not work either; I guess the fact that I'm plugging it into Gatsby is somehow important.

gmfmi commented 4 years ago

@mryodo, I am back and I retested my code sample. It works as expected, so the issue probably comes from the Gatsby integration. If you want to try it yourself, I made a simple HTML example. Copy and paste the following code into any HTML page and open your browser's console to see the result.

<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
</head>
<body>
    <script src="https://cdn.jsdelivr.net/npm/flexsearch@0.6.32/dist/flexsearch.min.js"></script>
    <script>
        var index = new FlexSearch({
                            split: /\s+/,
                            tokenize: "reverse",
                            encode: function(str) {
                                // Ordered [replacement, pattern] pairs; an object literal
                                // would silently drop the duplicate " " key.
                                var regexp_replacements = [
                                    ["a", /[àáâãäå]/g],
                                    ["e", /[èéêë]/g],
                                    ["i", /[ìíîï]/g],
                                    ["o", /[òóôõöő]/g],
                                    ["u", /[ùúûüű]/g],
                                    ["y", /[ýŷÿ]/g],
                                    ["n", /ñ/g],
                                    ["c", /[ç]/g],
                                    ["s", /ß/g],
                                    [" ", /[-/]/g],
                                    ["", /['!"#$%&\\()\*+,-./:;<=>?@[\]^_`{|}~]/g],
                                    [" ", /\s+/g]
                                ];
                                str = str.toLowerCase();
                                for (var pair of regexp_replacements) {
                                    str = str.replace(pair[1], pair[0]);
                                }
                                return str === " " ? "" : str;
                            }
                        });

        index.add(0, "бренд Microsoft");

        console.log("English 'Micro' search result:", index.search("Micro"));
        console.log("Russian 'бре' search result:", index.search("бре"));
        console.log("Russian 'енд' search result:", index.search("енд"));
        console.log("Expected empty 'Привет' search result:", index.search("Привет"));
    </script>
</body>
</html>

Good luck finding a solution!

mryodo commented 4 years ago

@gmfmi thank you for your help anyway!

mryodo commented 4 years ago

Okay, I will post my idiotic quasi-solution here in case anyone needs it: apparently, there is some kind of problem with the options wrapper in react-use-flexsearch (you can look here): it uses

    const rawResults = index.search(query, searchOptions)

which created a problem for me (I guess; a proper human being would dig into the reason, but I'm too lazy). As I realised, I need to pass the options to FlexSearch.create(...) as

   const importedIndex = FlexSearch.create({
                        split: /\s+/,
                        tokenize: "reverse",
                        encode: function(str) {
                            // Ordered [replacement, pattern] pairs; an object literal
                            // would silently drop the duplicate " " key.
                            var regexp_replacements = [
                                ["a", /[àáâãäå]/g],
                                ["e", /[èéêë]/g],
                                ["i", /[ìíîï]/g],
                                ["o", /[òóôõöő]/g],
                                ["u", /[ùúûüű]/g],
                                ["y", /[ýŷÿ]/g],
                                ["n", /ñ/g],
                                ["c", /[ç]/g],
                                ["s", /ß/g],
                                [" ", /[-/]/g],
                                ["", /['!"#$%&\\()\*+,-./:;<=>?@[\]^_`{|}~]/g],
                                [" ", /\s+/g]
                            ];
                            str = str.toLowerCase();
                            for (var pair of regexp_replacements) {
                                str = str.replace(pair[1], pair[0]);
                            }
                            return str === " " ? "" : str;
                        }
                    })

which is a fabulous solution by @gmfmi (please give him a thumbs up!). I guess a simpler const importedIndex = FlexSearch.create(searchOptions) will also be okay.

This, of course, implies that I just hard-coded an edited version of the react-use-flexsearch index file locally on the site. Kind of a shitshow.

PS maybe this should be moved somewhere

angeloashmore commented 3 years ago

@mryodo I realize I'm replying almost a year later 😅, but you can pass a FlexSearch instance directly to useFlexSearch. By doing that, you shouldn't need to have a hard-coded edited version of the hook in your project.

const importedIndex = FlexSearch.create({
  split: /\s+/,
  tokenize: "reverse",
  encode: function (str) {
    // Ordered [replacement, pattern] pairs; an object literal
    // would silently drop the duplicate " " key.
    var regexp_replacements = [
      ["a", /[àáâãäå]/g],
      ["e", /[èéêë]/g],
      ["i", /[ìíîï]/g],
      ["o", /[òóôõöő]/g],
      ["u", /[ùúûüű]/g],
      ["y", /[ýŷÿ]/g],
      ["n", /ñ/g],
      ["c", /[ç]/g],
      ["s", /ß/g],
      [" ", /[-/]/g],
      ["", /['!"#$%&\\()\*+,-./:;<=>?@[\]^_`{|}~]/g],
      [" ", /\s+/g],
    ]
    str = str.toLowerCase()
    for (var pair of regexp_replacements) {
      str = str.replace(pair[1], pair[0])
    }
    return str === " " ? "" : str
  },
})

importedIndex.import(yourExistingIndex)

const results = useFlexSearch(query, importedIndex)

Ideally, you can instantiate the index outside your React component and call importedIndex.import only once within your component for better performance.
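One way to get the "create outside, import once" behaviour is a small module-level loader. This is only a sketch under assumptions: `makeIndexLoader` and `loadIndex` are hypothetical names, and `createIndex` stands in for a call such as FlexSearch.create(options) so that the snippet runs without the library.

```javascript
// Hypothetical helper: creates the index lazily and imports the serialized data only once.
function makeIndexLoader(createIndex) {
    var cached = null; // lives outside any component, so it survives re-renders
    return function loadIndex(serialized) {
        if (cached === null) {
            cached = createIndex();     // e.g. FlexSearch.create({ ...options })
            cached.import(serialized);  // runs on the first call only
        }
        return cached;
    };
}

// Stand-in factory for demonstration; a real one would return a FlexSearch instance.
var loadIndex = makeIndexLoader(function () {
    return {
        imported: 0,
        import: function () { this.imported++; }
    };
});

var first = loadIndex("serialized index data");
var second = loadIndex("serialized index data");
console.log(first === second, first.imported);
```

Inside the component, the result of loadIndex(yourExistingIndex) would then be what gets passed to useFlexSearch. Note that this sketch ignores later changes to the serialized data; if the exported index can change, the cache would need a key.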

ts-thomas commented 3 years ago

If you like, we can open a new thread in "Discussions" to improve this use case. Versions >= 0.7.x include some improvements to charset normalization ("normalize").