Hi @japanes,
I had the same issue and finally decided to look into the source code to find a solution that does not require recompiling the code.
Basically, we would like to use the "simple" encoder (instead of setting encode to false), but it removes any characters other than Latin letters, numbers and spaces. I tweaked the regexp patterns a little and now everything works. You can search in both Latin and Cyrillic!
There is probably a prettier way to do it, but this should work for you:
{
  split: /\s+/,
  tokenize: "reverse",
  encode: function(str) {
    // Ordered [pattern, replacement] pairs. An object cannot hold the " " key
    // twice (the second entry overwrites the first), so an array is used.
    var regexp_replacements = [
      [/[àáâãäå]/g, "a"],
      [/[èéêë]/g, "e"],
      [/[ìíîï]/g, "i"],
      [/[òóôõöő]/g, "o"],
      [/[ùúûüű]/g, "u"],
      [/[ýŷÿ]/g, "y"],
      [/ñ/g, "n"],
      [/[ç]/g, "c"],
      [/ß/g, "s"],
      [/[-/]/g, " "],
      [/['!"#$%&\\()\*+,-./:;<=>?@[\]^_`{|}~]/g, ""],
      [/\s+/g, " "]
    ];
    str = str.toLowerCase();
    for (var pair of regexp_replacements) {
      str = str.replace(pair[0], pair[1]);
    }
    return str === " " ? "" : str;
  }
}
Hope this helps 😉
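To see what an encoder like the one above does to a string, it can be run on its own, outside FlexSearch. The following is a minimal sketch under a few assumptions: `encodeSketch` is a name made up for this example, the replacement table is shortened, and the rules are kept as an ordered array of pairs (an object literal can only hold one `" "` key, so an array keeps both whitespace rules).

```javascript
// Standalone sketch of the encoder idea, for experimenting outside FlexSearch.
// `encodeSketch` is a hypothetical name; the table is a shortened version.
function encodeSketch(str) {
  var replacements = [
    [/[àáâãäå]/g, "a"],
    [/[èéêë]/g, "e"],
    [/[-/]/g, " "],                                  // hyphens and slashes become spaces
    [/['!"#$%&\\()*+,\-./:;<=>?@[\]^_`{|}~]/g, ""],  // remaining punctuation is dropped
    [/\s+/g, " "]                                    // collapse runs of whitespace
  ];
  str = str.toLowerCase();
  replacements.forEach(function (pair) {
    str = str.replace(pair[0], pair[1]);
  });
  return str === " " ? "" : str;
}

console.log(encodeSketch("Бренд Microsoft!")); // "бренд microsoft"
console.log(encodeSketch("Café-Crème"));       // "cafe creme"
```

Cyrillic letters pass through untouched because none of the patterns match them; only the listed accented Latin letters and punctuation are rewritten.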
Hi @gmfmi!
Thanks for the code! Oddly though, I can't get the Cyrillic part working on version 0.6.32 (Latin is OK). Is it me or the version, what do you think?
Hello @mryodo, I created it on v0.6.32, so it is not a version issue. I tested it with Japanese/Latin and Sinhala/Latin. I did not try it with tokenize: "reverse"; maybe try the "forward" setting, which is what I used. I am currently on vacation for the week, so I cannot retest it myself... :/
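For anyone comparing the two settings: as far as the FlexSearch README describes them, "forward" indexes a word incrementally from its start, while "reverse" indexes it in both directions. The sketch below only illustrates which partial strings each mode could match. The helpers `forwardTokens` and `reverseTokens` are hypothetical and are not FlexSearch's internal code.

```javascript
// Simplified illustration of the partial strings each tokenize mode can match.
// These helpers are hypothetical and are NOT how FlexSearch is implemented.
function forwardTokens(word) {
  var tokens = [];
  for (var i = 1; i <= word.length; i++) {
    tokens.push(word.slice(0, i)); // prefixes: "б", "бр", "бре", ...
  }
  return tokens;
}

function reverseTokens(word) {
  var tokens = forwardTokens(word);
  for (var i = 1; i < word.length; i++) {
    tokens.push(word.slice(i)); // proper suffixes: "ренд", "енд", ...
  }
  return tokens;
}

console.log(forwardTokens("бренд"));                       // ["б", "бр", "бре", "брен", "бренд"]
console.log(reverseTokens("бренд").indexOf("енд") !== -1); // true
console.log(forwardTokens("бренд").indexOf("енд") !== -1); // false
```

Under this reading, a query such as "енд" would only be expected to match with "reverse", while "бре" should match with either setting.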
@gmfmi thanks! No rush, vacation always comes first ;-) The "forward" option did not fix it either; I guess the fact that I'm plugging it into Gatsby is somehow relevant.
@mryodo, I am back and I retested my code sample. It works as expected, so the issue probably comes from the Gatsby integration. If you want to try it yourself, I made a minimal HTML example. Copy and paste the following code into any HTML page and open your browser's console to see the result.
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
</head>
<body>
  <script src="https://cdn.jsdelivr.net/npm/flexsearch@0.6.32/dist/flexsearch.min.js"></script>
  <script>
    var index = new FlexSearch({
      split: /\s+/,
      tokenize: "reverse",
      encode: function(str) {
        // Ordered [pattern, replacement] pairs. An object cannot hold the " " key
        // twice (the second entry overwrites the first), so an array is used.
        var regexp_replacements = [
          [/[àáâãäå]/g, "a"],
          [/[èéêë]/g, "e"],
          [/[ìíîï]/g, "i"],
          [/[òóôõöő]/g, "o"],
          [/[ùúûüű]/g, "u"],
          [/[ýŷÿ]/g, "y"],
          [/ñ/g, "n"],
          [/[ç]/g, "c"],
          [/ß/g, "s"],
          [/[-/]/g, " "],
          [/['!"#$%&\\()\*+,-./:;<=>?@[\]^_`{|}~]/g, ""],
          [/\s+/g, " "]
        ];
        str = str.toLowerCase();
        for (var pair of regexp_replacements) {
          str = str.replace(pair[0], pair[1]);
        }
        return str === " " ? "" : str;
      }
    });
    index.add(0, "бренд Microsoft");
    console.log("English 'Micro' search result:", index.search("Micro"));
    console.log("Russian 'бре' search result:", index.search("бре"));
    console.log("Russian 'енд' search result:", index.search("енд"));
    console.log("Expected empty 'Привет' search result:", index.search("Привет"));
  </script>
</body>
</html>
Good luck finding a solution!
@gmfmi thank you for your help anyway!
Okay, I will post my idiotic quasi-solution here in case anyone needs it: apparently, there is some kind of problem with the options wrapper in react-use-flexsearch (you can look here). It uses
const rawResults = index.search(query, searchOptions)
which created a problem for me (I guess; a proper human being would dig into the reasoning, but I'm too lazy). As I realised, I need to pass the options to FlexSearch.create(...) as
const importedIndex = FlexSearch.create({
  split: /\s+/,
  tokenize: "reverse",
  encode: function(str) {
    // Ordered [pattern, replacement] pairs. An object cannot hold the " " key
    // twice (the second entry overwrites the first), so an array is used.
    var regexp_replacements = [
      [/[àáâãäå]/g, "a"],
      [/[èéêë]/g, "e"],
      [/[ìíîï]/g, "i"],
      [/[òóôõöő]/g, "o"],
      [/[ùúûüű]/g, "u"],
      [/[ýŷÿ]/g, "y"],
      [/ñ/g, "n"],
      [/[ç]/g, "c"],
      [/ß/g, "s"],
      [/[-/]/g, " "],
      [/['!"#$%&\\()\*+,-./:;<=>?@[\]^_`{|}~]/g, ""],
      [/\s+/g, " "]
    ];
    str = str.toLowerCase();
    for (var pair of regexp_replacements) {
      str = str.replace(pair[0], pair[1]);
    }
    return str === " " ? "" : str;
  }
})
which is a fabulous solution by @gmfmi (please give him a thumbs up!). I guess a simpler const importedIndex = FlexSearch.create(searchOptions) will also be okay.
This, of course, implies that I just hard-coded an edited version of the react-use-flexsearch index file locally on the site. Kind of a shitshow.
PS: maybe this should be moved somewhere else.
@mryodo I realize I'm replying almost a year later 😅, but you can pass a FlexSearch instance directly to useFlexSearch. By doing that, you shouldn't need a hard-coded, edited version of the hook in your project.
const importedIndex = FlexSearch.create({
  split: /\s+/,
  tokenize: "reverse",
  encode: function (str) {
    // Ordered [pattern, replacement] pairs. An object cannot hold the " " key
    // twice (the second entry overwrites the first), so an array is used.
    var regexp_replacements = [
      [/[àáâãäå]/g, "a"],
      [/[èéêë]/g, "e"],
      [/[ìíîï]/g, "i"],
      [/[òóôõöő]/g, "o"],
      [/[ùúûüű]/g, "u"],
      [/[ýŷÿ]/g, "y"],
      [/ñ/g, "n"],
      [/[ç]/g, "c"],
      [/ß/g, "s"],
      [/[-/]/g, " "],
      [/['!"#$%&\\()\*+,-./:;<=>?@[\]^_`{|}~]/g, ""],
      [/\s+/g, " "],
    ]
    str = str.toLowerCase()
    for (var pair of regexp_replacements) {
      str = str.replace(pair[0], pair[1])
    }
    return str === " " ? "" : str
  },
})
importedIndex.import(yourExistingIndex)
const results = useFlexSearch(query, importedIndex)
Ideally, you can instantiate the index outside your React component and call importedIndex.import only once within your component for better performance.
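The "create once, import once" idea can be sketched without any framework. This is only an illustration under stated assumptions: `makeIndexLoader` and `stubIndex` are hypothetical names, and the stub stands in for a real FlexSearch instance so the snippet runs on its own.

```javascript
// Framework-free sketch of importing serialized index data a single time.
// `makeIndexLoader` and `stubIndex` are hypothetical; a real FlexSearch
// instance would take the place of the stub.
function makeIndexLoader(index) {
  var loaded = false;
  return function load(serialized) {
    if (!loaded) {
      index.import(serialized); // potentially expensive, so run it only once
      loaded = true;
    }
    return index;
  };
}

var importCalls = 0;
var stubIndex = { import: function () { importCalls++; } };
var loadIndex = makeIndexLoader(stubIndex);

loadIndex("serialized-index-data");
loadIndex("serialized-index-data"); // second call skips the import
console.log(importCalls); // 1
```

Inside a React component, the same effect is usually achieved with a hook such as useMemo or useEffect rather than a hand-written guard.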
If you like, we can open a new thread in "Discussions" to improve this use case. The new versions (>= 0.7.x) have some improvements to the "normalize" charset.
According to https://github.com/nextapps-de/flexsearch/issues/51, to search by Cyrillic symbols we can use the options below.
But these options break searching by Latin symbols.
For example, a search for бренд Microsoft (brand Microsoft in English) doesn't work.
How can I fix it?
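Some background on why split: /\s+/ shows up in every answer in this thread: in JavaScript, \W means "anything other than A-Z, a-z, 0-9 and underscore", so Cyrillic letters are treated as separators. Assuming the default split rule in 0.6.x is /\W+/ (as the README describes; not verified here), a Cyrillic word would be discarded before it ever reaches the index. The snippet below uses only plain JavaScript to show the difference.

```javascript
// JavaScript's \W matches everything except [A-Za-z0-9_],
// so Cyrillic letters are consumed as separators.
var text = "бренд Microsoft";

console.log(text.split(/\W+/)); // ["", "Microsoft"]
console.log(text.split(/\s+/)); // ["бренд", "Microsoft"]
```

Splitting on whitespace keeps both words, which is why it works for mixed Latin and Cyrillic content.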