Closed CamilleScholtz closed 2 years ago
Hi @onodera-punpun ,
there is no out-of-the-box treatment of diacritics (because the appropriate way to do that depends on the specific use-case), but it is possible to achieve what you want by normalizing terms using the processTerm option. Here is an example:
const normalizeDiacritics = (str) => {
  // This solution works with many diacritics on most browsers, but not all
  // (e.g. the Polish `ł` is not handled correctly).
  // Alternatively, one could use a package like:
  // https://www.npmjs.com/package/normalize-diacritics
  return str.normalize('NFKD').replace(/[^\w]/g, '')
}

const miniSearch = new MiniSearch({
  fields: [/* ... */],
  processTerm: (term) => {
    return normalizeDiacritics(term).toLowerCase()
  }
})
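As a possible variant, here is a minimal sketch that strips only Unicode combining marks after decomposition, instead of every non-word character. The `[^\w]` pattern above also removes any non-ASCII letter that has no decomposition (so `ł` disappears entirely), while this version leaves such letters untouched. The names `foldDiacritics` and `processTerm` are illustrative, and the `\p{M}` property escape assumes a runtime with Unicode regex support.

```javascript
// Sketch: fold diacritics by removing combining marks only.
// `foldDiacritics` is an illustrative name, not part of any library.
const foldDiacritics = (str) =>
  str
    .normalize('NFKD')       // decompose, e.g. "é" becomes "e" + U+0301
    .replace(/\p{M}/gu, '')  // drop the combining marks, keep everything else

// Same shape as the processTerm option shown above
const processTerm = (term) => foldDiacritics(term).toLowerCase()

console.log(processTerm('Guénon')) // "guenon"
console.log(processTerm('Złoty'))  // "złoty" (ł has no decomposition, so it is kept)
```

Letters such as `ł` would still need an explicit replacement if they should be treated as `l`.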
I hope this helps!
Exactly what I want, thanks!
Almost what I'd want. But I would actually want a search for "guénon" to match only the one with the diacritics. I've stuffed the index with both terms in the tokenizer, but then a search for "gu" matches both "guénon" and my stuffed "guenon" term, so its score is artificially high.
Is there a good way to do something like this without too big side effects?
Hi @odinho ,
just to check if I understand your requirement: you want a search for guenon to match both guénon and guenon, but a search for guénon should only match guénon and not guenon. Is that correct?
Doing exactly what you want is hard. One possibility could be to index the field without removing diacritics, but also create another "virtual field" that normalizes diacritics. Then you can perform the search without normalization of diacritics, and fall back to a search with normalization on the normalized field if the first one gave no result, or possibly search in both fields with different boosting. Details vary depending on what you actually want, but in general it won't be too simple.
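To make the "two fields with different boosting" idea concrete, here is a rough sketch in plain JavaScript. It does not use the MiniSearch API: the two Maps are a toy stand-in for an exact field and a diacritics-folded virtual field, and `fold`, `buildIndex`, `search` and `exactBoost` are all hypothetical names. The query is looked up as typed in both fields, so a query with diacritics can only hit the exact field.

```javascript
// Toy illustration only: two Maps play the role of two indexed fields.
const fold = (s) => s.normalize('NFKD').replace(/\p{M}/gu, '')

const buildIndex = (docs) => {
  const exact = new Map()   // term as written (lowercased) -> Set of doc ids
  const folded = new Map()  // term without diacritics -> Set of doc ids
  const add = (map, key, id) => {
    if (!map.has(key)) map.set(key, new Set())
    map.get(key).add(id)
  }
  for (const doc of docs) {
    const tokens = doc.text.toLowerCase().normalize('NFC').split(/\s+/).filter(Boolean)
    for (const token of tokens) {
      add(exact, token, doc.id)
      add(folded, fold(token), doc.id)
    }
  }
  return { exact, folded }
}

// Look the query up as typed in both fields, weighting exact hits higher.
const search = (index, query, exactBoost = 2) => {
  const term = query.toLowerCase().normalize('NFC')
  const scores = new Map()
  const bump = (ids, weight) => {
    for (const id of ids || []) scores.set(id, (scores.get(id) || 0) + weight)
  }
  bump(index.exact.get(term), exactBoost)
  bump(index.folded.get(term), 1)
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id)
}

const index = buildIndex([
  { id: 1, text: 'René Guénon' },
  { id: 2, text: 'guenon monkey' }
])

console.log(search(index, 'guénon')) // [ 1 ]
console.log(search(index, 'guenon')) // [ 2, 1 ]
console.log(search(index, 'rene'))   // [ 1 ]
```

With a real search library the same idea would presumably map to two indexed fields and a per-field boost at search time; prefix and fuzzy matching are left out of this sketch.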
My recommendation:
ß is equivalent to a double s, so Straße and Strasse are equivalent, and one can safely normalize both the documents and the search queries to the second.
ł is different from l, so one might expect Złoty to be misspelled as Zloty, but the second one is not actually correct. Fuzzy match would match both, while giving a higher score to the exact match.
The search library Sifter (https://github.com/brianreavis/sifter.js/) has support for diacritics. Maybe some inspiration could be taken from that library?
Their implementation is here: https://github.com/brianreavis/sifter.js/blob/master/lib/sifter.js#L444
Could be useful both to solve this outside core, e.g. using the processTerm hook, or if this would be moved into core.
Also, if this isn't moved into core, maybe it could be explained in the Wiki or Readme, how to approach it.
Thanks @sandstrom , I will look into it.
While I think this should not be included in the core, I agree it would be useful to include it in a “how to” guide.
The implementation in sifter.js is a good reference. Some diacritics though are better normalized to more than one character, for example the German “ü” is usually considered equivalent to “ue”, not to “u”. Ultimately it’s a use-case specific choice, but sifter’s approach is a reasonable one.
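A small sketch of what such a use-case specific choice might look like: a replacement table applied before the generic stripping of combining marks, so that some characters expand to more than one letter. The table below is a hypothetical, deliberately tiny German-oriented example (it is not sifter's table), and `REPLACEMENTS` and `normalizeTerm` are illustrative names.

```javascript
// Hypothetical replacement table; extend or change it per use case.
const REPLACEMENTS = { 'ß': 'ss', 'ä': 'ae', 'ö': 'oe', 'ü': 'ue' }

const normalizeTerm = (term) =>
  term
    .toLowerCase()
    .normalize('NFC')                              // make sure "ü" is one code point
    .replace(/[ßäöü]/g, (ch) => REPLACEMENTS[ch])  // multi-character expansions first
    .normalize('NFKD')
    .replace(/\p{M}/gu, '')                        // then strip remaining combining marks

console.log(normalizeTerm('Straße')) // "strasse"
console.log(normalizeTerm('Müller')) // "mueller"
console.log(normalizeTerm('Guénon')) // "guenon"
```

A function like this could be passed as the processTerm option so that documents and queries are normalized the same way.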
@lucaong 👍🏻
I don't know if this is already possible, but I would really like the query "guenon" to match both "Guénon" and "Guenon".