Match diacritics option

CamilleScholtz commented 2 years ago

I don't know if this is already possible, but I would really like the query "guenon" to match both "Guénon" and "Guenon".

lucaong commented 2 years ago

Hi @onodera-punpun , there is no out-of-the-box treatment of diacritics (because the appropriate way to do that depends on the specific use-case), but it is possible to achieve what you want by normalizing terms using the processTerm option. Here is an example:

const normalizeDiacritics = (str) => {
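  // This solution works with many diacritics on most browsers, but not all
  // (e.g. the Polish `ł` is not handled correctly).
  // Alternatively, one could use a package like:
  // https://www.npmjs.com/package/normalize-diacritics
  //
  // NFKD splits an accented letter into base letter + combining mark; the
  // replace then drops every character outside [A-Za-z0-9_], which removes
  // the marks along with any other non-ASCII character.
  return str.normalize('NFKD').replace(/[^\w]/g, '')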
}

const miniSearch = new MiniSearch({
  fields: [/* ... */],
  processTerm: (term) => {
    return normalizeDiacritics(term).toLowerCase()
  }
})

I hope this helps!
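
To see what the normalizer does with a few inputs, here is a minimal sketch. `stripCombiningMarks` is a hypothetical variant (not part of MiniSearch) that removes only Unicode combining marks instead of every non-word character. It assumes a runtime that supports `\p{M}` in regular expressions with the `u` flag.

```javascript
// The normalizer from the comment above, reproduced for comparison.
const normalizeDiacritics = (str) =>
  str.normalize('NFKD').replace(/[^\w]/g, '')

// Hypothetical variant: drop only combining marks, keep every other character.
const stripCombiningMarks = (str) =>
  str.normalize('NFKD').replace(/\p{M}/gu, '')

console.log(normalizeDiacritics('Guénon')) // Guenon
console.log(stripCombiningMarks('Guénon')) // Guenon

// `ł` has no Unicode decomposition, so neither version turns it into `l`:
console.log(normalizeDiacritics('Złoty')) // Zoty (the letter is dropped)
console.log(stripCombiningMarks('Złoty')) // Złoty (the letter is kept)
```

Either function can be plugged into processTerm. Which behavior is preferable for letters like `ł` depends on the use case.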

CamilleScholtz commented 2 years ago

Exactly what I want, thanks!

odinho commented 2 years ago

Almost what I'd want. But I would actually want a search for "guénon" to match only the entry with the diacritics. I've stuffed the index with both terms in the tokenizer, but then searching for "gu" matches both "guénon" and my stuffed "guenon" term, so its score is artificially high.

Is there a good way to do something like this without too big side effects?

lucaong commented 2 years ago

Hi @odinho , just to check if I understand your requirement: you want a search for guenon to match both guénon and guenon, but a search for guénon should only match guénon and not guenon. Is that correct?

Doing exactly what you want is hard. One possibility could be to index the field without removing diacritics, but also create another "virtual field" that normalizes diacritics. Then you can perform the search without normalization of diacritics, and fall back to a search with normalization on the normalized field if the first one gave no result, or possibly search in both fields with different boosting. The details depend on your exact requirements, but in general it won't be too simple.
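
A rough sketch of the fallback variant could look like the following. The field names `title` and `titlePlain` and the helpers `stripDiacritics` and `searchWithFallback` are hypothetical. The sketch assumes the `extractField` and `processTerm` constructor options and the `fields` and `processTerm` search options behave as described in the MiniSearch documentation, so check them against the version you use.

```javascript
// Remove combining marks after canonical decomposition ("é" -> "e").
const stripDiacritics = (term) =>
  term.normalize('NFD').replace(/\p{M}/gu, '')

// Options for the index: `title` keeps diacritics, while `titlePlain` is a
// "virtual field" derived from the same text with diacritics removed.
const indexOptions = {
  fields: ['title', 'titlePlain'],
  // Compute the virtual field from `title` instead of reading a property.
  extractField: (doc, fieldName) =>
    fieldName === 'titlePlain' ? doc.title : doc[fieldName],
  // At indexing time the field name is passed in, so only the virtual
  // field loses its diacritics.
  processTerm: (term, fieldName) => {
    const lower = term.toLowerCase()
    return fieldName === 'titlePlain' ? stripDiacritics(lower) : lower
  }
}

// Search the exact field first; use the normalized field only when the
// exact pass returns nothing. `index` is assumed to be a MiniSearch
// instance created with `new MiniSearch(indexOptions)`.
const searchWithFallback = (index, query) => {
  const exact = index.search(query, {
    fields: ['title'],
    processTerm: (term) => term.toLowerCase()
  })
  if (exact.length > 0) return exact
  return index.search(query, {
    fields: ['titlePlain'],
    processTerm: (term) => stripDiacritics(term.toLowerCase())
  })
}
```

Note that with this fallback a query without diacritics that already has an exact hit returns only the exact hits. Searching both fields with different boosts would behave differently.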

My recommendation:

If the normalized version is considered equivalent to the non-normalized one, then normalize as explained in my first comment. For example, the German ß is equivalent to a double s, so Straße and Strasse are equivalent, and one can safely normalize both the documents and the search queries to the second.
If the normalized version is not equivalent, and normalization is only done to support common misspellings, then either do not normalize and rely instead on fuzzy match, or create a separate normalized field, possibly boosted less. For example, the Polish ł is different from l, so one might expect Złoty to be misspelled as Zloty, but the second one is not actually correct. Fuzzy match would match both, while giving a higher score to the exact match.

sandstrom commented 2 years ago

The search library Sifter (https://github.com/brianreavis/sifter.js/) has support for diacritics. Maybe some inspiration could be taken from that library?

Their implementation is here: https://github.com/brianreavis/sifter.js/blob/master/lib/sifter.js#L444

Could be useful both to solve this outside core, e.g. using the processTerm hook, or if this would be moved into core.

Also, if this isn't moved into core, maybe it could be explained in the Wiki or Readme, how to approach it.

lucaong commented 2 years ago

Thanks @sandstrom , I will look into it.

While I think this should not be included in the core, I agree it would be useful to include it in a “how to” guide.

lucaong commented 2 years ago

The implementation in sifter.js is a good reference. Some diacritics though are better normalized to more than one character, for example the German “ü” is usually considered equivalent to “ue”, not to “u”. Ultimately it’s a use-case specific choice, but sifter’s approach is a reasonable one.
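
As an illustration of the multi-character case, here is a small sketch of a term normalizer that expands a few German letters before stripping the remaining marks. The function name `normalizeWithExpansions` and the table contents are assumptions for the example. It is not how sifter.js does it, and the table is deliberately incomplete.

```javascript
// Illustrative expansion table (common German transliteration).
// Which entries belong here is a use-case decision.
const EXPANSIONS = new Map([
  ['\u00e4', 'ae'], // ä
  ['\u00f6', 'oe'], // ö
  ['\u00fc', 'ue'], // ü
  ['\u00df', 'ss']  // ß
])

const normalizeWithExpansions = (term) =>
  term
    .toLowerCase()
    // Compose first, so "u" followed by a combining diaeresis is seen as "ü".
    .normalize('NFC')
    // Expand the letters listed in the table to their multi-character form.
    .replace(/[\u00e4\u00f6\u00fc\u00df]/g, (ch) => EXPANSIONS.get(ch))
    // Strip whatever combining marks are left ("é" -> "e").
    .normalize('NFD')
    .replace(/\p{M}/gu, '')

console.log(normalizeWithExpansions('Müller')) // mueller
console.log(normalizeWithExpansions('Straße')) // strasse
console.log(normalizeWithExpansions('Guénon')) // guenon
```

A function like this could be passed as processTerm, so that documents and queries are normalized the same way.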

sandstrom commented 2 years ago

@lucaong 👍🏻

lucaong / minisearch

Match diacritics option #153