rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License

perform unicode normalization #319

Open maxbachmann opened 1 year ago

maxbachmann commented 1 year ago

The matching results could be improved by applying Unicode normalization to the input strings. This should be implemented as a processor function, since users may still be interested in the distance without normalization. In addition, it would be surprising if Levenshtein.distance(s1, s2) differed from len(Levenshtein.editops(s1, s2)). At the same time, normalization cannot be applied inside Levenshtein.editops, since each edit operation needs to map back to a specific character position in the source string.
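To illustrate why normalization changes the result: the canonically equivalent strings "é" (U+00E9) and "e" + U+0301 (combining acute accent) compare as unequal code point by code point. A minimal sketch using only the standard library, where the `levenshtein` helper is a plain stand-in for Levenshtein.distance so the example is self-contained:

```python
import unicodedata

def levenshtein(s1: str, s2: str) -> int:
    """Plain dynamic-programming Levenshtein distance (stand-in for
    rapidfuzz's Levenshtein.distance, so this sketch needs no imports
    beyond the standard library)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (c1 != c2)))    # substitution
        prev = cur
    return prev[-1]

composed = "caf\u00e9"        # "é" as a single precomposed code point
decomposed = "cafe\u0301"     # "e" followed by a combining acute accent

# Without normalization the strings differ by two edits ...
print(levenshtein(composed, decomposed))  # 2

# ... but after NFC normalization they are identical.
nfc = lambda s: unicodedata.normalize("NFC", s)
print(levenshtein(nfc(composed), nfc(decomposed)))  # 0
```

This is exactly why a normalizing processor cannot be used for Levenshtein.editops: after normalization, the character positions no longer correspond to the original source string.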

It would probably make sense to update utils.default_process to normalize strings as well.
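A normalization-aware variant of the default processor might look like the sketch below. `default_process_normalized` is a hypothetical name; it approximates default_process's documented behavior (lowercase, replace non-alphanumeric characters with whitespace, trim) with an NFC normalization step added up front:

```python
import re
import unicodedata

def default_process_normalized(s: str) -> str:
    """Hypothetical variant of utils.default_process that also applies
    Unicode NFC normalization before the usual preprocessing steps."""
    s = unicodedata.normalize("NFC", s)
    # Approximate default_process: lowercase, replace each non-alphanumeric
    # character with a space, strip leading/trailing whitespace.
    # (\W keeps underscores, so this is only an approximation.)
    s = re.sub(r"\W", " ", s.lower())
    return s.strip()

# Precomposed and decomposed spellings now preprocess to the same string.
print(default_process_normalized("Caf\u00e9!"))
print(default_process_normalized("Cafe\u0301!"))
```

Such a processor could then be passed via the usual `processor=` keyword, so callers who need the raw, unnormalized distance are unaffected.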