hiddentao / fast-levenshtein

Efficient Javascript implementation of Levenshtein algorithm with locale-specific collator support.
MIT License

Consider using Intl.Collator to handle case and accent substitutions #7

Closed (alsciende closed 8 years ago)

alsciende commented 9 years ago

I like your library, but I wanted to be more lenient on case and/or accent differences in the input.

For example, with your library:

Levenshtein.get('mikailovitch', 'Mikhaïlovitch'); // 3

Levenshtein.get('mikailovitch', 'Vikhaklovitch'); // 3

I'd like it to return 1 for 'Mikhaïlovitch', because 'm' and 'M' are "alike", and 'i' and 'ï' are "alike" as well. That way, the string 'mikailovitch' would be closer to 'Mikhaïlovitch' than to 'Vikhaklovitch'.

So I made one simple change. I added a variable in the closure:

var collator = Intl.Collator("generic", { sensitivity: "base" });

And I changed the condition for "substitution", from

nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );

to

nextCol = prevRow[j] + ( (collator.compare(str1.charAt(i), str2.charAt(j)) === 0) ? 0 : 1 );

With this simple change, I got much better results on accented letters and differences in case.
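For anyone who wants to try the idea without patching the library, here is a minimal standalone sketch. It is not the library's actual source: the function name `collatedDistance` and the two-row layout are assumptions made for illustration, and only the collator line and the substitution test are taken from the change described above. Results depend on the runtime's Intl (ICU) support and default locale.

```javascript
// Hypothetical helper for illustration only, not fast-levenshtein's code.
// "base" sensitivity treats letters that differ only by case or accent as equal.
var collator = new Intl.Collator("generic", { sensitivity: "base" });

function collatedDistance(str1, str2) {
  // prevRow[j] is the distance between the already processed prefix of str1
  // and the first j characters of str2.
  var prevRow = [];
  for (var j = 0; j <= str2.length; j++) {
    prevRow[j] = j;
  }

  for (var i = 0; i < str1.length; i++) {
    var curRow = [i + 1];
    for (var k = 0; k < str2.length; k++) {
      // Substitution is free when the collator considers the characters alike.
      var alike = collator.compare(str1.charAt(i), str2.charAt(k)) === 0;
      var substitution = prevRow[k] + (alike ? 0 : 1);
      var insertion = curRow[k] + 1;
      var deletion = prevRow[k + 1] + 1;
      curRow[k + 1] = Math.min(substitution, insertion, deletion);
    }
    prevRow = curRow;
  }

  return prevRow[str2.length];
}

console.log(collatedDistance('mikailovitch', 'Mikha\u00eflovitch')); // 1
console.log(collatedDistance('mikailovitch', 'Vikhaklovitch'));      // 3
```

The only remaining edit for 'Mikhaïlovitch' is the inserted 'h', while 'Vikhaklovitch' still needs two substitutions and one insertion. Note that this compares one UTF-16 code unit at a time, so an accent written as a separate combining character would not be matched with its base letter.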

kaore commented 9 years ago

+1

hiddentao commented 8 years ago

Done. Pushed v2 with a slight performance update.