seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0

Clarification between the Levenshtein Distance algorithm and how it is implemented here #248

Open samarth12 opened 4 years ago

samarth12 commented 4 years ago

I am trying to wrap my head around how fuzzywuzzy calculates the Levenshtein Distance between two strings, since the docs clearly state that it uses it.

The Levenshtein Distance algorithm finds the minimum number of edits needed to transform one string into the other, using insertion, deletion, and substitution of a single character. Each of these operations counts as a single edit when calculating the score.

Here are a couple of examples:

Example 1 s1 = 'hello' s2 = 'hell'

Levenshtein Score = 1 (it requires 1 edit: insertion of 'o')

Example 2 s1 = 'hello' s2 = 'hella'

Levenshtein Score = 1 (it requires 1 edit: substitution of 'o' with 'a')

Plugging these scores into the fuzzywuzzy formula (len(s1) + len(s2) - LevenshteinScore) / (len(s1) + len(s2)):

Example 1: (5+4-1)/9 = 89%
Example 2: (5+5-1)/10 = 90%
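The arithmetic above can be checked with a minimal sketch of the classic Levenshtein distance (every edit costs 1) and the normalized score from the formula; `levenshtein` and `similarity` are illustrative names, not fuzzywuzzy's API:

```python
def levenshtein(s1, s2):
    """Classic Levenshtein distance: insertion, deletion, and
    substitution each count as a single edit."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    # previous[j] = distance between s1[:i-1] and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (c1 != c2)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]

def similarity(s1, s2):
    """Normalized score: (len(s1) + len(s2) - distance) / (len(s1) + len(s2))."""
    total = len(s1) + len(s2)
    return round(100 * (total - levenshtein(s1, s2)) / total)

print(levenshtein('hello', 'hell'), similarity('hello', 'hell'))    # 1 89
print(levenshtein('hello', 'hella'), similarity('hello', 'hella'))  # 1 90
```

Both examples have a true Levenshtein distance of 1, so the normalized scores come out as 89% and 90%, exactly as computed above.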

Now fuzzywuzzy does return the same score for Example 1, but not for Example 2. The score for Example 2 is 80%. Investigating how it calculates the distances under the hood, I found that it counts the substitution operation as 2 operations rather than 1 (as defined for Levenshtein). I understand that it uses the difflib library, but I just want to know why it is called Levenshtein Distance when it actually is not.

I am just trying to figure out why there is a distinction here. What does it mean or explain? Basically, what is the reason for counting substitution as 2 operations rather than 1 as defined in Levenshtein Distance, while still calling it Levenshtein Distance? Does it have something to do with gaps in sentences? Is this a standard way of converting LD to a normalized similarity score?

Would love if somebody could give me some insight, thank you!

taraldb commented 4 years ago

Hi,

It uses python-Levenshtein as a replacement for difflib. The implementation uses Levenshtein internally to speed up its methods, but it doesn't actually return the raw Levenshtein distance.

The ratio is based on a real minimal edit distance.

maxbachmann commented 3 years ago

The Levenshtein distance implementation uses the following weights: insertion = 1, deletion = 1, substitution = 2, and normalizes the result the following way: 100 * (1 - distance / (len(s1) + len(s2))). This provides results that are very close to the results of difflib.

In fact, at least in the German Wikipedia article on the Levenshtein distance, this is listed as a variant called the weighted Levenshtein distance. The "normal" Levenshtein distance would be normalized using 100 * (1 - distance / max(len(s1), len(s2))).
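The weighted variant described above can be sketched as follows, assuming the weights insertion = 1, deletion = 1, substitution = 2 (the function names here are illustrative, not fuzzywuzzy's API). With these weights, 'hello' vs. 'hella' costs 2, which normalizes to exactly the 80% the asker observed, and the stdlib difflib ratio agrees on this pair:

```python
import difflib

def weighted_levenshtein(s1, s2):
    """Levenshtein distance with weights: insertion = 1,
    deletion = 1, substitution = 2."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (2 if c1 != c2 else 0)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]

def weighted_ratio(s1, s2):
    """Normalize as 100 * (1 - distance / (len(s1) + len(s2)))."""
    return round(100 * (1 - weighted_levenshtein(s1, s2) / (len(s1) + len(s2))))

print(weighted_ratio('hello', 'hella'))  # 80, the score fuzzywuzzy reports
print(weighted_ratio('hello', 'hell'))   # 89

# difflib's ratio (2 * matches / total length) gives the same 80% here:
print(round(100 * difflib.SequenceMatcher(None, 'hello', 'hella').ratio()))  # 80
```

Note that counting a substitution as 2 is equivalent to allowing only insertions and deletions (a substitution becomes one deletion plus one insertion), which is why this variant lines up with difflib's matching-blocks formula.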