seatgeek / thefuzz

Fuzzy String Matching in Python
MIT License
2.76k stars 136 forks source link

The library doesn't use Levenshtein at all, or at least it's inconsistent #53

Open R-N opened 1 year ago

R-N commented 1 year ago

This library is described as fuzzy string matching with Levenshtein distance. However, it doesn't seem to use Levenshtein at all?

fuzz.ratio("tide", "diet") returns:

This is because python-Levenshtein uses Indel distance for ratio instead of Levenshtein. Issue 47 in python-Levenshtein mentioned that "Levenshtein.ratio is based upon the Indel distance and not on the Levenshtein distance". (That's the correct repo I think). Indel distance is similar to Levenshtein, but they are not the same and may produce different results.

But it doesn't seem to just end there. difflib, the fallback for when you don't have python-Levenshtein installed, doesn't seem to use Levenshtein distance either. Issue 128 in the older repository/version of this package mentioned that difflib uses "Ratcliff-Obershelp algorithm, not the Levenshtein distance".

To conclude:

Note: This library uses ratio function from Levenshtein package or difflib for all of its function with ratio in its name.

maxbachmann commented 1 year ago

Note that the Indel distance is the same as the Levenshtein distance with substitutions weighted as 2. Still I agree this is misleading. I think to a part this stems from python-Levenshtein, where Levenshtein.distance calculates the Levenshtein distance, but Levenshtein.ratio calculates the normalized Indel similarity (In my defense this was implemented before I became the maintainer of python-Levenshtein :sweat_smile: )

R-N commented 1 year ago

Note that the Indel distance is the same as the Levenshtein distance with substitutions weighted as 2.

Yeah, I also mentioned they're similar. Still, they're not the same.

I think to a part this stems from python-Levenshtein, where Levenshtein.distance calculates the Levenshtein distance, but Levenshtein.ratio calculates the normalized Indel similarity

Yup. I mentioned that too. I can't help but wonder why. The library is literally named Levenshtein. It could've been an option instead of the default, having "Indel" in the function name like it does for Jaro, or as an optional parameter.

Well, I think the readme should at least be changed to tell people about this. It's where they'd read first and currently it says "It uses Levenshtein Distance" at the very beginning, while it doesn't.

maxbachmann commented 1 year ago

Yes no clue why the original author wrote it like this :man_shrugging: