seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0

Installing python-Levenshtein as suggested by the warnings gives different results. #318

Open JeremyThiesen opened 3 years ago

JeremyThiesen commented 3 years ago

I was running this code:

from fuzzywuzzy import fuzz
partial_ratio = fuzz.partial_ratio('more than fifty', 'i know that because a lion run fifty mile per hour and a cheetah run about eighty mile per hour and sixty-five be more than fifty and be slow than eighty')
print(partial_ratio)

With fuzzywuzzy version 0.18.0, this prints 100. It also emits the following UserWarning:

UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

After installing python-Levenshtein version 0.12.2, the same code prints 87, which is incorrect: the first string occurs verbatim in the second, so partial_ratio should return 100.
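The exact-match claim can be checked without any fuzzy-matching library. A minimal sketch using plain substring containment (the names `needle`, `haystack`, and `has_exact_match` are illustrative only):

```python
# The two strings from the report above.
needle = "more than fifty"
haystack = (
    "i know that because a lion run fifty mile per hour and a cheetah run "
    "about eighty mile per hour and sixty-five be more than fifty and be "
    "slow than eighty"
)

# Plain substring test: True means the shorter string occurs verbatim in
# the longer one, so a partial match score of 100 is the expected result.
has_exact_match = needle in haystack
print(has_exact_match)  # True
```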

maxbachmann commented 3 years ago

This issue has already been reported: https://github.com/seatgeek/fuzzywuzzy/issues/79

The implementation in python-Levenshtein returns incorrect results in some cases. Your options are:

1) Use the slower difflib-based version (and possibly suppress the warning).
2) Use the python-Levenshtein version, which can return incorrect results for any ratio that relies on partial_ratio.
3) Use RapidFuzz (I am the author), a fast implementation that gives results similar to the difflib-based one.
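To illustrate why the difflib-based path in option 1 scores an exact substring as 100, here is a simplified sketch of the general idea: align the shorter string against same-length windows of the longer one and keep the best ratio. It uses only the standard library. The function name `partial_ratio_sketch` is hypothetical, and this is an approximation of the approach, not fuzzywuzzy's actual code.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best similarity (0-100) of the shorter string against windows of the longer."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0  # arbitrary choice for this sketch

    # autojunk=False keeps difflib from discarding frequent characters.
    matcher = SequenceMatcher(None, shorter, longer, autojunk=False)
    best = 0.0
    for a_start, b_start, _size in matcher.get_matching_blocks():
        # Shift the window so this matching block lines up in both strings.
        start = max(b_start - a_start, 0)
        window = longer[start:start + len(shorter)]
        ratio = SequenceMatcher(None, shorter, window, autojunk=False).ratio()
        best = max(best, ratio)
    return round(100 * best)


needle = "more than fifty"
haystack = (
    "i know that because a lion run fifty mile per hour and a cheetah run "
    "about eighty mile per hour and sixty-five be more than fifty and be "
    "slow than eighty"
)
print(partial_ratio_sketch(needle, haystack))  # 100
```

Because the longest matching block covers the whole shorter string, one of the candidate windows is identical to it and the ratio reaches 1.0.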

It would be possible to fix this behavior in fuzzywuzzy/python-Levenshtein. However, since neither project is really maintained anymore, it is unclear if or when this will be fixed.