Inconsistent score from fuzz.ratio() between linux and windows

seatgeek / fuzzywuzzy

Fuzzy String Matching in Python

http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/

GNU General Public License v2.0


Inconsistent score from fuzz.ratio() between linux and windows #271

Closed by joshuamin2014 4 years ago

joshuamin2014 commented 4 years ago

Hi, I ran the following code on my Windows machine and my Linux machine. Windows returns 44, whereas Linux returns 33. Is this a known issue?

import fuzzywuzzy.fuzz
t1 = 'A L I E N S'
t2 = 'A S A L'
fuzzywuzzy.fuzz.ratio(t1, t2)

(fuzzywuzzy version 0.17.0)

maxbachmann commented 4 years ago

Hm, for me this returns 44 on Linux as well.

joshuamin2014 commented 4 years ago

FYI, it looks like this depends on python-Levenshtein (my colleague ran an experiment, and I then reproduced the same behavior): with python-Levenshtein installed it returns 44, and without it 33. I think this is a critical bug. Can somebody work on this?

maxbachmann commented 4 years ago

This happens because fuzzywuzzy falls back to difflib when python-Levenshtein is not installed, i.e. when it runs as a pure Python implementation. The issue you describe is covered in https://github.com/seatgeek/fuzzywuzzy/issues/128 and in the Readme:

python-Levenshtein (optional, provides a 4-10x speedup in String Matching, though may result in differing results for certain cases)

So if you want one behaviour or the other, you should install or not install python-Levenshtein accordingly.
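For anyone who wants to see the two numbers without switching environments, here is a minimal sketch using only the standard library. `difflib_ratio` mirrors the difflib fallback. `lcs_ratio` is an approximation of the python-Levenshtein path, assuming its ratio counts a substitution as a delete plus an insert, which makes it equal to 2 * LCS / (len(s1) + len(s2)). Both function names are made up for this illustration and are not part of fuzzywuzzy.

```python
from difflib import SequenceMatcher

def difflib_ratio(s1, s2):
    # Same formula the difflib fallback uses: 2 * M / T, where M is the
    # number of characters in the matching blocks found by SequenceMatcher.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))

def lcs_length(s1, s2):
    # Classic dynamic-programming longest common subsequence, row by row.
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_ratio(s1, s2):
    # Assumption: approximates the python-Levenshtein based score as
    # 2 * LCS / (len(s1) + len(s2)).
    total = len(s1) + len(s2)
    if total == 0:
        return 100
    return int(round(100 * 2 * lcs_length(s1, s2) / total))

t1 = 'A L I E N S'
t2 = 'A S A L'
print(difflib_ratio(t1, t2))  # 33: only the block 'A L' (3 chars) matches
print(lcs_ratio(t1, t2))      # 44: the LCS is 'A' plus three spaces (4 chars)
```

The gap comes from difflib matching contiguous blocks recursively around the longest one, while the subsequence-based score can pick up the scattered spaces.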

joshuamin2014 commented 4 years ago

Oh, thanks for the clarification! I was testing an ML algorithm on my Windows machine, and I saw different behavior when I tested on SageMaker. Mystery solved. Thanks!
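To catch this kind of mismatch between two environments early, one option is to check at startup whether python-Levenshtein is importable. The sketch below assumes the package installs a top-level module named `Levenshtein`. The helper names are hypothetical, and it only reports what is installed rather than inspecting fuzzywuzzy itself.

```python
import importlib.util

def levenshtein_available():
    # find_spec returns None when no top-level module with this name is installed.
    return importlib.util.find_spec("Levenshtein") is not None

def expected_backend():
    # Assumption: fuzzywuzzy uses python-Levenshtein when it can be imported
    # and falls back to difflib otherwise.
    return "python-Levenshtein" if levenshtein_available() else "difflib"

print("String matching backend expected:", expected_backend())
```

Logging this value on both machines makes it obvious when scores are being compared across different backends.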

maxbachmann commented 4 years ago

Yes, I can understand why this can be confusing.