seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0

process.extract() with scorer partial_ratio returns wrong results #216

Open SujaySKumar opened 6 years ago

SujaySKumar commented 6 years ago

The correct result of the following call should be 100:

>>> fuzz.partial_ratio("thane", "nation hospitality honda water thane thane west")
40

Removing any word from the string "nation hospitality honda water thane thane west" results in the correct answer of 100.

This issue is reproducible in all installations, irrespective of whether python-levenshtein is installed. Versions:

fuzzywuzzy         0.16.0
python-levenshtein 0.12.0
Python             3.6

josegonzalez commented 6 years ago

Is there a reason the partial ratio result should be 100? And can you add a failing test case to our test suite to prove this?

SujaySKumar commented 6 years ago

Yes. Since the shorter string is a substring of the longer string, partial_ratio should be 100. This is described in detail in the GitHub documentation as well as in the blog post:

fuzz.ratio("YANKEES", "NEW YOR") ⇒ 14
fuzz.ratio("YANKEES", "EW YORK") ⇒ 28
fuzz.ratio("YANKEES", "W YORK ") ⇒ 28
fuzz.ratio("YANKEES", " YORK Y") ⇒ 28
...
fuzz.ratio("YANKEES", "YANKEES") ⇒ 100

and conclude that the last one is clearly the best. It turns out that “Yankees” and “New York Yankees” are a perfect partial match…the shorter string is a substring of the longer. We have a helper function for this too (and it’s far more efficient than the simplified algorithm I just laid out)

fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100
fuzz.partial_ratio("NEW YORK METS", "NEW YORK YANKEES") ⇒ 69
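For reference, the "simplified algorithm" the blog excerpt describes (score the shorter string against every same-length window of the longer one and keep the best) can be sketched with the standard library alone. This is only an illustration of the expected behaviour, not fuzzywuzzy's actual implementation; the function name `brute_force_partial_ratio` is hypothetical, and `difflib.SequenceMatcher` stands in for whatever scorer the library uses internally.

```python
from difflib import SequenceMatcher


def brute_force_partial_ratio(s1: str, s2: str) -> int:
    """Best 0-100 score of the shorter string against every
    same-length window of the longer string (hypothetical helper)."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0

    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        # autojunk=False switches off difflib's automatic junk heuristic,
        # which otherwise only kicks in for long sequences.
        score = SequenceMatcher(None, shorter, window, autojunk=False).ratio()
        best = max(best, score)
        if best == 1.0:
            break  # an exact window match cannot be improved on
    return int(round(100 * best))


print(brute_force_partial_ratio(
    "thane", "nation hospitality honda water thane thane west"))
```

Because every window is tried, an exact substring always produces a window identical to the shorter string, so this sketch scores the reported input at 100. It is quadratic in the worst case, which is presumably why the library uses a faster approach.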

josegonzalez commented 6 years ago

Do you mind adding the appropriate tests to test_fuzzywuzzy.py so that CI hits it and we can see the test fails?
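A regression check along these lines might look like the sketch below. It is written against an arbitrary scorer callable so that it runs without fuzzywuzzy installed; in the real test suite the scorer passed in would be `fuzz.partial_ratio`. The helper `failing_substring_cases` and the placeholder `stand_in_scorer` are hypothetical names, not part of the library or its tests.

```python
from typing import Callable, Iterable, List, Tuple

# Inputs reported in this thread: the first string occurs verbatim in the second.
REPORTED_CASES = [
    ("thane", "nation hospitality honda water thane thane west"),
    ("thane", "t hosa na e thane ws"),
]


def failing_substring_cases(
    scorer: Callable[[str, str], int],
    cases: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, int]]:
    """Return (needle, haystack, score) for every case the scorer does not rate 100."""
    failures = []
    for needle, haystack in cases:
        # Guard the premise: each case must be a genuine substring match.
        assert needle in haystack, "each case must be a true substring match"
        score = scorer(needle, haystack)
        if score != 100:
            failures.append((needle, haystack, score))
    return failures


def stand_in_scorer(a: str, b: str) -> int:
    # Hypothetical placeholder so the sketch is self-contained;
    # swap in fuzz.partial_ratio to exercise the library itself.
    return 100 if a in b or b in a else 0


print(failing_substring_cases(stand_in_scorer, REPORTED_CASES))
```

With a scorer that honours the substring property the returned list is empty; a scorer that returns 40 for these inputs, as reported above, would show up as a failure entry.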

SujaySKumar commented 6 years ago

CI passes because it uses Python 3.5.3. The issue seems to occur in Python 3.6, and even in 3.5.6.

josegonzalez commented 6 years ago

Mind filing a PR to use 3.5.6?

lisabutti commented 5 years ago

I can confirm that this also happens in Python 3.7.

gw00207 commented 5 years ago

In Python 3.7, a shorter example gives the same result:

>>> fuzz.partial_ratio("thane", "t hosa na e thane ws")
40

Lychfindel commented 4 years ago

Is there any solution for this issue?