seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0
9.21k stars 874 forks

behavior of partial_token_set_ratio #251

Open pagpires opened 4 years ago

pagpires commented 4 years ago

According to the algorithm, partial_token_set_ratio currently returns 100 for any two strings a and b as long as they share at least one token, right? Is this behavior expected?
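To make the question concrete, here is a minimal sketch of why a single shared token is enough. It assumes the token-set logic works the way I read it: the sorted intersection is compared against "intersection + remainder" strings, and the best of the pairwise scores is returned. The names `simple_partial_ratio` and `sketch_partial_token_set_ratio` are hypothetical stand-ins built on the standard library's `difflib`, not fuzzywuzzy's actual code, and the preprocessing is simplified to lowercasing and splitting on whitespace.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(a: str, b: str) -> int:
    """Best 0-100 similarity of the shorter string against any equally
    long window of the longer one (simplified stand-in, an assumption)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


def sketch_partial_token_set_ratio(a: str, b: str) -> int:
    """Sketch of the token-set comparison with a partial ratio plugged in."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    # Sorted shared tokens, then each side's leftover tokens appended to them.
    shared = " ".join(sorted(tokens_a & tokens_b))
    combined_a = (shared + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    combined_b = (shared + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    # `shared` is always a prefix of both combined strings, so once it is
    # non-empty the first two comparisons find it verbatim and score 100.
    return max(
        simple_partial_ratio(shared, combined_a),
        simple_partial_ratio(shared, combined_b),
        simple_partial_ratio(combined_a, combined_b),
    )
```

Under these assumptions, `sketch_partial_token_set_ratio("the cat", "the dog")` comes out as 100, because "the" is found verbatim inside "the cat".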

polm commented 3 years ago

I ran into this too. It can cause two strings that have only the word "the" in common to get an 86-point match when using WRatio / process.extract, which is not very useful.
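The 86 is consistent with how I understand WRatio to weight its partial scores. The sketch below assumes the scaling constants I recall from fuzz.WRatio (0.95 for token-based scores, 0.90 for partial scores when one string is between 1.5 and 8 times longer than the other); treat them as assumptions and check them against the installed version. The function name `scaled_partial_token_set_score` is hypothetical.

```python
# Assumed constants, as recalled from fuzz.WRatio; verify before relying on them.
UNBASE_SCALE = 0.95   # applied to token-based scores
PARTIAL_SCALE = 0.90  # applied to partial scores for moderate length differences


def scaled_partial_token_set_score(raw_score: int) -> int:
    """Scale a raw partial_token_set_ratio score the way WRatio is assumed to."""
    return round(raw_score * UNBASE_SCALE * PARTIAL_SCALE)


# A raw score of 100 from one shared token becomes 100 * 0.95 * 0.90 = 85.5,
# which rounds to 86.
print(scaled_partial_token_set_score(100))
```

Since WRatio takes the maximum over its component scores, this scaled value would win whenever the other components score lower.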

NightMachinery commented 3 years ago

In [6]: from fuzzywuzzy import fuzz
   ...: fuzz.partial_token_sort_ratio("physics 2 vid", "study physics physics 2")        # 77
   ...: fuzz.partial_token_sort_ratio("physics 2 vid", "study physics physics 2 video")  # 77
   ...: fuzz.partial_token_set_ratio("physics 2 vid", "study physics physics 2")         # 100
   ...: fuzz.partial_token_set_ratio("physics 2 vid", "study physics physics 2 video")   # 100
   ...: fuzz.token_set_ratio("physics 2 vid", "study physics physics 2")                 # 82
   ...: fuzz.token_set_ratio("physics 2 vid", "study physics physics 2 video")           # 82
   ...: fuzz.partial_ratio("physics 2 vid", "study physics physics 2")                   # 69
   ...: fuzz.partial_ratio("physics 2 vid", "study physics physics 2 video")             # 100

This behavior seems broken to me, as only partial_ratio sees "study physics physics 2 video" as the better alternative.