seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0
9.21k stars, 874 forks

Partial_Ratio not working #279

Open aW3st opened 4 years ago

aW3st commented 4 years ago

I'm seeing some odd results from partial_ratio. Here's the code:

from fuzzywuzzy import fuzz

test_string = ('completed transactions settlement date trade date '
               'symbol name transaction type account type quantity price commissions & fees amount '
               '12/23 12/23 dividend '
               'appreciation etf dividend - - - $441.99 12/23 12/23 '
               'vig dividend appreciation etf reinvestment cash')

'etf' in test_string  # returns True
fuzz.partial_ratio('etf', test_string)

Without python-Levenshtein this returns 33; with python-Levenshtein it returns 67. My understanding of the method is that it should return 100, since there's a substring that's a perfect match. Any ideas?

(On Python 3.8, by the way.)

XDGFX commented 4 years ago

I'm having the same issue. I would also expect a score of 100 from the calls below:

>>> artists_a
'carvar & clock'
>>> artists_b
'carvar clock'
>>> fuzz.partial_ratio(artists_a, artists_b)
83
>>> fuzz.partial_ratio(artists_b, artists_a)
83

I also tried without python-Levenshtein, as suggested in #79, but got exactly the same result.

XDGFX commented 4 years ago

A possible workaround is to replace partial_ratio with partial_token_sort_ratio, as mentioned in this Stack Overflow answer. It seemed to work as expected for both of our examples.

maxbachmann commented 4 years ago

partial_ratio searches for the best alignment between two strings and then calculates fuzz.ratio for that alignment. In @aW3st's case the word 'etf' is part of the second string, so you would expect a result of 100. That's not the case in your example, @XDGFX: neither of 'carvar & clock' and 'carvar clock' is a substring of the other. With partial_token_sort_ratio it works, because it re-sorts the words to 'carvar clock &' and 'carvar clock', and afterwards 'carvar clock' is a substring of 'carvar clock &' ;)
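To make the "best alignment" idea concrete, here is a minimal brute-force sketch that uses only the standard library. It is not fuzzywuzzy's actual implementation (which derives candidate alignments from matching blocks); the function name `brute_force_partial_ratio` is hypothetical, and it simply scores the shorter string against every same-length window of the longer one.

```python
from difflib import SequenceMatcher


def brute_force_partial_ratio(s1: str, s2: str) -> int:
    """Best ratio of the shorter string against each same-length window of the longer one."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        # autojunk=False keeps difflib's popularity heuristic out of the comparison
        score = SequenceMatcher(None, shorter, window, autojunk=False).ratio()
        best = max(best, score)
    return int(round(100 * best))


# 'etf' occurs verbatim in the longer string, so one window is identical to it
etf_score = brute_force_partial_ratio('etf', 'dividend appreciation etf reinvestment cash')

# no 12-character window of 'carvar & clock' equals 'carvar clock'
artist_score = brute_force_partial_ratio('carvar & clock', 'carvar clock')

print(etf_score, artist_score)
```

Under this sketch the first score is 100 because an identical window exists, while the second stays below 100 because the '&' prevents any window from matching exactly.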

@aW3st you tried both with and without python-Levenshtein, and the two results are wrong for different reasons. 1) python-Levenshtein has a known bug in finding the optimal alignment between strings, which is probably the bug you're encountering here as well. You can find it here: https://github.com/seatgeek/fuzzywuzzy/issues/79#issue-58664443 2) Without python-Levenshtein, fuzzywuzzy falls back to difflib. Here the problem appears to come from difflib's automatic junk heuristic, which is enabled by default. So it would be required to change https://github.com/seatgeek/fuzzywuzzy/blob/2188520502b86375cf2610b5100a56935417671f/fuzzywuzzy/fuzz.py#L46 to

m = SequenceMatcher(None, shorter, longer, autojunk=False)
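The effect of the junk heuristic can be checked with difflib alone. The sketch below is an illustration, not part of fuzzywuzzy: the helper `longest_match_size` is a hypothetical name, and it only compares the longest matching block that difflib finds for 'etf' with the heuristic on and off. According to the difflib documentation, the heuristic applies once the second sequence is at least 200 items long and treats items that make up more than about 1% of it as "popular", which in this string includes common letters such as 'e' and 't'.

```python
from difflib import SequenceMatcher

test_string = ('completed transactions settlement date trade date '
               'symbol name transaction type account type quantity price commissions & fees amount '
               '12/23 12/23 dividend '
               'appreciation etf dividend - - - $441.99 12/23 12/23 '
               'vig dividend appreciation etf reinvestment cash')


def longest_match_size(needle: str, haystack: str, autojunk: bool) -> int:
    """Size of the longest block of `needle` that difflib finds inside `haystack`."""
    matcher = SequenceMatcher(None, needle, haystack, autojunk=autojunk)
    match = matcher.find_longest_match(0, len(needle), 0, len(haystack))
    return match.size


# Default behaviour: popular characters are skipped when matches are seeded
with_heuristic = longest_match_size('etf', test_string, autojunk=True)

# Heuristic disabled: every character of the haystack is a match candidate
without_heuristic = longest_match_size('etf', test_string, autojunk=False)

print(len(test_string), with_heuristic, without_heuristic)
```

With the heuristic disabled the full three-character match is found, whereas the default setting finds a shorter one, so the alignment that partial_ratio builds on top of it starts from the wrong place.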

As a side note, my library rapidfuzz provides the same string matching algorithm without this problem, so your example string returns a score of 100, as you expected.

aW3st commented 4 years ago

Thanks Max, I'll give your library a shot!

thomkav commented 4 years ago

@maxbachmann Hi Max, I'm working with @aW3st on a project. We've swapped fuzzywuzzy for your library, and we're seeing great performance. Thanks!