Closed wavenator closed 3 years ago
FuzzyWuzzy uses either difflib as a pure Python implementation or python-Levenshtein. From your results you used the version using python-Levenshtein. In fuzz.partial_ratio there are two relevant algorithms: 1) algorithm to calculate the optimal alignment 2) algorithm to calculate the normalized edit distance
It tries to find a substring of len(shorter) in the longer string (the substring can be shorter if it is placed at the end of the longer string). To find this substring fuzzywuzzy uses difflib.SequenceMatcher with get_matching_blocks (In fact this is not a valid way to find the best aligned substring, but works relatively well in most cases). The implementation of this get_matching_blocks method in python-Levenshtein is completely broken, so for RapidFuzz I decided to use: 1) The implementation of difflib for the optimal alignment 2) the Levenshtein distance for the normalized edit distance So basically a combination of these two variants of FuzzyWuzzy
In [7]: rapidfuzz.fuzz.partial_ratio('vodafone', 'bentley')
Out[7]: 44.44444444444444
In [8]: fuzzywuzzy.fuzz.partial_ratio('vodafone', 'bentley')
Out[8]: 29
This is an example, where python-Levenshtein has incorrect results. difflib returns the following matching blocks:
>>> Levenshtein.StringMatcher.StringMatcher(None, 'bentley', 'vodafone').get_matching_blocks()
[(7, 8, 0)]
>>> difflib.SequenceMatcher(None, 'bentley', 'vodafone').get_matching_blocks()
[Match(a=1, b=7, size=1), Match(a=7, b=8, size=0)]
This has the result, that RapidFuzz tests the following alignments:
bentley <-> ne -> normalized Levenshtein = 44.44444444444444
bentley <-> odafone -> normalized Levenshtein = 28.57
While FuzzyWuzzy using difflib tests the following alignments:
bentley <-> ne -> difflib.SequenceMatcher.ratio = 22
bentley <-> odafone -> difflib.SequenceMatcher.ratio = 14
and FuzzyWuzzy using python-Levenshtein tests these alignments:
bentley <-> odafone -> normalized Levenshtein = 29
The same applies to the second example. This broken implementation is already known for quite a while: https://github.com/seatgeek/fuzzywuzzy/issues/79, but apparently the results are good enough for SeatGeek and faster than the pure Python implementation.
I appreciate your thorough and proficient explanation. It makes a lot of sense now.
These strings return different results depending on which library you are using.