rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.58k stars 115 forks source link

The results differ significantly from FuzzyWuzzy #112

Closed wavenator closed 3 years ago

wavenator commented 3 years ago

These strings return different results depending on which library you are using.

In [7]: rapidfuzz.fuzz.partial_ratio('vodafone', 'bentley')
Out[7]: 44.44444444444444

In [8]: fuzzywuzzy.fuzz.partial_ratio('vodafone', 'bentley')
Out[8]: 29
In [3]: rapidfuzz.fuzz.partial_ratio('vodafone', 'blablalbaasjd')
Out[3]: 18.181818181818187

In [4]: fuzzywuzzy.fuzz.partial_ratio('vodafone', 'blablalbaasjd')
Out[4]: 12
maxbachmann commented 3 years ago

FuzzyWuzzy uses either difflib as a pure Python implementation or python-Levenshtein. From your results you used the version using python-Levenshtein. In fuzz.partial_ratio there are two relevant algorithms: 1) algorithm to calculate the optimal alignment 2) algorithm to calculate the normalized edit distance

It tries to find a substring of len(shorter) in the longer string (the substring can be shorter if it is placed at the end of the longer string). To find this substring fuzzywuzzy uses difflib.SequenceMatcher with get_matching_blocks (In fact this is not a valid way to find the best aligned substring, but works relatively well in most cases). The implementation of this get_matching_blocks method in python-Levenshtein is completely broken, so for RapidFuzz I decided to use: 1) The implementation of difflib for the optimal alignment 2) the Levenshtein distance for the normalized edit distance So basically a combination of these two variants of FuzzyWuzzy

In [7]: rapidfuzz.fuzz.partial_ratio('vodafone', 'bentley')
Out[7]: 44.44444444444444

In [8]: fuzzywuzzy.fuzz.partial_ratio('vodafone', 'bentley')
Out[8]: 29

This is an example, where python-Levenshtein has incorrect results. difflib returns the following matching blocks:

>>> Levenshtein.StringMatcher.StringMatcher(None, 'bentley', 'vodafone').get_matching_blocks()
[(7, 8, 0)]
>>> difflib.SequenceMatcher(None, 'bentley', 'vodafone').get_matching_blocks()
[Match(a=1, b=7, size=1), Match(a=7, b=8, size=0)]

This has the result, that RapidFuzz tests the following alignments:

bentley <-> ne          -> normalized Levenshtein = 44.44444444444444
bentley <-> odafone  -> normalized Levenshtein = 28.57

While FuzzyWuzzy using difflib tests the following alignments:

bentley <-> ne          -> difflib.SequenceMatcher.ratio = 22
bentley <-> odafone  -> difflib.SequenceMatcher.ratio = 14

and FuzzyWuzzy using python-Levenshtein tests these alignments:

bentley <-> odafone  -> normalized Levenshtein = 29

The same applies to the second example. This broken implementation is already known for quite a while: https://github.com/seatgeek/fuzzywuzzy/issues/79, but apparently the results are good enough for SeatGeek and faster than the pure Python implementation.

wavenator commented 3 years ago

I appreciate your thorough and proficient explanation. It makes a lot of sense now.