rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.72k stars · 119 forks

Token set scorers behave weirdly #76

Closed NightMachinery closed 3 years ago

NightMachinery commented 3 years ago

Is this behavior correct? Why is "study physics physics 2 video" not scored higher than "study physics physics 2" in all scorers except partial_ratio?

In [11]: fuzz.partial_token_sort_ratio("physics 2 vid", "study physics physics 2")
    ...: fuzz.partial_token_sort_ratio("physics 2 vid", "study physics physics 2 video")
    ...: fuzz.partial_token_set_ratio("physics 2 vid", "study physics physics 2")
    ...: fuzz.partial_token_set_ratio("physics 2 vid", "study physics physics 2 video")
    ...: fuzz.token_set_ratio("physics 2 vid", "study physics physics 2")
    ...: fuzz.token_set_ratio("physics 2 vid", "study physics physics 2 video")
    ...: fuzz.token_sort_ratio("physics 2 vid", "study physics physics 2")
    ...: fuzz.token_sort_ratio("physics 2 vid", "study physics physics 2 video")
    ...: fuzz.partial_ratio("physics 2 vid", "study physics physics 2")
    ...: fuzz.partial_ratio("physics 2 vid", "study physics physics 2 video")
Out[11]: 76.92307692307692
Out[11]: 76.92307692307692
Out[11]: 100.0
Out[11]: 100.0
Out[11]: 81.81818181818181
Out[11]: 81.81818181818181
Out[11]: 66.66666666666666
Out[11]: 61.904761904761905
Out[11]: 78.26086956521739
Out[11]: 92.3076923076923
maxbachmann commented 3 years ago

Looking through the results, the result of fuzz.partial_ratio is wrong. It turns out the new bit-parallel LCS algorithm I use there does not work properly in some cases. I added tests for the issue and reverted to the old implementation until I find the bug. I just released this fix as v1.0.2.

The results are now

fuzz.partial_ratio("physics 2 vid", "study physics physics 2")
81.81818181818181
fuzz.partial_ratio("physics 2 vid", "study physics physics 2 video")
100.0

For the first one, FuzzyWuzzy returns a different score when it is used with python-Levenshtein, since that uses a different alignment. RapidFuzz and FuzzyWuzzy (difflib) use

"physics 2 vid"
"physics 2"

which has a similarity of 82 while python-Levenshtein uses

"physics 2 vid"
"physics phys"

which has a similarity of 69. However, that is a bug in python-Levenshtein.
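The difflib alignment score quoted above can be checked directly. This is just a sketch using the standard library's SequenceMatcher, whose ratio() normalizes as 2*M/T (approximating fuzz.ratio's 2*LCS/T); it is not RapidFuzz's actual implementation:

```python
from difflib import SequenceMatcher

# Score of the alignment RapidFuzz/FuzzyWuzzy (difflib) pick:
# "physics 2 vid" against the window "physics 2".
score = SequenceMatcher(None, "physics 2 vid", "physics 2").ratio() * 100
print(round(score, 2))  # ~81.82, the "similarity of 82" mentioned above
```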

Here are the explanations for the other results:

fuzz.partial_token_sort_ratio

The algorithm sorts the tokens and then uses partial_ratio:

1) calculates the partial_ratio of '2 physics vid' <-> '2 physics physics study'
2) calculates the partial_ratio of '2 physics vid' <-> '2 physics physics study video'

In both cases the optimal alignment is

'2 physics vid'
'2 physics phy'
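As a rough illustration, here is a sketch of that pipeline: sort the tokens, then take the best window score. The helper names (partial_ratio_sketch, partial_token_sort_sketch) are hypothetical, and the naive fixed-length sliding window only approximates RapidFuzz's real partial_ratio:

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1, s2):
    # Naive stand-in for fuzz.partial_ratio: best ratio of the shorter
    # string against every same-length window of the longer one.
    # (RapidFuzz's actual algorithm is more sophisticated.)
    short, long_ = sorted((s1, s2), key=len)
    return max(
        SequenceMatcher(None, short, long_[i:i + len(short)]).ratio() * 100
        for i in range(len(long_) - len(short) + 1)
    )

def partial_token_sort_sketch(s1, s2):
    # Sort the tokens of each string, then apply the partial ratio.
    return partial_ratio_sketch(
        " ".join(sorted(s1.split())),
        " ".join(sorted(s2.split())),
    )

print(partial_token_sort_sketch("physics 2 vid", "study physics physics 2"))
```

The best window here is '2 physics phy', matching the alignment above.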

fuzz.partial_token_set_ratio

fuzz.partial_token_set_ratio is essentially 100 the moment there is a word common to both sequences, as in your case '2' <-> '2' or 'physics' <-> 'physics'
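A quick way to see why it saturates: the sorted intersection of the token sets is, by construction, a prefix of the combined strings it is compared against, so the best partial alignment is an exact match. A minimal sketch of that construction (set operations only, not RapidFuzz's code):

```python
# Token sets of both inputs from the example above.
tokens1 = set("physics 2 vid".split())
tokens2 = set("study physics physics 2".split())

# Sorted intersection and one combined string, as in partial_token_set_ratio.
sect = " ".join(sorted(tokens1 & tokens2))                    # "2 physics"
combined = sect + " " + " ".join(sorted(tokens1 - tokens2))   # "2 physics vid"

# `sect` is a prefix of `combined`, so any partial-ratio comparison of the
# two finds a perfect window -> score 100 whenever the intersection is
# non-empty.
print(sect, "|", combined)
assert combined.startswith(sect)
```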

fuzz.token_set_ratio

fuzz.token_set_ratio sorts the tokens, removes duplicates, and then calculates fuzz.ratio over several combinations based on the intersection of the token sets. One of those combinations depends only on string 1 and the tokens shared by both strings. Since string 1 is the same in both cases and the same tokens (2, physics) appear in the second string, both comparisons have the same similarity.

fuzz.token_sort_ratio

fuzz.token_sort_ratio sorts the tokens and then calculates fuzz.ratio:

1) calculates the ratio of '2 physics vid' <-> '2 physics physics study'
2) calculates the ratio of '2 physics vid' <-> '2 physics physics study video'

Since 'video' is no longer close enough to 'vid', it does not reduce the edit distance, but requires additional insertions into string 1. Hence the similarity in case 1 is higher.
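This effect is easy to reproduce with a sketch: sort the tokens, then take a plain ratio. The helper name (token_sort_sketch) is hypothetical and difflib's 2*M/T ratio stands in for fuzz.ratio, but it shows the same ordering as the scores in the question:

```python
from difflib import SequenceMatcher

def token_sort_sketch(s1, s2):
    # Sketch of token_sort_ratio: sort the tokens of each string,
    # then compare the full sorted strings with a plain ratio.
    sa = " ".join(sorted(s1.split()))
    sb = " ".join(sorted(s2.split()))
    return SequenceMatcher(None, sa, sb).ratio() * 100

# The longer second string scores lower: 'video' no longer lines up with
# 'vid', it only adds characters that must be inserted.
print(token_sort_sketch("physics 2 vid", "study physics physics 2"))
print(token_sort_sketch("physics 2 vid", "study physics physics 2 video"))
```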

NightMachinery commented 3 years ago

Thank you very much for the detailed exposition. I still don't understand why fuzz.token_set_ratio behaves that way, though.

I ultimately switched to using fzf, it behaves pretty intuitively for my use case. Feel free to close the issue as you see fit.

maxbachmann commented 3 years ago

token_set_ratio performs the following steps:

1) Input strings:

"physics 2 vid"
"study physics physics 2"

2) splits the strings into words and treats them as sets -> removes duplicated words:

{"physics", "2", "vid"}
{"study", "physics", "2"}

3) find the intersection and differences between the sets:

intersection = {"physics", "2"}
diff1to2 = {"vid"}
diff2to1 = {"study"}

4) join those new sets in a sorted way:

sorted_sect = "2 physics"
sorted_1to2 = "vid"
sorted_2to1 = "study"

combined_1to2 = sorted_sect + " " + sorted_1to2 = "2 physics vid"
combined_2to1 = sorted_sect + " " + sorted_2to1 = "2 physics study"

5) calculate the following distances:

5.1) ratio(sorted_sect, combined_1to2) -> 81.81818181818181
5.2) ratio(sorted_sect, combined_2to1) -> 75.0
5.3) ratio(combined_1to2, combined_2to1) -> 78.57142857142857

6) return the max result -> 81.81818181818181

In your case ratio(sorted_sect, combined_1to2) is the same for both string combinations.
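The six steps above can be sketched in a few lines; the helper name (token_set_sketch) is hypothetical and difflib's ratio stands in for fuzz.ratio, but it reproduces the numbers in step 5:

```python
from difflib import SequenceMatcher

def token_set_sketch(s1, s2):
    # Steps 1-2: split into words, deduplicate via sets.
    t1, t2 = set(s1.split()), set(s2.split())
    # Steps 3-4: sorted intersection, differences, and combined strings.
    sorted_sect = " ".join(sorted(t1 & t2))
    combined_1to2 = (sorted_sect + " " + " ".join(sorted(t1 - t2))).strip()
    combined_2to1 = (sorted_sect + " " + " ".join(sorted(t2 - t1))).strip()

    def ratio(a, b):
        return SequenceMatcher(None, a, b).ratio() * 100

    # Steps 5-6: three pairwise ratios, return the maximum.
    return max(
        ratio(sorted_sect, combined_1to2),
        ratio(sorted_sect, combined_2to1),
        ratio(combined_1to2, combined_2to1),
    )

print(token_set_sketch("physics 2 vid", "study physics physics 2"))
```

Running it on both second strings from the question gives the same maximum, since ratio(sorted_sect, combined_1to2) dominates in both cases.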