Looking through the results, the result of fuzz.partial_ratio is off. It turns out the new bit-parallel LCS algorithm I use there does not work properly in some cases. I added some tests for the issue and switched back to the old implementation until I find the bug. I just released this fix as v1.0.2.
The results are now:

```python
>>> fuzz.partial_ratio("physics 2 vid", "study physics physics 2")
81.81818181818181
>>> fuzz.partial_ratio("physics 2 vid", "study physics physics 2 video")
100.0
```
For the first comparison, FuzzyWuzzy returns a different score when it is used with python-Levenshtein, since it picks a different alignment. RapidFuzz and FuzzyWuzzy (difflib) use

```
"physics 2 vid"
"physics 2"
```

which has a similarity of 82, while python-Levenshtein uses

```
"physics 2 vid"
"physics phys"
```

which has a similarity of 69. However, that's a bug in python-Levenshtein.
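To make the alignment search concrete: partial_ratio can be pictured as sliding the shorter string across the longer one and keeping the best fuzz.ratio among the windows. The brute-force sketch below is only an illustration (the real implementations derive candidate windows from matching blocks rather than trying every offset), but it reproduces the score above:

```python
from rapidfuzz import fuzz

def partial_ratio_sketch(s1: str, s2: str) -> float:
    # Illustration only: try every window of the longer string
    # (windows near the end are truncated by slicing).
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    best = 0.0
    for i in range(len(longer)):
        window = longer[i : i + len(shorter)]
        best = max(best, fuzz.ratio(shorter, window))
    return best

print(partial_ratio_sketch("physics 2 vid", "study physics physics 2"))
# 81.81818181818181 -- the best window is the trailing "physics 2"
```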
fuzz.partial_token_sort_ratio sorts the tokens and then uses partial_ratio:

1) calculates the partial_ratio of '2 physics vid' <-> '2 physics physics study'
2) calculates the partial_ratio of '2 physics vid' <-> '2 physics physics study video'

In both cases the optimal alignment is

```
'2 physics vid'
'2 physics phy'
```
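In code, assuming simple whitespace tokenization, the scorer is roughly:

```python
from rapidfuzz import fuzz

def partial_token_sort_ratio_sketch(s1: str, s2: str) -> float:
    # Sort the whitespace-separated tokens, rejoin, then apply partial_ratio.
    sorted1 = " ".join(sorted(s1.split()))
    sorted2 = " ".join(sorted(s2.split()))
    return fuzz.partial_ratio(sorted1, sorted2)

print(partial_token_sort_ratio_sketch("physics 2 vid", "study physics physics 2"))
# compares '2 physics vid' <-> '2 physics physics study'
```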
fuzz.partial_token_set_ratio is essentially 100 the moment there is a similar word in both sequences, as in your case '2' <-> '2' or 'physics' <-> 'physics'.
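A quick demonstration with your strings and a made-up pair:

```python
from rapidfuzz import fuzz

# Any shared token makes the sorted intersection a prefix of both
# combined strings, which partial_ratio then aligns perfectly.
print(fuzz.partial_token_set_ratio("physics 2 vid", "study physics physics 2"))  # 100.0
print(fuzz.partial_token_set_ratio("2 foo", "2 bar"))  # 100.0, the shared '2' is enough
```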
fuzz.token_set_ratio sorts, removes duplicate tokens, and then calculates the fuzz.ratio of several combinations based on the intersection between the token sets. One of those combinations depends only on string 1 and on which tokens appear in both strings. Since string 1 is the same in both cases and the same tokens (2, physics) appear in the second string, both comparisons yield the same similarity.
fuzz.token_sort_ratio sorts the tokens and then calculates the fuzz.ratio:

1) calculates the fuzz.ratio of '2 physics vid' <-> '2 physics physics study'
2) calculates the fuzz.ratio of '2 physics vid' <-> '2 physics physics study video'

Since the word video is no longer close enough to vid, it does not reduce the edit distance but requires additional insertions into string 1. Hence the similarity in case 1 is higher.
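As a sketch, again assuming whitespace tokenization:

```python
from rapidfuzz import fuzz

def token_sort_ratio_sketch(s1: str, s2: str) -> float:
    # Sort the tokens, rejoin, and compare the full strings with fuzz.ratio.
    return fuzz.ratio(" ".join(sorted(s1.split())),
                      " ".join(sorted(s2.split())))

print(token_sort_ratio_sketch("physics 2 vid", "study physics physics 2"))
print(token_sort_ratio_sketch("physics 2 vid", "study physics physics 2 video"))
# the first score is higher: 'video' mostly adds insertions instead of matching 'vid'
```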
Thank you very much for the detailed exposition. I still don't understand why fuzz.token_set_ratio behaves that way, though.
I ultimately switched to using fzf; it behaves pretty intuitively for my use case. Feel free to close the issue as you see fit.
token_set_ratio performs the following steps:

1) input strings:
   "physics 2 vid"
   "study physics physics 2"
2) splits the strings into words and treats them as sets -> removes duplicated words:
   {"physics", "2", "vid"}
   {"study", "physics", "2"}
3) finds the intersection and the differences between the sets:
   intersection = {"physics", "2"}
   diff1to2 = {"vid"}
   diff2to1 = {"study"}
4) joins those new sets in sorted order:
   sorted_sect = "2 physics"
   sorted_1to2 = "vid"
   sorted_2to1 = "study"
   combined_1to2 = sorted_sect + " " + sorted_1to2 = "2 physics vid"
   combined_2to1 = sorted_sect + " " + sorted_2to1 = "2 physics study"
5) calculates the following ratios:
   5.1) ratio(sorted_sect, combined_1to2) -> 81.81818181818181
   5.2) ratio(sorted_sect, combined_2to1) -> 75.0
   5.3) ratio(combined_1to2, combined_2to1) -> 78.57142857142857
6) returns the max result -> 81.81818181818181
In your case ratio(sorted_sect, combined_1to2) is the same for both string combinations.
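Putting those steps into code, a simplified sketch (assuming whitespace tokenization) that reproduces the numbers above:

```python
from rapidfuzz import fuzz

def token_set_ratio_sketch(s1: str, s2: str) -> float:
    tokens1, tokens2 = set(s1.split()), set(s2.split())
    sorted_sect = " ".join(sorted(tokens1 & tokens2))
    combined_1to2 = (sorted_sect + " " + " ".join(sorted(tokens1 - tokens2))).strip()
    combined_2to1 = (sorted_sect + " " + " ".join(sorted(tokens2 - tokens1))).strip()
    # Return the best of the three pairwise ratios.
    return max(fuzz.ratio(sorted_sect, combined_1to2),
               fuzz.ratio(sorted_sect, combined_2to1),
               fuzz.ratio(combined_1to2, combined_2to1))

print(token_set_ratio_sketch("physics 2 vid", "study physics physics 2"))        # 81.81818181818181
print(token_set_ratio_sketch("physics 2 vid", "study physics physics 2 video"))  # 81.81818181818181
# ratio(sorted_sect, combined_1to2) wins in both cases, and it is identical
# because sorted_sect and combined_1to2 do not change when 'video' is added.
```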
Is this behavior correct? Why is
"study physics physics 2 video"
not scored higher than
"study physics physics 2"
in all scorers except partial_ratio?