rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.61k stars, 116 forks

partial ratio output unexpected results #400

Closed rocke2020 closed 1 month ago

rocke2020 commented 1 month ago

rapidfuzz 3.9.6, Python 3.10. With a = '34cdef16z' and c = '09cdef78', the common substring of the two strings is "cdef". I assumed the partial ratio would be the length of "cdef" divided by the length of the shorter input string, which gives a partial ratio of 0.5. Instead the result is 57.14. Could you explain how and why 57.14 is calculated? Thanks! I know about the Jaccard similarity ratio, but I need a partial edit distance, so I prefer the "partial ratio" from rapidfuzz.

from icecream import ic
from rapidfuzz import fuzz

a = '34cdef16z'
c = '09cdef78'
ic(fuzz.partial_ratio(a, c))
ic(fuzz.partial_token_sort_ratio(a, c))
ic(fuzz.partial_token_ratio(a, c))
ic(fuzz.partial_ratio_alignment(a, c))

rocke2020 commented 1 month ago

https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#rapidfuzz.fuzz.partial_ratio I did read this doc, but I still don't understand where 57.14 comes from.

print(fuzz.partial_ratio('34cdef16z', '09cdef78'))

maxbachmann commented 1 month ago

fuzz.partial_ratio slides a window the length of the shorter string across the longer string. For each window it calculates fuzz.ratio, and it returns the score of the alignment with the highest similarity. These substrings/windows in the longer string can never be longer than the shorter string. However, they may be shorter if they are placed at the start or end of the longer string. fuzz.partial_ratio_alignment returns the alignment that was used, which helps in understanding the score. In your example this returns:

>>> fuzz.partial_ratio_alignment(a, c)
ScoreAlignment(score=57.14285714285714, src_start=0, src_end=6, dest_start=0, dest_end=8)

So the alignment used is:

>>> fuzz.ratio(a[0:6], c)
57.14285714285714
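To make the number concrete: fuzz.ratio is the normalized Indel similarity, which works out to 100 * 2 * LCS / (len1 + len2), where LCS is the length of the longest common subsequence. For a[0:6] = '34cdef' and c = '09cdef78' the LCS is "cdef" (4 characters), so the score is 100 * 8 / 14 = 57.14. The full-length window a[0:8] = '34cdef16' has the same LCS but a longer total length, giving 100 * 8 / 16 = 50, which is why the shorter window at the start wins. Below is a rough pure-Python sketch of the sliding-window idea described above. The function names (lcs_length, indel_ratio, naive_partial_ratio) are made up for illustration, the brute-force loop is not how rapidfuzz is implemented internally, and edge cases such as empty strings or tie-breaking are not meant to match the library.

```python
def lcs_length(s1, s2):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(s1, s2):
    """Normalized Indel similarity on a 0-100 scale: 100 * 2 * LCS / (len1 + len2)."""
    total = len(s1) + len(s2)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(s1, s2) / total


def naive_partial_ratio(s1, s2):
    """Brute-force sketch: score every window of the longer string.

    Windows are at most len(shorter) long; they are shorter only where
    they hang over the start or end of the longer string.
    Returns (best_score, window_start, window_end) in the longer string.
    """
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    m, n = len(shorter), len(longer)
    if m == 0:
        # Arbitrary choice for this sketch, not necessarily the library's behavior.
        return (0.0, 0, 0)
    best = (0.0, 0, 0)
    for i in range(-(m - 1), n):
        start, end = max(0, i), min(n, i + m)
        score = indel_ratio(longer[start:end], shorter)
        if score > best[0]:
            best = (score, start, end)
    return best


a = '34cdef16z'
c = '09cdef78'
score, start, end = naive_partial_ratio(a, c)
print(round(score, 2), start, end)           # 57.14 0 6
print(round(indel_ratio(a[0:8], c), 2))      # 50.0
```

Note that a window like a[2:6] = 'cdef' would score higher (100 * 8 / 12 = 66.67), but it is never considered: a window shorter than the shorter string is only allowed when it is cut off by the start or end of the longer string.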