rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License

Details on Wratio #359

Closed lamaeldo closed 5 months ago

lamaeldo commented 7 months ago

Hello! Could you provide some detail on the calculations behind fuzz.WRatio? Is it a combined score of all the other ratio algorithms with even weights? Can the list of algorithms used be edited, or the weights changed? Thanks!

maxbachmann commented 7 months ago

In pseudo-code it does the following:

len_ratio = max(len1, len2) / min(len1, len2)
if len_ratio < 1.5:
    return max(
        ratio(s1, s2),
        token_ratio(s1, s2) * 0.95
    )
else:
    scale = 0.9 if len_ratio < 8.0 else 0.6
    return max(
        partial_ratio(s1, s2) * scale,
        partial_token_ratio(s1, s2) * 0.95 * scale
    )

This algorithm was originally created by Seatgeek in their fuzzywuzzy library and I do not know the reasoning behind the exact weights. Probably this simply worked well on their datasets.

You can't adjust the weight / algorithm selection. If you need different weights / algorithms there are a couple of options depending on your exact requirements:
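As an illustration only (not necessarily one of the options meant above), the pseudo-code can be reimplemented in user code with the weights exposed as parameters. The sketch below is a minimal version under stated assumptions: `make_weighted_ratio` and its keyword names are hypothetical, the four scorers are passed in as callables returning a score in the 0 to 100 range, and empty input is assumed to score 0.

```python
from typing import Callable

# A scorer takes two strings and returns a similarity in the range 0..100.
Scorer = Callable[[str, str], float]


def make_weighted_ratio(
    ratio: Scorer,
    token_ratio: Scorer,
    partial_ratio: Scorer,
    partial_token_ratio: Scorer,
    *,
    token_weight: float = 0.95,        # weight applied to the token-based scores
    partial_weight: float = 0.9,       # scale used when len_ratio < 8.0
    long_partial_weight: float = 0.6,  # scale used for very different lengths
) -> Scorer:
    """Build a WRatio-like scorer with adjustable weights (hypothetical helper)."""

    def weighted_ratio(s1: str, s2: str) -> float:
        # Assumption: empty input scores 0, which also avoids dividing by zero.
        if not s1 or not s2:
            return 0.0
        len_ratio = max(len(s1), len(s2)) / min(len(s1), len(s2))
        if len_ratio < 1.5:
            return max(ratio(s1, s2), token_ratio(s1, s2) * token_weight)
        scale = partial_weight if len_ratio < 8.0 else long_partial_weight
        return max(
            partial_ratio(s1, s2) * scale,
            partial_token_ratio(s1, s2) * token_weight * scale,
        )

    return weighted_ratio
```

With RapidFuzz installed, the existing scorers `fuzz.ratio`, `fuzz.token_ratio`, `fuzz.partial_ratio` and `fuzz.partial_token_ratio` could be passed in, for example `make_weighted_ratio(fuzz.ratio, fuzz.token_ratio, fuzz.partial_ratio, fuzz.partial_token_ratio, token_weight=0.9)`. The result follows the pseudo-code above, so it may not match `fuzz.WRatio` exactly in every case.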

lamaeldo commented 7 months ago

Thanks for your response, it makes sense. Among other things, I am interested in testing out Jaro-Winkler as a replacement for the indel distance. Would changing all calls from indel to Jaro-Winkler in the ratio functions be enough?

maxbachmann commented 7 months ago

> Would changing all calls from indel to Jaro-Winkler in the ratio functions be enough?

Changing them where?

lamaeldo commented 6 months ago

In fuzz_py.py. But by the looks of it, some Indel functions have no equivalent in Jaro-Winkler. Do you think it is feasible to write Jaro-Winkler equivalents for them, so as to have an implementation of WRatio based on Jaro-Winkler (or other algorithms, for that matter)?

maxbachmann commented 6 months ago

One difference is that the Indel distance is a count of edit operations, while the Jaro-Winkler similarity is always normalized. The whole fuzz module operates on the normalized similarity though, so this wouldn't be an issue there.
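To make the distinction between a raw edit count and a normalized similarity concrete, here is a small pure-Python sketch. It assumes the usual definition of the Indel distance (insertions and deletions only, so `len(s1) + len(s2) - 2 * LCS`) and normalizes by the combined length. The function names are illustrative and are not RapidFuzz's own implementation.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        cur = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(s1: str, s2: str) -> int:
    """Raw count of insertions and deletions needed to turn s1 into s2."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)


def indel_normalized_similarity(s1: str, s2: str) -> float:
    """Similarity in [0, 1]; two empty strings are treated as identical."""
    total = len(s1) + len(s2)
    if total == 0:
        return 1.0
    return 1.0 - indel_distance(s1, s2) / total
```

For `"abc"` and `"abd"` the distance is 2 (delete `c`, insert `d`), while the normalized similarity is `1 - 2/6`. A metric that is already normalized, such as Jaro-Winkler, would only need to take the place of the second kind of function.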

There are a couple of things to keep in mind when replacing the underlying implementation though:
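As a rough illustration of what replacing the underlying metric could look like, the sketch below builds two fuzz-style helpers on top of a pluggable similarity function. Everything here is an assumption made for illustration: the helper names are hypothetical, the similarity is expected to return a value in [0, 1], and the partial variant is a naive sliding window rather than the optimized algorithm RapidFuzz uses.

```python
from typing import Callable

# Any normalized similarity: two strings in, a value in [0, 1] out.
Similarity = Callable[[str, str], float]


def naive_partial_similarity(s1: str, s2: str, similarity: Similarity) -> float:
    """Best score of the shorter string against same-length windows of the longer one."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0.0
    window = len(shorter)
    return max(
        similarity(shorter, longer[i:i + window])
        for i in range(len(longer) - window + 1)
    )


def token_sort_similarity(s1: str, s2: str, similarity: Similarity) -> float:
    """Compare the strings after sorting their whitespace-separated tokens."""
    sorted1 = " ".join(sorted(s1.split()))
    sorted2 = " ".join(sorted(s2.split()))
    return similarity(sorted1, sorted2)
```

With RapidFuzz installed, `JaroWinkler.normalized_similarity` from `rapidfuzz.distance` could be passed as `similarity`. Whether the resulting scores are useful is a separate question, since the weights discussed earlier were chosen with the Indel-based ratio in mind.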