Details on Wratio - Githubissues

lamaeldo commented 7 months ago

Hello! Could you provide some detail on the calculations behind fuzz.WRatio? Is it a combined score of all the other ratio algorithms with even weights? Can the list of algorithms used be edited, or the weights changed? Thanks!

maxbachmann commented 7 months ago

In pseudo-code it does the following:

len_ratio = max(len1, len2) / min(len1, len2)
if len_ratio < 1.5:
    return max(
        ratio(s1, s2),
        token_ratio(s1, s2) * 0.95
    )
else:
   scale = 0.9 if len_ratio < 8.0 else 0.6
    return max(
        partial_ratio(s1, s2) * scale,
        partial_token_ratio(s1, s2) * 0.95 * scale
    )

This algorithm was originally created by Seatgeek in their fuzzywuzzy library and I do not know the reasoning behind the exact weights. Probably this simply worked well on their datasets.

You can't adjust the weight / algorithm selection. If you need different weights / algorithms there are a couple of options depending on your exact requirements:

does it have to be a scorer supported by the process module?
what's the value range it returns? 0-100 with 100 being the optimal score or something else?
is performance a major concern?

lamaeldo commented 7 months ago

thanks for your response, i makes sense. Among others, I am interested in testing out Jaro-Winkler as a replacement for the indel distance. Would changing all calls from indel to Jaro-Winkler in the ratio functions be enough?

maxbachmann commented 7 months ago

Would changing all calls from indel to Jaro-Winkler in the ratio functions be enough?

Changing them where?

lamaeldo commented 6 months ago

in fuzz_py.py . But by the looks of it, some Indel functions have no equivalent in Jaro Winkler. Do you think it is feasible to rewrite their equivalents for Jaro-Winkler, to have an implementation of Wratio with Jaro-Winkler (or other algorithms, for that matter)?

maxbachmann commented 6 months ago

As a difference the Indel distance is a count of edit operations, while the Jaro-Winkler similarity is always normalized. The whole fuzz module operates on the normalized similarity though, so this wouldn't be an issue there.

There are a couple of things to keep in mind when replacing the underlying implementation though:

fuzz_py.py is just a pure Python fallback for platforms where the C extension fails to compile. Users would usually use an accellerated version implemented in C++
The metrics often use optimizations based on the knowledge of the Indel distance. E.g. in partial_ratio some of the substrings can be skipped, since it's know that they can't be an optimal match. token_set_ratio and partial_token_set_ratio have similar optimizations. These assumptions might not hold for different metrics, so they would need to be reevaluated.
Sometimes internal APIs for the indel distance are used, which are able to cache part of the calculation. These are probably the functions you did not find an equivalent for. The algorithms can be implemented completely without them, since they are just used as a performance optimization.
Whether replacing the metric actually improves the quality of matches is obviously use case dependent and would need to be evaluated on your datasets

rapidfuzz / RapidFuzz

Details on Wratio #359