Closed: lamaeldo closed this issue 5 months ago
In pseudo-code it does the following:
    len_ratio = max(len1, len2) / min(len1, len2)
    if len_ratio < 1.5:
        return max(
            ratio(s1, s2),
            token_ratio(s1, s2) * 0.95
        )
    else:
        scale = 0.9 if len_ratio < 8.0 else 0.6
        return max(
            partial_ratio(s1, s2) * scale,
            partial_token_ratio(s1, s2) * 0.95 * scale
        )
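For anyone who wants to experiment with other weights, the pseudo-code above could be reimplemented roughly as follows. This is only a sketch: the name `weighted_ratio`, its keyword parameters, and the choice to return 0 for empty input are my own assumptions and not part of the library. The four scorers are passed in as callables, so any metric on a 0-100 scale can be plugged in.

```python
from typing import Callable

# A scorer takes two strings and returns a similarity on a 0-100 scale.
Scorer = Callable[[str, str], float]


def weighted_ratio(
    s1: str,
    s2: str,
    *,
    ratio: Scorer,
    token_ratio: Scorer,
    partial_ratio: Scorer,
    partial_token_ratio: Scorer,
    token_weight: float = 0.95,        # weight from the pseudo-code
    partial_scale: float = 0.9,        # scale when len_ratio < long_cutoff
    long_partial_scale: float = 0.6,   # scale for very different lengths
    len_cutoff: float = 1.5,
    long_cutoff: float = 8.0,
) -> float:
    """Hypothetical re-implementation of the pseudo-code with tunable weights."""
    # Assumption: empty input scores 0, which also avoids dividing by zero.
    if not s1 or not s2:
        return 0.0

    len1, len2 = len(s1), len(s2)
    len_ratio = max(len1, len2) / min(len1, len2)

    if len_ratio < len_cutoff:
        # Similar lengths: compare the full strings.
        return max(ratio(s1, s2), token_ratio(s1, s2) * token_weight)

    # Different lengths: fall back to the partial scorers, scaled down.
    scale = partial_scale if len_ratio < long_cutoff else long_partial_scale
    return max(
        partial_ratio(s1, s2) * scale,
        partial_token_ratio(s1, s2) * token_weight * scale,
    )
```

With RapidFuzz installed, passing `fuzz.ratio`, `fuzz.token_ratio`, `fuzz.partial_ratio` and `fuzz.partial_token_ratio` from `rapidfuzz.fuzz` as the four scorers should follow the pseudo-code with the original weights. I would not expect it to match `fuzz.WRatio` in every edge case.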
This algorithm was originally created by Seatgeek in their fuzzywuzzy library and I do not know the reasoning behind the exact weights. Probably this simply worked well on their datasets.
You can't adjust the weight / algorithm selection. If you need different weights / algorithms there are a couple of options depending on your exact requirements:
Thanks for your response, it makes sense. Among others, I am interested in testing out Jaro-Winkler as a replacement for the Indel distance. Would changing all calls from Indel to Jaro-Winkler in the ratio functions be enough?
> Would changing all calls from indel to Jaro-Winkler in the ratio functions be enough?
Changing them where?
In fuzz_py.py. But by the looks of it, some Indel functions have no equivalent in Jaro-Winkler. Do you think it is feasible to write equivalents for Jaro-Winkler, to have an implementation of WRatio with Jaro-Winkler (or other algorithms, for that matter)?
One difference is that the Indel distance is a count of edit operations, while the Jaro-Winkler similarity is always normalized. The whole fuzz module operates on the normalized similarity though, so this wouldn't be an issue there.
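To make the distinction concrete, here is a small pure-Python sketch of the Indel distance and its normalized similarity, computed through the longest common subsequence. The function names are illustrative and this is not the library's implementation, which is far more optimized. The distance is a raw count of insertions and deletions, while the normalized similarity lies between 0 and 1, the same range a Jaro-Winkler similarity uses.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        cur = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(s1: str, s2: str) -> int:
    """Number of insertions and deletions needed to turn s1 into s2."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)


def indel_normalized_similarity(s1: str, s2: str) -> float:
    """Indel similarity scaled to the range 0..1."""
    total = len(s1) + len(s2)
    if total == 0:
        # Assumption: two empty strings count as identical.
        return 1.0
    return 1.0 - indel_distance(s1, s2) / total


print(indel_distance("lewenstein", "levenshtein"))               # raw edit count
print(indel_normalized_similarity("lewenstein", "levenshtein"))  # value in 0..1
```

Multiplying the normalized similarity by 100 gives a score on the 0-100 scale the fuzz functions use, so a replacement metric would need the same scaling.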
There are a couple of things to keep in mind when replacing the underlying implementation though:
- fuzz_py.py is just a pure Python fallback for platforms where the C extension fails to compile. Users would usually use an accelerated version implemented in C++.
- In partial_ratio, some of the substrings can be skipped, since it's known that they can't be an optimal match. token_set_ratio and partial_token_set_ratio have similar optimizations. These assumptions might not hold for different metrics, so they would need to be reevaluated.
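As a starting point for experiments with another metric, a partial ratio could be written without any of those skips, so that it makes no assumption about the metric. The sketch below is deliberately naive and is not how the library implements partial_ratio: it only compares the shorter string against every full-length window of the longer one. The names `naive_partial_ratio` and `demo_similarity` are hypothetical, and the `difflib`-based similarity is just a stand-in for whatever normalized metric (for example Jaro-Winkler) you want to test.

```python
from difflib import SequenceMatcher
from typing import Callable


def demo_similarity(a: str, b: str) -> float:
    """Stand-in metric returning a value in 0..1 (not Indel or Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def naive_partial_ratio(
    s1: str, s2: str, similarity: Callable[[str, str], float]
) -> float:
    """Best score of the shorter string against each window of the longer one."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        # Assumption: an empty string matches nothing.
        return 0.0

    window_len = len(shorter)
    best = 0.0
    # No windows are skipped, so this stays valid for any metric, at the
    # cost of one similarity call per window.
    for start in range(len(longer) - window_len + 1):
        window = longer[start:start + window_len]
        best = max(best, similarity(shorter, window))
        if best == 1.0:
            break  # a perfect match cannot be improved
    return 100.0 * best


print(naive_partial_ratio("test", "this is a test!", demo_similarity))
```

This is slow for long strings, which is exactly why the real implementations skip windows. Any such shortcut would have to be justified again for the metric being swapped in.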
Hello! Could you provide some detail on the calculations behind fuzz.WRatio? Is it a combined score of all the other ratio algorithms with even weights? Can the list of algorithms used be edited, or the weights changed? Thanks!