seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0
9.2k stars 878 forks source link

What is the max possible value (upper bound) for fuzz.ratio? #310

Closed sillybun closed 3 years ago

sillybun commented 3 years ago

It would be helpful to know what is the max possible value (upper bound) for:

$fuzz.ratio(sx, sy)$

where the length of $sx$ is $x$ and length of $sy$ is $y$ (x <= y).

It seems that $100 * \sqrt(x / y)$ is a roughly approximation.

maxbachmann commented 3 years ago

fuzz.ratio is a normalized version of the InDel-Distance (similar to Levenshtein but without Substitutions) scaled to the range 0-100:

round(100 * (1 - InDelDist / (len1 + len2)))

so the upper bound is 100

maxbachmann commented 3 years ago

Rereading your question I think you might mean a length based similarity score which is a upper bound for the similarity. Both for Levenshtein and InDel Distance the distance between two strings is at least the length difference, so in your example with len1 <= len2 the upper bound can be calculated as:

100 * (1 - (len2 - len1) / (len1 + len2))
sillybun commented 3 years ago

Thanks very much! It helps a lot!

maxbachmann commented 3 years ago

@sillybun btw I use this as early exit condition in RapidFuzz when a score_cutoff argument is provided to the function ;)