rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License

How to calculate similarity score by using fuzz.ratio() #339

Closed Pathcharnee closed 1 year ago

Pathcharnee commented 1 year ago

I've seen the example below, but I don't understand how the result is derived. Can anyone show how the similarity score is calculated?

>>> from rapidfuzz import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
96.55172413793103

maxbachmann commented 1 year ago

fuzz.ratio is based on the normalized Indel similarity. The Indel distance only allows insertions and deletions, so it behaves like the Levenshtein distance with substitutions weighted as 2 (replacing a character counts as one deletion plus one insertion). For your example:

>>> from rapidfuzz.distance import Indel
>>> from rapidfuzz import fuzz

# only one insertion of !
>>> Indel.distance("this is a test", "this is a test!")
1

# distance / maximum, with maximum = len(s1) + len(s2) = 14 + 15 = 29
>>> Indel.normalized_distance("this is a test", "this is a test!")
0.034482758620689655

# 1.0 - normalized_distance
>>> Indel.normalized_similarity("this is a test", "this is a test!")
0.9655172413793104

# normalized_similarity * 100
>>> fuzz.ratio("this is a test", "this is a test!")
96.55172413793103
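
To make the steps above concrete, here is a minimal pure-Python sketch that reproduces the same arithmetic without importing rapidfuzz. It assumes the standard identity that the Indel distance equals len(s1) + len(s2) - 2 * LCS(s1, s2), where LCS is the length of the longest common subsequence. The names lcs_length, indel_distance and ratio_sketch are made up for this illustration and are not part of the RapidFuzz API, and the library's actual implementation is far more optimized than this.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                # Matching characters extend the common subsequence by one.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the best result seen so far.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(s1: str, s2: str) -> int:
    """Number of insertions and deletions needed to turn s1 into s2."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)


def ratio_sketch(s1: str, s2: str) -> float:
    """Normalized Indel similarity scaled to 0..100 (illustrative only)."""
    maximum = len(s1) + len(s2)
    if maximum == 0:
        # Assumption made for this sketch: two empty strings count as identical.
        return 100.0
    normalized_distance = indel_distance(s1, s2) / maximum
    return (1.0 - normalized_distance) * 100.0


print(indel_distance("this is a test", "this is a test!"))  # 1 (one insertion of "!")
print(ratio_sketch("this is a test", "this is a test!"))    # approximately 96.55
print(indel_distance("abc", "abd"))                         # 2 (a substitution costs 2)
```

The last line shows why a substitution is weighted as 2: turning "abc" into "abd" requires deleting "c" and inserting "d".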