seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0

how ratio in fuzzy-wuzzy calculated? #289

Open fatimamb opened 3 years ago

fatimamb commented 3 years ago

I am trying to understand how the score in fuzzy-wuzzy is calculated. So far I know that it depends on SequenceMatcher from the difflib package, and the difflib documentation (linked here) describes the score as follows:

Return a measure of the sequences’ similarity as a float in the range [0, 1].

Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T.
 Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.
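
The quoted formula can be checked by hand. The sketch below assumes Python 3 and uses only difflib; the variable names (M, T, manual_ratio) are picked for illustration. It sums the sizes of the matching blocks to get M, adds the two lengths to get T, and compares 2.0*M / T with ratio(). It also suggests why the factor 2.0 is there: each matched element is counted once in M but occurs in both sequences, so doubling M makes identical sequences score exactly 1.0.

```python
from difflib import SequenceMatcher

a, b = "private", "privateT"
# The first positional argument is isjunk, so pass None explicitly.
s = SequenceMatcher(None, a, b)

# M: total length of all matching blocks
# (the last block returned is a zero-length sentinel, so it adds nothing).
M = sum(block.size for block in s.get_matching_blocks())

# T: total number of elements in both sequences.
T = len(a) + len(b)

# Each match covers one element in a AND one in b, hence the factor 2.0.
manual_ratio = 2.0 * M / T

print("M =", M, "T =", T)
print("manual:", manual_ratio, "difflib:", s.ratio())
```

For this pair M is 7 ("private" matches in full) and T is 15, so both values come out as 14/15.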

My first question: what does the 2.0 refer to?

Also, get_opcodes reports tags such as equal, replace and delete.

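from difflib import SequenceMatcher

# The first positional argument is isjunk, so the strings go second and third.
s = SequenceMatcher(None, "private", "privateT")
for opcode in s.get_opcodes():
    print("%6s a[%d:%d] b[%d:%d]" % opcode)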

My second question: do any of these opcodes affect the ratio score?
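
One way to explore this is to rebuild the ratio from the opcodes themselves. This is a minimal sketch, assuming Python 3; equal_len and ratio_from_opcodes are names made up for the example. Only spans tagged equal add to M, while replace, delete and insert spans add nothing to M. Their characters are still counted in T, which is how they pull the score down.

```python
from difflib import SequenceMatcher

a, b = "private", "privateT"
s = SequenceMatcher(None, a, b)

opcodes = s.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print("%7s a[%d:%d] b[%d:%d]" % (tag, i1, i2, j1, j2))

# Only "equal" spans count as matches (M).
equal_len = sum(i2 - i1 for tag, i1, i2, j1, j2 in opcodes if tag == "equal")

# Non-equal spans still contribute their characters to T = len(a) + len(b).
ratio_from_opcodes = 2.0 * equal_len / (len(a) + len(b))

print(ratio_from_opcodes, s.ratio())
```

For this pair the opcodes are one equal span covering "private" and one insert span for the trailing "T".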

I have read some posts, such as the one here, that talk about the cost in edit distance. Is that cost considered in the fuzzy-wuzzy or difflib score?

thank you

MahmoudAliEng commented 3 years ago

As far as I know, FW uses the Levenshtein similarity ratio. You can find a more detailed explanation of its logic in this amazing article.
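
To make the comparison concrete, here is a rough pure-Python sketch of a Levenshtein-style ratio next to difflib's. It rests on an assumption that is not confirmed in this thread: that the ratio in question charges 1 for an insert or delete and 2 for a substitution, and is computed as (T - distance) / T. The helpers indel_distance and indel_ratio are hypothetical names written for this example; they are not part of fuzzywuzzy or any Levenshtein package.

```python
from difflib import SequenceMatcher


def indel_distance(a, b):
    """Edit distance with insert/delete cost 1 and substitution cost 2 (assumed weights)."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            sub_cost = 0 if ca == cb else 2
            curr.append(min(prev[j] + 1,              # delete ca
                            curr[j - 1] + 1,          # insert cb
                            prev[j - 1] + sub_cost))  # match or substitute
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Similarity in [0, 1] as (T - distance) / T, where T = len(a) + len(b)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - indel_distance(a, b)) / total


a, b = "private", "privateT"
lev_style = indel_ratio(a, b)
difflib_style = SequenceMatcher(None, a, b).ratio()
print(lev_style, difflib_style)
```

On this pair the two scores agree, since the only difference is one inserted character. They can differ on other inputs, because difflib picks matching blocks with a longest-block heuristic rather than computing an optimal alignment.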