seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0

Faulty result of partial ratio (without python-Levenshtein) #264

Open funytan opened 4 years ago

funytan commented 4 years ago

It is known that the partial_ratio calculation yields incorrect results for some combinations of strings when it uses the python-Levenshtein SequenceMatcher: https://github.com/seatgeek/fuzzywuzzy/issues/79#issue-58664443

However, after uninstalling python-Levenshtein, fuzzywuzzy still returns an incorrect score for certain strings.

> fuzz.partial_ratio('home sweet home', ' home sweet home no.12, fsfaf, fsffs, fsdf fsdf, sf, sfs. jl. home sweet home df.df, sdfds, sdf. sdf, sdf sdf, sdf, sdfdf. df home sweet home no.12, fsdf, sdfd, sdf fdsf, sdf, sdf. jl. home sweet home, fdsf, sdf, sdf sdf, sdf, fdg. jl. home sweet home no.12, gfg, fg, fg fg, df, gfg. gg. df, df, df df, df df, df. df. home sweet home df.12, df, df, df df, df, df. df. home sweet home, df, df. df, df df, df, df. gg. df df. home sweet home, df, df. df, df df, df, df. df. home sweet home no.12, df, df, df df, df, df. df. home sweet home df.df, df, df, df df, df, df')

> 13

Interestingly enough, installing python-Levenshtein gives the correct score of 100.

This problem seems to happen when the comparison is made between a short and much longer string.

Has anyone faced this before?

aniketcomps commented 4 years ago

I noticed that if you delete the leading space in the longer string, the expected score of 100 is returned. I couldn't figure out why. If your goal is to compute similarity against a long string, stripping leading and trailing spaces might do the trick. PS: I am using the pure-Python SequenceMatcher, not python-Levenshtein.

fuzz.partial_ratio('home sweet home', 'home sweet home no.12, fsfaf, fsffs, fsdf fsdf, sf, sfs. jl. home sweet home df.df, sdfds, sdf. sdf, sdf sdf, sdf, sdfdf. df home sweet home no.12, fsdf, sdfd, sdf fdsf, sdf, sdf. jl. home sweet home, fdsf, sdf, sdf sdf, sdf, fdg. jl. home sweet home no.12, gfg, fg, fg fg, df, gfg. gg. df, df, df df, df df, df. df. home sweet home df.12, df, df, df df, df, df. df. home sweet home, df, df. df, df df, df, df. gg. df df. home sweet home, df, df. df, df df, df, df. df. home sweet home no.12, df, df, df df, df, df. df. home sweet home df.df, df, df, df df, df, df')

Out[32]: 100

funytan commented 4 years ago

@aniketcomps thanks! It works fine when I delete the leading space, but when I also remove the first word and the space after it, it fails again! Haha. I'm using the pure-Python SequenceMatcher as well.

fuzz.partial_ratio('home sweet home', 'sweet home no.12, fsfaf, fsffs, fsdf fsdf, sf, sfs. jl. home sweet home df.df, sdfds, sdf. sdf, sdf sdf, sdf, sdfdf. df home sweet home no.12, fsdf, sdfd, sdf fdsf, sdf, sdf. jl. home sweet home, fdsf, sdf, sdf sdf, sdf, fdg. jl. home sweet home no.12, gfg, fg, fg fg, df, gfg. gg. df, df, df df, df df, df. df. home sweet home df.12, df, df, df df, df, df. df. home sweet home, df, df. df, df df, df, df. gg. df df. home sweet home, df, df. df, df df, df, df. df. home sweet home no.12, df, df, df df, df, df. df. home sweet home df.df, df, df, df df, df, df')

Out[773]: 13

maxbachmann commented 4 years ago

As I described in https://github.com/seatgeek/fuzzywuzzy/issues/279 this is most likely caused by difflib's automatic junk heuristic, which fuzzywuzzy does not deactivate.
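
A minimal sketch of that heuristic, using only the standard library. When the second sequence has 200 or more items, difflib.SequenceMatcher treats items that occur in more than roughly 1% of positions as "popular" and ignores them unless autojunk=False is passed. The haystack below is a made-up stand-in for the long string in this issue, and longest_match is a hypothetical helper, not part of fuzzywuzzy. The sketch assumes the pure-Python partial_ratio passes the longer string as the second sequence.

```python
from difflib import SequenceMatcher

needle = "home sweet home"
# Hypothetical haystack: one leading space, then the needle repeated 20 times.
# Every character occurs at least 20 times in 341 characters, so difflib's
# autojunk heuristic marks all of them as "popular".
haystack = " " + "home sweet home, " * 20


def longest_match(a, b, autojunk):
    """Return difflib's longest matching block between a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=autojunk)
    # Explicit bounds keep this compatible with older Python 3 versions.
    return matcher.find_longest_match(0, len(a), 0, len(b))


with_heuristic = longest_match(needle, haystack, autojunk=True)
without_heuristic = longest_match(needle, haystack, autojunk=False)

print(with_heuristic.size)     # shorter than the needle
print(without_heuristic.size)  # the full needle length
```

With the heuristic active, difflib cannot anchor a match on any character of the needle, so the blocks that partial_ratio builds on miss the exact occurrence. A match starting at index 0 of both strings can still be picked up when difflib extends its (empty) best match forward, which would be consistent with the score of 100 reported once the leading space is deleted.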