Open funytan opened 4 years ago
I noticed that if you delete the leading space in the longer string, the expected score of 100 is achieved. I couldn't figure out why. If your goal is to measure similarity against a long string, stripping leading and trailing spaces might do the trick. PS: I am using the pure-Python SequenceMatcher, not python-Levenshtein.
fuzz.partial_ratio('home sweet home', 'home sweet home no.12, fsfaf, fsffs, fsdf fsdf, sf, sfs. jl. home sweet home df.df, sdfds, sdf. sdf, sdf sdf, sdf, sdfdf. df home sweet home no.12, fsdf, sdfd, sdf fdsf, sdf, sdf. jl. home sweet home, fdsf, sdf, sdf sdf, sdf, fdg. jl. home sweet home no.12, gfg, fg, fg fg, df, gfg. gg. df, df, df df, df df, df. df. home sweet home df.12, df, df, df df, df, df. df. home sweet home, df, df. df, df df, df, df. gg. df df. home sweet home, df, df. df, df df, df, df. df. home sweet home no.12, df, df, df df, df, df. df. home sweet home df.df, df, df, df df, df, df')
Out[32]: 100
@aniketcomps thanks! It works fine when deleting the leading space, but when I also remove the first word and the space after it, it fails again! Haha. I'm using the pure-Python SequenceMatcher as well.
fuzz.partial_ratio('home sweet home', 'sweet home no.12, fsfaf, fsffs, fsdf fsdf, sf, sfs. jl. home sweet home df.df, sdfds, sdf. sdf, sdf sdf, sdf, sdfdf. df home sweet home no.12, fsdf, sdfd, sdf fdsf, sdf, sdf. jl. home sweet home, fdsf, sdf, sdf sdf, sdf, fdg. jl. home sweet home no.12, gfg, fg, fg fg, df, gfg. gg. df, df, df df, df df, df. df. home sweet home df.12, df, df, df df, df, df. df. home sweet home, df, df. df, df df, df, df. gg. df df. home sweet home, df, df. df, df df, df, df. df. home sweet home no.12, df, df, df df, df, df. df. home sweet home df.df, df, df, df df, df, df')
Out[773]: 13
As I described here: https://github.com/seatgeek/fuzzywuzzy/issues/279 this is most likely caused by the automatic junk heuristic of difflib, which fuzzywuzzy does not deactivate.
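To make the junk heuristic concrete, here is a minimal sketch using only difflib from the standard library. The long string is a synthetic stand-in (an assumption, not the data from this thread), and `longest_match_size` is a hypothetical helper name. Per the difflib documentation, when the second sequence has at least 200 items, any item making up more than about 1% of it is treated as "popular" junk unless `autojunk=False` is passed. In a long string built from repeated text, every character of the short string can end up in that category.

```python
from difflib import SequenceMatcher

SHORT = "home sweet home"
# Synthetic stand-in for the long address string in this thread (an assumption,
# not the original data): 341 characters, starting with a space.
LONG_WITH_SPACE = " " + "home sweet home, " * 20
LONG_NO_SPACE = LONG_WITH_SPACE.lstrip()


def longest_match_size(short, long_text, autojunk):
    """Size of the longest block difflib finds between the two strings."""
    matcher = SequenceMatcher(None, short, long_text, autojunk=autojunk)
    match = matcher.find_longest_match(0, len(short), 0, len(long_text))
    return match.size


print(longest_match_size(SHORT, LONG_WITH_SPACE, autojunk=True))   # 0
print(longest_match_size(SHORT, LONG_WITH_SPACE, autojunk=False))  # 15
print(longest_match_size(SHORT, LONG_NO_SPACE, autojunk=True))     # 15
```

The third call would also explain why deleting the leading space helps: as far as I can tell from the CPython implementation, difflib still extends a match character by character from the very start of both strings, so an occurrence at position 0 is found even when all its characters are "popular". Once the long string starts with anything else, that rescue no longer applies.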
It is known that partial_ratio calculation yields incorrect results for some combinations of strings when it uses the python-Levenshtein SequenceMatcher https://github.com/seatgeek/fuzzywuzzy/issues/79#issue-58664443
However, after removing python-Levenshtein, fuzzywuzzy returns wrong scores for certain strings.
Interestingly enough, installing python-Levenshtein gives the correct score of 100.
This problem seems to happen when the comparison is made between a short string and a much longer one.
Has anyone faced this before?
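For anyone who wants to stay on the pure-Python path, one possible workaround is to compute a partial score yourself with `autojunk=False`. The sketch below is only loosely modeled on the matching-block idea and is not fuzzywuzzy's own code. `partial_ratio_no_autojunk` is a hypothetical name, and the scoring and rounding are my assumptions, so scores may differ slightly from the library's.

```python
from difflib import SequenceMatcher


def partial_ratio_no_autojunk(s1, s2):
    """Hypothetical helper: best 0-100 score of the shorter string against
    equally long windows of the longer one, with autojunk disabled."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    matcher = SequenceMatcher(None, shorter, longer, autojunk=False)
    best = 0.0
    for a_start, b_start, _size in matcher.get_matching_blocks():
        # Align a window of len(shorter) in the longer string with this block.
        window_start = max(b_start - a_start, 0)
        window = longer[window_start:window_start + len(shorter)]
        score = SequenceMatcher(None, shorter, window, autojunk=False).ratio()
        best = max(best, score)
    return int(round(100 * best))


# Synthetic long string (an assumption, not the data from this thread).
print(partial_ratio_no_autojunk("home sweet home", " " + "home sweet home, " * 20))  # 100
```

Disabling the heuristic can make difflib noticeably slower on long inputs, so this is a trade-off rather than a free fix.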