seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0

Using process.extract for Chinese returns wrong result #240

Open PalaChen opened 5 years ago

PalaChen commented 5 years ago

Using Python 2, for example:

`choices = [u"星球大战", u"5月4日星球大战", u"星球大戰", u"战大球星", u"星球大战游戏下"]`
`process.extract(u"星球大战", choices)`

`[(u'星球大战', 0), (u'5月4日星球大战', 0), (u'星球大戰', 0), (u'战大球星', 0), (u'星球大战游戏下', 0)]`

but `fuzz.ratio(u"星球大战", u"星球大战1")` returns 89.

maxbachmann commented 3 years ago

The default scorer selected by process.extract is fuzz.WRatio, which by default converts all non-ASCII characters to whitespace and trims it. So in your case you're comparing empty strings. Use this instead:

process.extract(u"星球大战", choices, scorer=fuzz.UWRatio)

or, since you mentioned fuzz.ratio:

process.extract(u"星球大战", choices, scorer=fuzz.ratio)
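
To see why every score in the original report is 0, here is a minimal standard-library sketch of the behavior described above. It does not use fuzzywuzzy itself: `ascii_only_preprocess` and `simple_ratio` are hypothetical stand-ins, and the rule that an empty string scores 0 is an assumption chosen to match the reported output, not the library's actual code.

```python
from difflib import SequenceMatcher


def ascii_only_preprocess(s):
    # Hypothetical stand-in for ASCII-forcing preprocessing:
    # replace every non-ASCII character with a space, then trim and lowercase.
    spaced = "".join(ch if ord(ch) < 128 else " " for ch in s)
    return spaced.strip().lower()


def simple_ratio(a, b):
    # Assumption: an empty string on either side scores 0,
    # mirroring the all-zero scores reported in this issue.
    if not a or not b:
        return 0
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


query = "星球大战"
choices = ["星球大战", "5月4日星球大战", "星球大戰", "战大球星", "星球大战游戏下"]

# With ASCII-forcing preprocessing the query collapses to "", so every score is 0.
with_ascii = [
    (c, simple_ratio(ascii_only_preprocess(query), ascii_only_preprocess(c)))
    for c in choices
]

# Without that preprocessing the characters are compared directly.
without_ascii = [(c, simple_ratio(query, c)) for c in choices]

print(ascii_only_preprocess(query) == "")  # True: nothing is left to compare
print(with_ascii)
print(without_ascii)
```

In this sketch the exact match scores 100 once the ASCII step is skipped, which is the effect of passing a scorer such as fuzz.UWRatio or fuzz.ratio to process.extract.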