seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0
9.21k stars, 874 forks

Feature Suggestion: sort order matches by common letter count, largest to smallest #280

Open robsomething opened 3 years ago

robsomething commented 3 years ago

I am noticing that when one term is a subset of another, some of my matches with partial_token_set_ratio come back with the non-optimal choice. When scores tie, the sort order should be decided by something that is independent of the order of the data, perhaps the total number of common tokens (or letters).

"Company" and "Company 1" have a score of 100.
"Company 1" and "Company 1" also have a score of 100.
It would seem that the second pairing is the better match.

from fuzzywuzzy import fuzz, process

query = 'Company 2'
choices = ['Company', 'Company 1', 'Company 2', 'Awesome Company']
process.extractOne(query, choices, scorer=fuzz.partial_token_set_ratio)

Out[72]: ('Company', 100)

The winner always seems to be the first in the list of choices. While one could sort both lists before calling the functions, that could introduce a different kind of bias: we would never match the appropriate choice when the shared tokens are in the middle of the choice string.

Similar behavior when using the partial_token_sort_ratio scorer.
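
To make the suggestion concrete, here is a rough sketch of a tie-break done outside the library, on a list of (choice, score) pairs. The helper names common_token_count and rank_with_tiebreak are made up for illustration and are not part of fuzzywuzzy. The scores are hard-coded on the assumption that every choice ties at 100, as in the example above. The length-difference key is an extra assumption on my part, added so that the remaining ties do not fall back to list order.

```python
from collections import Counter


def common_token_count(query, choice):
    """Count whitespace-separated tokens shared by both strings (case-insensitive)."""
    q = Counter(query.lower().split())
    c = Counter(choice.lower().split())
    # Counter & Counter keeps the minimum count of each shared token.
    return sum((q & c).values())


def rank_with_tiebreak(query, scored):
    """Order (choice, score) pairs by score, then shared tokens, then length difference."""
    return sorted(
        scored,
        key=lambda pair: (
            -pair[1],                               # higher score first
            -common_token_count(query, pair[0]),    # more shared tokens first
            abs(len(pair[0]) - len(query)),         # closer length first
        ),
    )


# Hypothetical scores: all choices assumed to tie at 100.
query = "Company 2"
scored = [
    ("Company", 100),
    ("Company 1", 100),
    ("Company 2", 100),
    ("Awesome Company", 100),
]
ranked = rank_with_tiebreak(query, scored)
best = ranked[0]
```

With real data, the scored list could come from process.extract(query, choices, scorer=fuzz.partial_token_set_ratio, limit=None), which returns (choice, score) pairs for a list of choices. Because the ordering depends only on the strings themselves, shuffling the choices should not change the winner.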