seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0
9.21k stars, 874 forks

Feature to robustly handle token orderings #272

Open shbunder opened 4 years ago

shbunder commented 4 years ago

Hi,

I use fuzzywuzzy to match full names extracted from documents to names in a database. Disregarding token order is important for this matching goal. Typically I use fuzz.token_sort_ratio to obtain:

fuzz.token_sort_ratio("fuzzy wuzzy", "wuzzy fuzzy")
> 100

As the name suggests, this function sorts the individual tokens before comparing them. However, in several instances this gave undesirable results, because a mistake in the first letter of a token can change the sort order, e.g.

fuzz.token_sort_ratio("willy wonka", "willy zonka")
> 91
fuzz.token_sort_ratio("willy wonka", "willy vonka")
> 45
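To make the cause visible, here is a minimal sketch of the sorting step alone. The helper name sorted_tokens is hypothetical and only approximates the preprocessing (lowercase, split on whitespace, sort); it is not fuzzywuzzy's own code.

```python
def sorted_tokens(s):
    """Lowercase, split on whitespace, sort the tokens, and rejoin them."""
    return " ".join(sorted(s.lower().split()))


print(sorted_tokens("willy wonka"))  # willy wonka
print(sorted_tokens("willy zonka"))  # willy zonka
print(sorted_tokens("willy vonka"))  # vonka willy
```

With "zonka" the token order is unchanged, so only one character differs. With "vonka" the tokens swap places, so the two sorted strings no longer line up and the score drops sharply.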

To cope with this, I propose a robust token_sim_ratio function that orders the second list of tokens by their similarity to the tokens in the first list, rather than alphabetically. I have currently implemented a lightweight solution based on n-gram matching that is robust to mistakes in the first letter of a token.
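A rough sketch of how such a function could look is given below. It is not the implementation mentioned above. The details are assumptions made for illustration: character bigrams with a Dice coefficient, greedy token assignment, and difflib.SequenceMatcher standing in for fuzz.ratio so the example runs with the standard library only. The helper names bigrams and ngram_sim are hypothetical.

```python
from difflib import SequenceMatcher


def bigrams(token):
    """Return the set of character bigrams of a token."""
    return {token[i:i + 2] for i in range(len(token) - 1)}


def ngram_sim(a, b):
    """Dice coefficient over character bigrams, between 0.0 and 1.0."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        # Single-character tokens have no bigrams; fall back to equality.
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def token_sim_ratio(s1, s2):
    """Reorder the tokens of s2 to follow s1 by similarity, then score 0-100."""
    t1 = s1.lower().split()
    remaining = s2.lower().split()
    ordered = []
    for tok in t1:
        if not remaining:
            break
        # Greedy choice: take the unused token of s2 most similar to tok.
        best = max(remaining, key=lambda cand: ngram_sim(tok, cand))
        remaining.remove(best)
        ordered.append(best)
    ordered.extend(remaining)  # leftover tokens keep their original order
    a, b = " ".join(t1), " ".join(ordered)
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(token_sim_ratio("fuzzy wuzzy", "wuzzy fuzzy"))  # 100
print(token_sim_ratio("willy wonka", "vonka willy"))  # 91
```

Because "vonka" shares the bigrams "on", "nk" and "ka" with "wonka", it is aligned with "wonka" despite the wrong first letter. Note that this sketch is asymmetric (the first string dictates the order) and that greedy assignment may not find the best overall pairing when several tokens are similar.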

My question: is there a general appetite for such functionality, and if so, should I proceed with a PR for this feature?