implemented token_sim_ratio() function with cosine similarity

seatgeek / fuzzywuzzy

Fuzzy String Matching in Python

http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/

GNU General Public License v2.0


implemented token_sim_ratio() function with cosine similarity #296

Open Exquisition opened 3 years ago

Exquisition commented 3 years ago

Implemented solution to the following issue: https://github.com/seatgeek/fuzzywuzzy/issues/272

token_sim_ratio(s1, s2, ...) handles the issues that fuzz.token_sort_ratio(s1, s2, ...) introduces by lexicographically sorting the tokens of the second string. The similarity is calculated using cosine similarity; other similarity measures (the built-in Levenshtein, Jaro-Winkler, etc.) could be integrated easily.
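To make the idea concrete, here is a rough, self-contained sketch of a similarity sort. It is not the code from this PR: the helper names (`cosine_similarity`, `similarity_sort`, `token_sim_ratio_sketch`) are made up for illustration, the cosine similarity is assumed to be taken over per-token character counts, the matching is a simple greedy pass, and `difflib.SequenceMatcher` stands in for `fuzz.ratio` so the snippet runs without fuzzywuzzy installed.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def cosine_similarity(a, b):
    """Cosine similarity between the character-count vectors of two tokens."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm = math.sqrt(
        sum(v * v for v in ca.values()) * sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def similarity_sort(tokens1, tokens2):
    """Reorder tokens2 so each token lines up with its closest token in tokens1.

    Greedy: walk tokens1 in order and pull the best remaining match from tokens2.
    Unmatched tokens from tokens2 keep their relative order at the end.
    """
    remaining = list(tokens2)
    ordered = []
    for t1 in tokens1:
        if not remaining:
            break
        best = max(remaining, key=lambda t2: cosine_similarity(t1, t2))
        ordered.append(best)
        remaining.remove(best)
    return ordered + remaining


def token_sim_ratio_sketch(s1, s2):
    """Hypothetical stand-in for token_sim_ratio: align s2 to s1, then compare."""
    tokens1 = s1.lower().split()
    tokens2 = s2.lower().split()
    aligned = similarity_sort(tokens1, tokens2)
    # SequenceMatcher is used here in place of fuzz.ratio (an assumption).
    ratio = SequenceMatcher(None, " ".join(tokens1), " ".join(aligned)).ratio()
    return int(round(100 * ratio))
```

With a typo in a token's first letter, e.g. "apple banana" against "banana zpple", a lexicographic sort leaves "banana" facing "apple", whereas `similarity_sort` pairs "zpple" with "apple" before the strings are compared.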

nol13 commented 3 years ago

Love the idea! It addresses one of the main cases where you would get sub-optimal results from this. I've been messing around with porting this PR to fuzzball.js.

I'm wondering: would it work if a version of token_set also used the similarity sort?

For example, maybe the similarity sort could be used here:

sorted_2to1 = " ".join(sorted(diff2to1))

Also, partial is handled in _token_sim, but currently it will always be False?

nol13 commented 3 years ago

Also, I'm not sure of the maintenance status of this project anyway, but the new functions can be added at process.py line 97; otherwise they will miss some optimization. There are probably other optimizations hidden in there too, for example avoiding recalculating the counters every time.
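As a rough illustration of the "avoid recalculating the counters" point, the sketch below builds the query's per-token character counters once and reuses them for every choice. It does not reflect how process.py or this PR is structured: `token_counters`, `cosine_from_counters`, and `rank_choices` are hypothetical names, and the score used (mean of each query token's best cosine match) is only a placeholder to give the loop something to compute.

```python
import math
from collections import Counter


def token_counters(s):
    """Split a string into tokens and build one character Counter per token."""
    return [Counter(tok) for tok in s.lower().split()]


def cosine_from_counters(ca, cb):
    """Cosine similarity between two prebuilt character Counters."""
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm = math.sqrt(
        sum(v * v for v in ca.values()) * sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def rank_choices(query, choices):
    """Score every choice against one query, best first.

    The query counters are built once, outside the loop, instead of being
    rebuilt for each (query, choice) pair.
    """
    query_counters = token_counters(query)  # computed a single time
    scored = []
    for choice in choices:
        choice_counters = token_counters(choice)
        if not query_counters or not choice_counters:
            scored.append((choice, 0.0))
            continue
        # Placeholder score: mean of each query token's best match.
        total = sum(
            max(cosine_from_counters(qc, cc) for cc in choice_counters)
            for qc in query_counters
        )
        scored.append((choice, total / len(query_counters)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The same pattern would apply to the choices if the same list is searched repeatedly: their counters could be cached once and reused across queries.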

nol13 commented 3 years ago

I haven't tested it, but it looks like the order of the arguments might matter in some cases too? Not sure if it would matter enough to try running it both ways.

nol13 commented 3 years ago

I was getting good results in testing, so I added experimental support for this to fuzzball.js 1.4 and referenced this PR in the docs. I sort the arguments by number of tokens or string length before doing the similarity sort; it seemed to make sense to give the shorter one more precedence when sorting, and at least the result should be consistent. I have also added an option to use the similarity sort when calculating token_set_ratio.
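One way to get that consistency, sketched in Python rather than taken from fuzzball.js: put the arguments into a canonical order before scoring, so that swapping them cannot change the result. The names `order_arguments` and `make_symmetric` are hypothetical, and the exact key is an assumption (token count first, then string length, then the string itself as a final tie-break).

```python
def order_arguments(s1, s2):
    """Return the two strings with the 'shorter' one first.

    Assumed ordering: fewer tokens first, then fewer characters, then
    plain string comparison so that ties are still deterministic.
    """
    def key(s):
        return (len(s.split()), len(s), s)

    return tuple(sorted((s1, s2), key=key))


def make_symmetric(scorer):
    """Wrap a two-argument scorer so it always sees canonically ordered input."""
    def wrapped(s1, s2):
        first, second = order_arguments(s1, s2)
        return scorer(first, second)

    return wrapped


# Toy scorer that is deliberately order-dependent, for demonstration only.
def toy_scorer(a, b):
    return a + "|" + b


symmetric_toy = make_symmetric(toy_scorer)
```

Here `symmetric_toy("a b c", "x y")` and `symmetric_toy("x y", "a b c")` both pass `"x y"` to the scorer first, since it has fewer tokens.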