seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0

[Feature Suggestion] Token permutation ratio #249

Open boechat107 opened 4 years ago

boechat107 commented 4 years ago

My motivation to suggest this feature comes from this example:

In : fuzz.token_sort_ratio('nfs-e Ndamero', 'numero nfs-e')
Out: 56

At least in my context, it is pretty clear that the two strings are very similar, but Python's sorted function orders the tokens in an "undesired" way (the typo in "Ndamero" changes its lexicographic position relative to "nfs-e"):

# Using lower() just simplifies the analysis.
In : sorted('nfs-e Ndamero'.lower().split())
Out: ['ndamero', 'nfs-e']

In : sorted('numero nfs-e'.split())
Out: ['nfs-e', 'numero']

It might be interesting to add a function that calculates the match for all possible permutations of tokens:

from itertools import permutations, product

from fuzzywuzzy import fuzz

def fuzzy_token_permutation_ratio(text1: str, text2: str) -> int:
    def mk_permutations(text):
        # Generate every ordering of the lowercased tokens, re-joined with spaces.
        return (' '.join(comb) for comb in permutations(text.lower().split()))
    t1_combs = mk_permutations(text1)
    t2_combs = mk_permutations(text2)
    # Return the best ratio over all pairs of token orderings.
    return max(fuzz.ratio(t1, t2) for t1, t2 in product(t1_combs, t2_combs))

In : fuzzy_token_permutation_ratio('nfs-e Ndamero', 'nfs-e numero')
Out: 88

Does this make sense to anyone else?

PS.: fuzzywuzzy is a great library!

shbunder commented 4 years ago

I posted a similar ticket about this same issue in #272.

In contrast to your solution, I would propose sorting the tokens more intelligently, e.g. by counting matching n-grams between the tokens of both strings, before applying fuzz.ratio.

I fear that calculating the full ratio for every possible permutation will blow up the running time (m! · n! ratio calls for strings of m and n tokens), especially when applying this function to a large set of examples, e.g. in database matching.
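A minimal sketch of this idea, assuming a greedy bigram-overlap alignment; the helper names are hypothetical (not part of fuzzywuzzy), and difflib's SequenceMatcher stands in for fuzz.ratio to keep the example dependency-free:

```python
from difflib import SequenceMatcher

def bigrams(token, n=2):
    # Set of character n-grams of a token (bigrams by default).
    return {token[i:i + n] for i in range(len(token) - n + 1)}

def aligned_token_sort_ratio(text1: str, text2: str) -> int:
    # Hypothetical helper: reorder the tokens of text1 so that each one
    # lines up with its best bigram match among the tokens of text2,
    # then compare the re-joined strings once.
    t1 = text1.lower().split()
    t2 = text2.lower().split()

    def overlap(a, b):
        # Jaccard overlap of bigram sets; the max(..., 1) guards
        # against one-character tokens with no bigrams.
        ga, gb = bigrams(a), bigrams(b)
        return len(ga & gb) / max(len(ga | gb), 1)

    # Greedily pick, for each token of t2, the most similar
    # remaining token of t1.
    remaining = list(t1)
    ordered = []
    for tok2 in t2:
        if not remaining:
            break
        best = max(remaining, key=lambda tok1: overlap(tok1, tok2))
        ordered.append(best)
        remaining.remove(best)
    ordered.extend(remaining)  # leftover tokens keep their order

    s1, s2 = ' '.join(ordered), ' '.join(t2)
    return int(round(SequenceMatcher(None, s1, s2).ratio() * 100))
```

On the example from this thread, this re-pairs "ndamero" with "numero" before comparing, at O(m·n) token comparisons instead of m!·n! full-string ratios.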

boechat107 commented 4 years ago

Uhum, I see. I think you do have a point, @shbunder.