How to get fuzzy index?

seatgeek / fuzzywuzzy

Fuzzy String Matching in Python

http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/

GNU General Public License v2.0

9.2k stars 878 forks source link

How to get fuzzy index? #293

Open mridu-enigma opened 3 years ago

mridu-enigma commented 3 years ago

I am using fuzzywuzzy to look for key-phrase like terms in corpuses.

FWIW, when there's a tie I'd like the tie to be broken by earliest match, so: is there a way to get the fuzzy-index of a match? I tried all functions in fuzz and process (using dir() to discover funcs like QWRatio, etc.)

For instance, I want some mechanism that ranks fuzz.partial_ratio('alex', 'alexa not') higher than fuzz.partial_ratio('alex', 'not alexa'), but for fuzzy matches (that's a simplistic example). How can I achieve this?

acslater00 commented 3 years ago

I haven't thought about it deeply, but I think it's an ambiguous problem definition. You probably need a minimum threshold that you would consider a match before being able to implement. If you did that, a naive solution would be to just iterate the string and return the first index with a fuzzy-threshold matching your search key. There might be other more optimized solutions that would scale to longer strings. Hope this helps.

s-c-p commented 3 years ago

@mridu-enigma I don't understand it.

You want index of first token match or index where max similarity (density) is observed?

Say you are searching for o ordin in strings like co-ordinate, what output do you seek? 1 or 3?

@acslater00 I haven't grokked the entire code, perhaps you could shed some light of correct behavior.

maxbachmann commented 3 years ago

fuzz.partial_ratio searches for the best alignment of the shorter string to the longer string. It does not matter which way you insert them in as long as they do not have a similar length (for similar lengths the results can differ) In the code this is achieved by

    if len(s1) <= len(s2):
        shorter = s1
        longer = s2
    else:
        shorter = s2
        longer = s1

Since you mention that you match a key against a phrase I assume that the key always has to be shorter than the phrase, so you might be able to implement this the following way:

def your_scorer(s1, s2):
  if len(s1) > len(s2):
    return 0
  return fuzz.partial_ratio(s1, s2)