Returning Partial Match If capitalized

spooknik commented 4 years ago

Hello again,

Just noticed unexpected behavior while working with some of my data. In some instances where I'm matching a capitalized word against a lowercase word, fuzzysearch only returns a partial match with the first letter missing. For example:

from fuzzysearch import find_near_matches

fuzzy_match = find_near_matches('double', 'Double', max_l_dist=2)
print(fuzzy_match)
[Match(start=1, end=6, dist=1, matched='ouble')]

The match is only partial: it returns 'ouble', and the reported indices (1:6) likewise cover only 'ouble'.

Thanks once again 👍

taleinat commented 4 years ago

Hi @spooknik!

In this case, 'ouble' and 'Double' both have a Levenshtein distance of 1 from 'double'. Returning multiple, overlapping results in cases like this is usually not useful, so fuzzysearch chooses a single "good" result. Since the initial 'D' doesn't actually match, fuzzysearch chooses to return the shorter result in this case.
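
To make the tie concrete, here is a minimal sketch of a plain Levenshtein distance. Note that `levenshtein` is a hypothetical helper written only for this illustration; it is not part of fuzzysearch's API and says nothing about how fuzzysearch computes distances internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein('double', 'ouble'))              # 1: the leading 'd' is dropped
print(levenshtein('double', 'Double'))             # 1: 'd' is replaced by 'D'
print(levenshtein('double', 'Double'.casefold()))  # 0: case no longer counts
```

Both candidates cost exactly one edit, so neither is strictly better by distance alone.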

BTW, if you want to run case-insensitive searches, you can do so by first calling .lower() or .casefold() on both the sequence being searched and the sub-sequence you are searching for.
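
A rough sketch of that approach follows. `find_ignoring_case` and its `search` parameter are hypothetical names introduced for this example. By default it calls fuzzysearch's `find_near_matches` on lowercased copies, then slices the original text so the result keeps the original capitalization. It assumes lowercasing does not change the length of the text, which can fail for a few characters such as 'İ', so that case is rejected explicitly.

```python
def find_ignoring_case(term, text, max_l_dist=2, search=None):
    """Fuzzy-search `text` for `term`, ignoring case (hypothetical helper).

    `search` defaults to fuzzysearch.find_near_matches; any callable that
    accepts (subsequence, sequence, max_l_dist=...) and returns objects with
    start, end and dist attributes can be swapped in.
    """
    if search is None:
        # Third-party dependency, imported lazily: pip install fuzzysearch
        from fuzzysearch import find_near_matches as search

    lowered = text.lower()
    # Match indices refer to `lowered`; they only line up with `text`
    # if lowercasing kept the length unchanged.
    if len(lowered) != len(text):
        raise ValueError("lowercasing changed the text length")

    matches = search(term.lower(), lowered, max_l_dist=max_l_dist)
    # Slice the original text so the caller sees the original capitalization.
    return [(m.start, m.end, m.dist, text[m.start:m.end]) for m in matches]
```

With fuzzysearch installed, `find_ignoring_case('double', 'Double')` compares 'double' against 'double', so the leading letter now matches and the whole word can be reported rather than only 'ouble'.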

We can dig a bit deeper if you provide some more background on what exactly you're trying to achieve.

spooknik commented 4 years ago

Thanks for the explanation, that makes perfect sense why 'ouble' is returned in this case.

My background is a translation tool that finds and replaces matched terms in a body of text. There is a list of known English terms and a parallel list of translated terms. The words in the text won't always match the English terms exactly, e.g. because of misspellings, hyphens, or British vs. American English.

So fuzzysearch is a way to pick up the variations in the input text and still get a match with the known English terms.

My code looks like:

from flashtext import KeywordProcessor
from fuzzysearch import find_near_matches

input_text = 'Long body of text that will contain the text to be translated.'

def fuzzySearch(term, text, return_string=None, max_distance=2):
    matches = find_near_matches(term, text, max_l_dist=max_distance)
    phrase = [text[m.start:m.end] for m in matches]  # Haven't updated this since 0.7.0 was released :)
    if not return_string:
        return phrase
    return " ".join(phrase)  # all matched substrings joined into one string

keyword_processor = KeywordProcessor()  # Using Flashtext for word replacement
# source_list is the list of English terms, target_list the parallel list of translated terms
for item, translation in zip(source_list, target_list):
    fuzzy_match = fuzzySearch(item, input_text, return_string=True, max_distance=2)
    keyword_processor.add_keyword(fuzzy_match, translation)
# Replace once, after all keywords have been registered
input_text_replaced = keyword_processor.replace_keywords(input_text)

In my case I can just use fuzzy_match = fuzzySearch(item.lower(), input_text.lower(), max_distance=2) and the problem is solved.

The problem I was having is that sometimes words from the English list appear in the text body but aren't replaced. I was working backwards to find out why and stumbled upon the 'ouble' behavior explained above.

Forgive the long-winded explanation 👍

taleinat commented 4 years ago

Thanks for the details @spooknik, it's good to know how fuzzysearch is being used and the background for your issue!

I'm closing this issue since this doesn't bring to light an actual bug or missing feature.

taleinat / fuzzysearch

Returning Partial Match If capitalized #17