taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License

Returning Partial Match If capitalized #17

Closed spooknik closed 4 years ago

spooknik commented 4 years ago

Hello again,

I just noticed some unexpected behavior while working with my data. In some cases where I'm matching a capitalized word against its lowercase form, fuzzysearch returns only a partial match with the first letter missing. For example:

from fuzzysearch import find_near_matches

fuzzy_match = find_near_matches('double', 'Double', max_l_dist=2)
print(fuzzy_match)
[Match(start=1, end=6, dist=1, matched='ouble')]

The match is only partial: it returns 'ouble', and the indices (1:6) also cover only 'ouble'.

Thanks once again 👍

taleinat commented 4 years ago

Hi @spooknik!

In this case, 'ouble' and 'Double' both have a Levenshtein distance of 1 from 'double'. Returning multiple, overlapping results in cases like this is usually not useful, so fuzzysearch chooses a single "good" result. Since the initial 'D' doesn't actually match, fuzzysearch chooses to return the shorter result in this case.
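The two distances mentioned above can be checked by hand with a textbook dynamic-programming edit distance. This is a minimal sketch for illustration only: the `levenshtein` helper is hypothetical and is not how fuzzysearch computes its matches internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance: insertions, deletions and substitutions each cost 1."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        previous = current
    return previous[-1]


pattern = 'double'
for candidate in ('ouble', 'Double'):
    # Both candidates are one edit away from the pattern.
    print(candidate, levenshtein(pattern, candidate))
```

Both candidates come out at distance 1: 'ouble' needs one deletion and 'Double' needs one substitution, so neither is strictly closer to 'double' than the other.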

BTW, if you want to run case-insensitive searches, you can do so by first calling the .lower() or .casefold() methods of the sequences and sub-sequences.
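One caveat when normalizing case before searching: if the match offsets are later used to slice the original text, the normalization has to keep one output character per input character. The sketch below uses only the standard library; the `offsets_preserved` helper is a hypothetical name introduced for illustration and is not part of fuzzysearch.

```python
def offsets_preserved(text: str, normalize) -> bool:
    """Return True if `normalize` maps every character of `text` to exactly
    one character, so offsets found in the normalized text also index `text`."""
    return all(len(normalize(ch)) == 1 for ch in text)


text = 'Double Straße'

# str.lower keeps this text the same length, so offsets line up.
print(offsets_preserved(text, str.lower))

# str.casefold expands 'ß' to 'ss', so offsets after that point shift by one.
print(offsets_preserved(text, str.casefold))
print(text.casefold())
```

For plain ASCII text such as 'Double', both methods keep offsets intact, so either can be used when slicing the original text afterwards.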

We can dig a bit deeper if you provide some more background on what exactly you're trying to achieve.

spooknik commented 4 years ago

Thanks for the explanation, that makes it clear why 'ouble' is returned in this case.

The background is a translation tool that finds and replaces matched terms in a body of text. There is a list of known English terms and a parallel list of translated terms. The words in the text won't always match the English terms exactly, e.g. because of misspellings, hyphens, or British vs. American English.

So fuzzysearch is a way to pick up the variations in the input text and still get a match with the known English terms.

My code looks like:

from flashtext import KeywordProcessor
from fuzzysearch import find_near_matches

input_text = 'Long body of text that will contain the text to be translated.'

def fuzzySearch(term, text, return_string=None, max_distance=2):
    # Find every near match of `term` in `text` within the given Levenshtein distance.
    matches = find_near_matches(term, text, max_l_dist=max_distance)
    phrase = [text[m.start:m.end] for m in matches]  # Haven't updated this since 0.7.0 was released :)
    if not return_string:
        return phrase
    return " ".join(phrase)

keyword_processor = KeywordProcessor()  # Using Flashtext for word replacement
# source_list holds the English terms; target_list holds the translated terms in the same order.
for item, translation in zip(source_list, target_list):
    fuzzy_match = fuzzySearch(item, input_text, return_string=True, max_distance=2)
    if fuzzy_match:  # skip terms that have no match in the text
        keyword_processor.add_keyword(fuzzy_match, translation)
# Replace once, after all keywords have been registered.
input_text_replaced = keyword_processor.replace_keywords(input_text)

In my case I can just use fuzzy_match = fuzzySearch(item.lower(), input_text.lower(), max_distance=2) and the problem is solved.

The problem I was having is that sometimes words from the English list appear in the text body but aren't replaced. I was working backwards to find out why and stumbled upon the 'ouble' behavior explained above.

Forgive the long-winded explanation 👍

taleinat commented 4 years ago

Thanks for the details @spooknik, it's good to know how fuzzysearch is being used and the background for your issue!

I'm closing this issue since this doesn't bring to light an actual bug or missing feature.