taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License

stopping search after finding the first good acceptable approximate match #36

Closed tensorfoo closed 2 years ago

tensorfoo commented 3 years ago

Hey, thanks for this library. Just a question: is there a way to tell the search to bail out early? This is mainly an issue with small patterns in a large corpus, where you can easily find lots of approximate matches, which can take too long to complete when you'd be happy with the first one you encountered. Maybe an optional parameter nmatches, which I could set to nmatches=1 to get the behaviour I want; by default it could be None to replicate the current behaviour. Though it doesn't make intuitive sense for that to be the case... haha.

taleinat commented 3 years ago

Hi @tensorfoo, I'm happy you find fuzzysearch useful!

Many of the underlying implementations support this, being implemented as generator functions. You could use the following to get just the first match, or None if there are no matches:

from fuzzysearch.common import LevenshteinSearchParams
from fuzzysearch.generic_search import find_near_matches_generic

# Fill in the limits you need; any of them may be left as None (unlimited).
search_params = LevenshteinSearchParams(max_substitutions,
                                        max_insertions,
                                        max_deletions,
                                        max_l_dist)
# The generic search yields matches lazily, so next() returns as soon as the
# first match is found, or None if there are no matches at all.
matches_iterator = find_near_matches_generic(subsequence, sequence, search_params)
match = next(matches_iterator, None)
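
For instance, a minimal self-contained sketch of the above (the pattern, text and limit values here are illustrative placeholders, not part of the library):

from fuzzysearch.common import LevenshteinSearchParams
from fuzzysearch.generic_search import find_near_matches_generic

# Illustrative placeholder values -- substitute your own pattern and corpus.
subsequence = "PATTERN"
sequence = "...some long text with a PATERN buried somewhere in it..."
search_params = LevenshteinSearchParams(max_substitutions=None,
                                        max_insertions=None,
                                        max_deletions=None,
                                        max_l_dist=2)
# Take only the first match; None if nothing matches within the given limits.
first_match = next(find_near_matches_generic(subsequence, sequence, search_params), None)
print(first_match)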

The find_near_matches() function doesn't support this usage pattern, though: it collects all possible matches and then consolidates overlapping ones. Also, the lazy behaviour described above isn't implemented consistently across the case-specific optimized implementations, nor in some of the Cython and C implementations.

> This is mainly an issue with small patterns in a large corpus, where you can easily find lots of approximate matches, which can take too long to complete when you'd be happy with the first one you encountered.

Do try my suggestion above; it should work nicely in such a case!

tensorfoo commented 3 years ago

Thank you so much, that's awesome. I'm giving it a shot now. Just a quick question: is it OK if I only specify max_l_dist=something and leave the rest unspecified (i.e. None)? This part of the code seems to suggest not, but I might be misunderstanding. I just want the same result (the first one) as find_near_matches(), so if just setting max_l_dist achieves that, I'm happy.

Edit/update: my previous code was taking over 30 minutes to finish a job before I got impatient and killed it, but with the restriction of grabbing only the first match it finished in 3 minutes. So it solved my problem nicely!

taleinat commented 2 years ago

> is it OK if I only specify max_l_dist=something and leave the rest unspecified (i.e. None)?

Yes! You can specify pretty much any combination of limits (and you'll get a clear error if you happen to use an invalid combination).
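
For example, a minimal sketch (the distance value is just a placeholder; this assumes the limits are accepted as keyword arguments matching the parameter names shown earlier):

from fuzzysearch.common import LevenshteinSearchParams

# Only max_l_dist is specified; the remaining limits stay unspecified (None),
# much like calling find_near_matches(..., max_l_dist=2).
search_params = LevenshteinSearchParams(max_l_dist=2)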

> Edit/update: my previous code was taking over 30 minutes to finish a job before I got impatient and killed it, but with the restriction of grabbing only the first match it finished in 3 minutes. So it solved my problem nicely!

I'm happy to hear that!