taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License

Near matches get lost with increasing values of max_l_dist #38

Open davidefiocco opened 2 years ago

davidefiocco commented 2 years ago

To reproduce, I am using fuzzysearch==0.7.3 and running:

import fuzzysearch

text = "foo bar spam eggs "
query = "four"

With max_l_dist=2 I get one match:

fuzzysearch.find_near_matches(query, text, max_l_dist=2)
[Match(start=0, end=4, dist=2, matched='foo ')]

With max_l_dist=3 I get the previous match plus an additional one:

fuzzysearch.find_near_matches(query, text, max_l_dist=3)
[Match(start=0, end=4, dist=2, matched='foo '),
 Match(start=6, end=7, dist=3, matched='r')]

But with max_l_dist=4 the previous matches are no longer returned:

fuzzysearch.find_near_matches(query, text, max_l_dist=4)
[Match(start=0, end=0, dist=4, matched=''),
 Match(start=1, end=1, dist=4, matched=''),
 Match(start=2, end=2, dist=4, matched=''),
 Match(start=3, end=3, dist=4, matched=''),
 Match(start=4, end=4, dist=4, matched=''),
 Match(start=5, end=5, dist=4, matched=''),
 Match(start=6, end=6, dist=4, matched=''),
 Match(start=7, end=7, dist=4, matched=''),
 Match(start=8, end=8, dist=4, matched=''),
 Match(start=9, end=9, dist=4, matched=''),
 Match(start=10, end=10, dist=4, matched=''),
 Match(start=11, end=11, dist=4, matched=''),
 Match(start=12, end=12, dist=4, matched=''),
 Match(start=13, end=13, dist=4, matched=''),
 Match(start=14, end=14, dist=4, matched=''),
 Match(start=15, end=15, dist=4, matched=''),
 Match(start=16, end=16, dist=4, matched=''),
 Match(start=17, end=17, dist=4, matched=''),
 Match(start=18, end=18, dist=4, matched='')]

Is this intended behaviour?

taleinat commented 2 years ago

Hi @davidefiocco, apologies for the late response.

Yes, this is currently the intended behavior.

The reason is that once the maximum distance is equal to (or greater than) the length of what you're searching for (query in your example), even an empty string is a valid match.
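To make the distances concrete, here is a minimal sketch of Levenshtein distance in plain Python. It is not fuzzysearch's own implementation, and the `levenshtein` helper is a hypothetical name used only for illustration. It shows that the empty string is at distance 4 from "four", which is why it qualifies as a match once max_l_dist reaches 4.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance (illustrative helper only)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


query = "four"
print(levenshtein(query, ""))      # 4: delete every character of the query
print(levenshtein(query, "foo "))  # 2: matches the dist=2 result in the report
print(levenshtein(query, "r"))     # 3: matches the dist=3 result in the report
```

In general the distance from any query to the empty string equals len(query), so an empty match becomes valid exactly when max_l_dist >= len(query).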

However, looking at your example, I can see that this behavior isn't great: there are matches with a lower distance in the text, but these are no longer returned when the maximum distance is too large.

I'll think about how this can be improved without complicating things.
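In the meantime, one possible caller-side workaround is to keep the maximum distance strictly below the query length, so that an empty string can never qualify as a match. This is only a sketch and not part of fuzzysearch: `capped_max_l_dist` is a hypothetical helper, and the fuzzysearch call is shown as a comment so the block runs without the library installed.

```python
def capped_max_l_dist(query: str, max_l_dist: int) -> int:
    """Clamp max_l_dist to at most len(query) - 1 (hypothetical helper).

    An empty match has distance len(query), so staying below that value
    rules out empty matches.
    """
    return max(0, min(max_l_dist, len(query) - 1))


query = "four"
print(capped_max_l_dist(query, 4))  # 3
print(capped_max_l_dist(query, 2))  # 2

# Assumed usage with the call from the report above:
# fuzzysearch.find_near_matches(
#     query, text, max_l_dist=capped_max_l_dist(query, 4))
```

With the example above, a requested max_l_dist of 4 is reduced to 3, which is the setting that returned the two non-empty matches in the report. The trade-off is that matches at a distance of exactly len(query) or more are never returned.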