taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License
301 stars 26 forks source link

Unexpected behavior with only insertions allowed #29

Open artemuk opened 4 years ago

artemuk commented 4 years ago

Hi! I'm using fuzzysearch v0.7.1 and I got a bit unexpected results when only insertions are allowed. Here is the code:

matches = find_near_matches('12345', '123aa', max_deletions=0, max_substitutions=0, max_insertions=3)
print(matches)

Expected output:

[Match(start=0, end=3, dist=2, matched='123')]

I expected that substring '123' can be matched since we can insert '4' and '5' after it

Actual output

[]

But actually it doesn't return any matches.

Please tell me, am I missing something? Is this an intended behavior or a bug? Thank you in advance!

OS: Ubuntu 19.04 Setup: Python 3.7 (miniconda), fuzzysearch 0.7.1

taleinat commented 4 years ago

Hi @artemuk,

Thank you for such a clear and thorough description of the issue!

fuzzysearch uses the opposite meaning of what insertion/deletion means: It defines them relative to the searched sub-string. In your case, '12345' can be turned into '123' using 2 deletions, so allowing deletions rather than insertions gives the expected result:

>>> find_near_matches('12345', '123aa', max_deletions=3, max_substitutions=0, max_insertions=0)
[Match(start=0, end=3, dist=2, matched='123')]

In other words, find_near_matches() finds parts of the searched sequence, which can be arrived at from the searched string (e.g. '12345') by making the manipulations allowed by the search parameters.

The docs currently don't mention this at all, unfortunately, and they certainly should! I'll add something about this to the docs soon.

artemuk commented 4 years ago

Hi @taleinat,

Thank you very much for the explanation! It makes perfect sense now. I should've figured it out myself since it's actually mentioned in the docstring (now I see it :)) But yea, a bit more explicit mention of it will definitely do good!