taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License

Functions give different results #5

AnnaNenarokova closed this issue 10 years ago

AnnaNenarokova commented 10 years ago

Hi! I had been looking for a suitable Python tool for a long time and was very pleased to find your package. Thank you for it. I have a question about the results of its functions. I tried this:

from fuzzysearch import find_near_matches

seq = 'NAGGTTGGTGGGTTGTTTTTATGGGATAAAATGCTTTAAGAACAAATGTATACTTTTAGAGAGTTCCCCGCGCCAGCGGGGATAAACCGTTGTCTTTCGCTGCTGAGGGTGACGATCCCGCGAGTTCCCTGCGCCAGGGGGGATAAACCGCTTTCGCAGACGCGCGGCGATACGCTCACGCAGAGTTGCCCGCGCCAGCGGGGATCAACCGCAGCCGAAGGCAAAGGTGATGACGAGATTGGAAGAGCGG'
subseq = 'GAGTTCCCCGCGCCAGCGGGGATAAACCGC'
for max_distance in range(5):
    print(find_near_matches(subseq, seq, max_distance))

It returns:

[]
[Match(start=60, end=90, dist=1)]
[Match(start=60, end=90, dist=1), Match(start=121, end=151, dist=2), Match(start=182, end=212, dist=2)]
[Match(start=60, end=90, dist=1), Match(start=182, end=212, dist=2)]
[Match(start=60, end=90, dist=1)]

I was very surprised that when I increase the maximum distance, the number of matches goes down instead of up. I tried find_near_matches_with_ngrams, but it gave the same result. Why is this happening? Maybe I'm doing something wrong? Thank you in advance!
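The expectation in this question can be checked independently of fuzzysearch: any substring within Levenshtein distance d of the pattern is also within distance d + 1, so raising the limit should never lose a match position. Below is a minimal sketch using a textbook dynamic-programming search (often attributed to Sellers). It is not fuzzysearch's implementation, and the name near_match_ends is made up for illustration. It reports only the end positions of matches, whereas fuzzysearch returns consolidated Match objects, so the counts are not directly comparable.

```python
def near_match_ends(pattern, text, max_dist):
    """Return 1-based end positions in `text` where some substring ending
    there is within Levenshtein distance `max_dist` of `pattern`."""
    # prev[i] = cost of matching pattern[:i] against an empty piece of text
    prev = list(range(len(pattern) + 1))
    ends = []
    for j, text_char in enumerate(text, start=1):
        # A match may start anywhere, so the empty pattern prefix costs 0.
        cur = [0]
        for i, pattern_char in enumerate(pattern, start=1):
            cur.append(min(
                prev[i - 1] + (pattern_char != text_char),  # match or substitution
                prev[i] + 1,      # extra character in the text
                cur[i - 1] + 1,   # pattern character missing from the text
            ))
        if cur[-1] <= max_dist:
            ends.append(j)
        prev = cur
    return ends


# Each larger limit should keep every end position found with a smaller one.
results = [near_match_ends("abc", "xxabcxx", d) for d in range(3)]
for smaller, larger in zip(results, results[1:]):
    assert set(smaller) <= set(larger)
print(results)
```

If a search over real data ever returns fewer positions for a larger limit, that points to a bug or to arguments being interpreted differently than intended, rather than to expected behavior.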

taleinat commented 10 years ago

Hi Anna,

I'm happy fuzzysearch is useful for you!

Which version of fuzzysearch are you using? With the latest version, 0.2.2, find_near_matches(subseq, seq, 0) raises an exception (as it should). Calling it with a limit on the maximum Levenshtein distance works and returns the expected results. To do that, pass max_l_dist=max_distance as a keyword argument instead of passing max_distance as the third positional argument.
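To illustrate why the keyword matters, here is a small sketch of how Python binds a third positional argument. The stub below is hypothetical: its parameter order (max_substitutions, max_insertions, max_deletions, max_l_dist) is an assumption about fuzzysearch 0.2.x, not something confirmed in this thread, although the "# insertions must be limited!" error is consistent with it. The helper bound_limits is also invented for this example.

```python
import inspect


def find_near_matches_stub(subsequence, sequence,
                           max_substitutions=None, max_insertions=None,
                           max_deletions=None, max_l_dist=None):
    """Hypothetical stand-in; only the ASSUMED parameter order matters."""


def bound_limits(*args, **kwargs):
    """Return the max_* parameters that a given call would actually set."""
    bound = inspect.signature(find_near_matches_stub).bind(*args, **kwargs)
    return {name: value for name, value in bound.arguments.items()
            if name.startswith("max_")}


# Third positional argument: lands on the first max_* parameter.
positional = bound_limits("GATTACA", "AGATTTACA", 3)
# Keyword argument: sets exactly the limit that was named.
keyword = bound_limits("GATTACA", "AGATTTACA", max_l_dist=3)

print(positional)
print(keyword)
```

Under that assumed signature, the positional call limits only substitutions and leaves the other limits unset, while the keyword call sets the Levenshtein distance limit directly.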

Here is my example interpreter session:

>>> import fuzzysearch
>>> fuzzysearch.__version__
'0.2.2'
>>> fuzzysearch.find_near_matches(subseq, seq, max_l_dist=4)
[Match(start=60, end=90, dist=1), Match(start=121, end=151, dist=2), Match(start=182, end=212, dist=2)]
>>> fuzzysearch.find_near_matches(subseq, seq, 3)
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    fuzzysearch.find_near_matches(subseq, seq, 3)
  File "C:\Python34\lib\site-packages\fuzzysearch\__init__.py", line 59, in find_near_matches
    raise ValueError('# insertions must be limited!')
ValueError: # insertions must be limited!

AnnaNenarokova commented 10 years ago

I have updated my fuzzysearch to 0.2.2, and everything works fine now. Thank you very much!

taleinat commented 10 years ago

Great!