taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License

ValueError: end_index must be non-negative (again) #32

Open jtlz2 opened 4 years ago

jtlz2 commented 4 years ago

This presents just as in #13. See below to reproduce. Awesome module, thanks!

Version info:

Python 2.7.16 |Anaconda custom (64-bit)| (default, Aug 22 2019, 10:59:10)
fuzzysearch.__version__ = 0.7.2

import fuzzysearch
fuzzysearch.find_near_matches('ABC 0123456', 'ABC', max_l_dist=1).next()

Traceback:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-9ccf0e63dac4> in <module>()
----> 1 fuzzysearch.find_near_matches('ABC 0123456', 'ABC', max_l_dist=1).next()

/anaconda2/lib/python2.7/site-packages/fuzzysearch/__init__.pyc in find_near_matches(subsequence, sequence, max_substitutions, max_insertions, max_deletions, max_l_dist)
     55     search_class = choose_search_class(search_params)
     56     matches = search_class.search(subsequence, sequence, search_params)
---> 57     return search_class.consolidate_matches(matches)
     58
     59

/anaconda2/lib/python2.7/site-packages/fuzzysearch/levenshtein.pyc in consolidate_matches(cls, matches)
    159     @classmethod
    160     def consolidate_matches(cls, matches):
--> 161         return consolidate_overlapping_matches(matches)
    162
    163     @classmethod

/anaconda2/lib/python2.7/site-packages/fuzzysearch/common.pyc in consolidate_overlapping_matches(matches)
    186 def consolidate_overlapping_matches(matches):
    187     """Replace overlapping matches with a single, "best" match."""
--> 188     groups = group_matches(matches)
    189     best_matches = [get_best_match_in_group(group) for group in groups]
    190     return sorted(best_matches)

/anaconda2/lib/python2.7/site-packages/fuzzysearch/common.pyc in group_matches(matches)
    162 def group_matches(matches):
    163     groups = []
--> 164     for match in matches:
    165         overlapping_groups = [g for g in groups if g.is_match_in_group(match)]
    166         if not overlapping_groups:

/anaconda2/lib/python2.7/site-packages/fuzzysearch/levenshtein.pyc in search(cls, subsequence, sequence, search_params)
    154     def search(cls, subsequence, sequence, search_params):
    155         for match in find_near_matches_levenshtein(subsequence, sequence,
--> 156                                                    search_params.max_l_dist):
    157             yield match
    158

/anaconda2/lib/python2.7/site-packages/fuzzysearch/levenshtein_ngram.pyc in find_near_matches_levenshtein_ngrams(subsequence, sequence, max_l_dist)
    175         start_index = max(0, ngram_start - max_l_dist)
    176         end_index = min(seq_len, seq_len - subseq_len + ngram_end + max_l_dist)
--> 177         for index in search_exact(subsequence[ngram_start:ngram_end], sequence, start_index, end_index):
    178             # try to expand left and/or right according to n_ngram
    179             dist_right, right_expand_size = _expand(

/anaconda2/lib/python2.7/site-packages/fuzzysearch/search_exact.pyc in search_exact(subsequence, sequence, start_index, end_index)
     69         try:
     70             return search_exact_byteslike(subsequence, sequence,
---> 71                                           start_index, end_index)
     72         except (TypeError, UnicodeEncodeError):
     73             return _search_exact(subsequence, sequence, start_index, end_index)

ValueError: end_index must be non-negative
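
Editorial sketch: the index arithmetic in the levenshtein_ngram.py frame above can be replayed by hand for the reproduction's arguments. The `start_index` and `end_index` formulas are copied from lines 175-176 of the traceback. The n-gram length and the loop range are assumptions, since the traceback does not show them, and `ngram_search_windows` is a hypothetical helper name, not part of fuzzysearch. The signature in the traceback takes `subsequence` first, so the reproduction searches for an 11-character pattern in a 3-character sequence. Whether that is the underlying cause is not established in this thread.

```python
def ngram_search_windows(subsequence, sequence, max_l_dist):
    """Return (ngram_start, ngram_end, start_index, end_index) per n-gram."""
    subseq_len = len(subsequence)
    seq_len = len(sequence)
    # Assumption: the pattern is split into max_l_dist + 1 equal n-grams.
    ngram_len = subseq_len // (max_l_dist + 1)
    if ngram_len == 0:
        raise ValueError('subsequence too short for this max_l_dist')
    windows = []
    # Assumption: n-grams are taken at consecutive, non-overlapping offsets.
    for ngram_start in range(0, subseq_len - ngram_len + 1, ngram_len):
        ngram_end = ngram_start + ngram_len
        # These two formulas mirror lines 175-176 shown in the traceback.
        start_index = max(0, ngram_start - max_l_dist)
        end_index = min(seq_len, seq_len - subseq_len + ngram_end + max_l_dist)
        windows.append((ngram_start, ngram_end, start_index, end_index))
    return windows


# Same argument order as the reproduction: 11-character subsequence first.
for window in ngram_search_windows('ABC 0123456', 'ABC', max_l_dist=1):
    print(window)
```

Under these assumptions the first window has `end_index == -2` (3 - 11 + 5 + 1), which is a negative value of the kind the error message complains about. With the two strings swapped, every window's `end_index` is non-negative.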

taleinat commented 4 years ago

> Awesome module, thanks!

Thanks for the kind words, I'm happy you're finding it useful! It would be great to hear what you're using it for.

taleinat commented 4 years ago

@jtlz2, which platform are you running this on? Windows / Linux / macOS, which exact version, 32 or 64 bit?

taleinat commented 4 years ago

@jtlz2, could you try running the same code, with bytes objects rather than strings? I.e.:

fuzzysearch.find_near_matches(b'ABC 0123456', b'ABC', max_l_dist=1).next()

jtlz2 commented 4 years ago

@taleinat Apologies, it's macOS 10.13.6.

We are trialling it for OCR post-processing.

The error is the same when using bytes as you suggest (ValueError at line 71 of search_exact.py).

Thanks again!

taleinat commented 4 years ago

@jtlz2, I've started working on this. It seems like a problem with the native (C) extensions.

In the meantime, you can install fuzzysearch without the native extensions by fetching a source archive, unpacking it, and running python setup.py install --noexts.

taleinat commented 4 years ago

@jtlz2, I've fixed what appears to be the source of this issue. The fix is available in version 0.7.3 which I've just released. Please let me know if it resolves this issue for you!

jtlz2 commented 4 years ago

@taleinat Still get the same problem in 0.7.3 :\

taleinat commented 4 years ago

> Still get the same problem in 0.7.3 :\

☹️

This seems to be related to the Anaconda distribution somehow: it appears to happen only with Anaconda, and not with Python from python.org or built from the main git repo. I'll have to investigate further when I have more time.