Closed DanielBiskup closed 4 years ago
Hi @DanielBiskup,
As I mentioned in my comment on issue #28, `find_near_matches()` avoids returning overlapping matches. More lenient search criteria therefore do not always produce more matches, and your examples are exactly such a case, because the matches in them all overlap.
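To make the effect of overlap removal concrete, here is a minimal toy sketch. It is not fuzzysearch's actual implementation: `levenshtein`, `all_candidates` and `consolidate` are hypothetical helpers written for this illustration, and the rule of keeping the lowest-distance span per overlapping group is an assumption. It only shows how a larger allowed distance can chain candidates together and leave fewer results.

```python
def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def all_candidates(pattern, text, max_dist):
    """Every non-empty span (start, end, dist) of text within max_dist."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            dist = levenshtein(pattern, text[start:end])
            if dist <= max_dist:
                found.append((start, end, dist))
    return found


def consolidate(candidates):
    """Group chains of overlapping spans; keep one best span per group.

    Assumption for this sketch: "best" means lowest distance, then longest.
    """
    groups, group_end = [], None
    for cand in sorted(candidates):
        if groups and cand[0] < group_end:
            groups[-1].append(cand)
            group_end = max(group_end, cand[1])
        else:
            groups.append([cand])
            group_end = cand[1]
    return [min(g, key=lambda c: (c[2], -(c[1] - c[0]))) for g in groups]


strict = consolidate(all_candidates("ab", "abab", 0))
lenient = consolidate(all_candidates("ab", "abab", 1))
print(len(strict))   # 2
print(len(lenient))  # 1
```

With a distance of 0 the two exact occurrences do not overlap, so both survive. With a distance of 1, spans such as `"aba"` and `"bab"` bridge the two occurrences into a single overlapping group, and only one result is kept.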
Additionally, when `max_l_dist` is greater than or equal to the length of the searched sub-string, the search is trivial: any position can be considered a match. In such cases, `find_near_matches()` currently returns a list of matches with empty strings, one at each position in the searched string. This is what happened in the second call in your first example, and in the final two calls in your second example.
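The reason the search becomes trivial can be sketched in a few lines. Deleting every character of the sub-string turns it into the empty string, so the empty string is always at Levenshtein distance `len(sub-string)` from it. The helper `empty_match_qualifies` below is hypothetical and only illustrates this arithmetic; it does not call fuzzysearch.

```python
def empty_match_qualifies(pattern, max_l_dist):
    """True if an empty match is within max_l_dist of pattern.

    Turning pattern into "" takes exactly len(pattern) deletions,
    so that is its Levenshtein distance to the empty string.
    """
    return len(pattern) <= max_l_dist


pattern = "abc"
for max_l_dist in range(5):
    print(max_l_dist, empty_match_qualifies(pattern, max_l_dist))
```

For the three-character pattern here, the check becomes true from `max_l_dist = 3` onward, which is the point at which an empty match at any position satisfies the search criteria.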
I realize that this last point is far from obvious, and the results don't make it very clear. I'm open to suggestions on how this could be improved, whether by returning different results, raising an exception in such cases, or improving the documentation.
If you really need a function which will return all possible matches, you can try one of the low-level internal functions, perhaps `fuzzysearch.generic.find_near_matches_generic_linear_programming()` or `fuzzysearch.levenshtein.find_near_matches_levenshtein_linear_programming()`. Note, however, that this may be considerably slower, and may return a very large number of overlapping results when called with a large maximum allowed match distance.
I'm closing this because the behavior is as expected and no further response was received from the poster (@DanielBiskup). Feel free to continue the discussion if needed, and I'll re-open the issue if necessary.
This is on v0.7.1
outputs
Note how the matches with `dist=0` that are present in `matches_0` are missing from `matches_1`, which is not what I expected.
Another example:
outputs
Where the results for `max_l_dist = 1` and `2` are missing matches that appear for `max_l_dist = 0`. Similarly, the results for `max_l_dist = 3` and `max_l_dist = 4` are missing the one match from `max_l_dist = 2`.
Is this intended behavior? If so, could you help me understand why?