taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License
301 stars 26 forks source link

Call with max_l_dist=1 results in fewer matches then with max_l_dist=0 #27

Closed DanielBiskup closed 4 years ago

DanielBiskup commented 4 years ago

This is on v0.7.1

text_to_search_for = "x"
text_to_search_in = "xxx"

matches_0 = fuzzysearch.find_near_matches(text_to_search_for, text_to_search_in, max_l_dist=0)
print(f"Matches with max_l_dist=0\n{matches_0}")
print()
matches_1 = fuzzysearch.find_near_matches(text_to_search_for, text_to_search_in, max_l_dist=1)
print(f"Matches with max_l_dist=1\n{matches_1}")

outputs

Matches with max_l_dist=0
[Match(start=0, end=1, dist=0, matched='x'), Match(start=1, end=2, dist=0, matched='x'), Match(start=2, end=3, dist=0, matched='x')]

Matches with max_l_dist=1
[Match(start=0, end=0, dist=1, matched=''), Match(start=1, end=1, dist=1, matched=''), Match(start=2, end=2, dist=1, matched=''), Match(start=3, end=3, dist=1, matched='')]

Note how the matches with dist=0 that are present in matches_0 are missing from matches_1, which is not what I expected.

Another example:

text_to_search_for = "xxx"
text_to_search_in = "xxxxx"
for i in range(5):
    matches = fuzzysearch.find_near_matches(text_to_search_for, text_to_search_in, max_l_dist=i)
    print(f"Matches with max_l_dist={i}; len(text_to_search_for)={len(text_to_search_for)}\n{matches}\n")

outputs

Matches with max_l_dist=0; len(text_to_search_for)=3
[Match(start=0, end=3, dist=0, matched='xxx'), Match(start=1, end=4, dist=0, matched='xxx'), Match(start=2, end=5, dist=0, matched='xxx')]

Matches with max_l_dist=1; len(text_to_search_for)=3
[Match(start=1, end=4, dist=0, matched='xxx')]

Matches with max_l_dist=2; len(text_to_search_for)=3
[Match(start=0, end=3, dist=0, matched='xxx')]

Matches with max_l_dist=3; len(text_to_search_for)=3
[Match(start=0, end=0, dist=3, matched=''), Match(start=1, end=1, dist=3, matched=''), Match(start=2, end=2, dist=3, matched=''), Match(start=3, end=3, dist=3, matched=''), Match(start=4, end=4, dist=3, matched=''), Match(start=5, end=5, dist=3, matched='')]

Matches with max_l_dist=4; len(text_to_search_for)=3
[Match(start=0, end=0, dist=3, matched=''), Match(start=1, end=1, dist=3, matched=''), Match(start=2, end=2, dist=3, matched=''), Match(start=3, end=3, dist=3, matched=''), Match(start=4, end=4, dist=3, matched=''), Match(start=5, end=5, dist=3, matched='')]

Where the results for max_l_dist = 1 and 2 are missing matches that appear for max_l_dist=0.

Similarly the results for max_l_dist = 3 and max_l_dist = 4 are missing the one match from max_l_dist = 2.

Is this intended behavior? If yes, would you mind to help me understand?

taleinat commented 4 years ago

Hi @DanielBiskup,

As I mentioned in my comment on issue #28, find_near_matches() avoids returning overlapping matches, and therefore your expectation of more lenient search criteria always resulting in more matches doesn't always hold, including specifically in examples such as those you've used where the matches all overlap.

Additionally, when max_l_dist is greater than or equal to the length of the searched sub-string, the search is trivial - any position can be considered a match. In such cases, find_near_matches() currently returns a list of matches with empty strings, one at each position in the searched string. This is what happened in the second call in your first example, and in the final two calls in your second example.

I realize that this last point is far from obvious, and the results don't make it very clear. I'm open to suggestions on how this could be improved, whether by returning different results, raising an exception in such cases, or improving the documentation.

If you really need a function which will return all possible matches, you can try one of the low-level internal functions, perhaps fuzzysearch.generic.find_near_matches_generic_linear_programming() or fuzzysearch.levenshtein.find_near_matches_levenshtein_linear_programming(). Note, however, that this may be considerably slower, and may return a very large number of overlapping results when called with a large max. allowed match distance.

taleinat commented 4 years ago

I'm closing this due to the behavior being as expected and no further response was received from the poster (@DanielBiskup) . Feel free to continue the discussion if needed, and I'll re-open the issue if necessary.