taleinat / fuzzysearch

Find parts of long text or data, allowing for some changes/typos.
MIT License

Call with looser #28

Closed DanielBiskup closed 4 years ago

DanielBiskup commented 4 years ago

On v0.7.1. Related to Issue 18 and Issue 27.

I would expect a call to find_near_matches with looser constraints to return a superset of what a call with more restrictive constraints returns. But that's not the case:

from fuzzysearch import find_near_matches

text_to_search_for = "death"
text_to_search_in = "after de t, is candy"

# Stricter call: no insertions allowed, at most one substitution.
matches = find_near_matches(text_to_search_for, text_to_search_in, max_l_dist=2, max_substitutions=1, max_insertions=0, max_deletions=2)
print(f"Matches with more constraints:\n{matches}")

# Looser call: every per-operation limit equals max_l_dist.
matches = find_near_matches(text_to_search_for, text_to_search_in, max_l_dist=2, max_substitutions=2, max_insertions=2, max_deletions=2)
# which is equivalent to:
#  matches = find_near_matches(text_to_search_for, text_to_search_in, max_l_dist=2)
print(f"Matches with less constraints:\n{matches}")

This outputs:

Matches with more constraints:
[Match(start=6, end=10, dist=2, matched='de t')]
Matches with less constraints:
[Match(start=6, end=11, dist=2, matched='de t,')]

With the looser constraints, I would have expected the matches to also include Match(start=6, end=10, dist=2, matched='de t'), like this:

Matches with less constraints:
[Match(start=6, end=10, dist=2, matched='de t'), Match(start=6, end=11, dist=2, matched='de t,')]

taleinat commented 4 years ago

Hi @DanielBiskup,

find_near_matches() avoids returning overlapping matches. Instead, it consolidates groups of overlapping matches, and returns the "best" of each group: the match with the lowest distance, and of those, the longest match. Therefore, the behavior you describe is the intended one.
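
To illustrate the consolidation rule described above, here is a minimal sketch in plain Python. It is not fuzzysearch's actual implementation: the `Match` tuple is a hypothetical stand-in for the library's class, and `pick_best` and `consolidate_overlapping` are names made up for this example. It assumes that matches sharing any character range belong to the same group.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    # Hypothetical stand-in for the library's Match; `end` is exclusive.
    start: int
    end: int
    dist: int
    matched: str


def pick_best(group: List[Match]) -> Match:
    # Lowest distance wins; among equal distances, the longest match wins.
    return min(group, key=lambda m: (m.dist, -(m.end - m.start)))


def consolidate_overlapping(matches: List[Match]) -> List[Match]:
    """Return one "best" match per group of overlapping matches."""
    groups: List[List[Match]] = []
    for m in sorted(matches, key=lambda m: (m.start, m.end)):
        # m joins the current group if it starts before that group ends.
        if groups and m.start < max(x.end for x in groups[-1]):
            groups[-1].append(m)
        else:
            groups.append([m])
    return [pick_best(g) for g in groups]


text = "after de t, is candy"
# The two candidates from this issue: both have distance 2 and overlap.
candidates = [
    Match(6, 10, 2, text[6:10]),  # 'de t'
    Match(6, 11, 2, text[6:11]),  # 'de t,'
]
result = consolidate_overlapping(candidates)
print(result)
```

Under these assumptions both candidates fall into one group and tie on distance, so only the longer one, 'de t,', is kept. That matches the output of the looser call reported above.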

The various internal search functions behave more in line with your expectations, but they differ in exactly which results they return. Some of them do not exhaustively return all possible matches.

For some more details, see the usage page in the docs.