Thanks for the input. I admit that it's a quick and dirty recipe that comes with implicit caveats.
It seems to group by string start rather than string end
Yes, that's what it does, which should imply that shorter matches occur before longer matches. I didn't test inner overlaps, though. The result might look funny in that case, e.g. if you search for "hallo", "hall" and "al" at the same time, you would probably get three matches rather than two.
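To illustrate what I mean, here is a minimal sketch of the group-by-start idea (not the recipe verbatim; it assumes pyahocorasick's Automaton.iter(), which yields (end_index, value) pairs):

import ahocorasick
from itertools import groupby

A = ahocorasick.Automaton()
for word in ("hallo", "hall", "al"):
    A.add_word(word, word)
A.make_automaton()

haystack = "hallo"
# iter() reports (end_index, value); convert to (start, word) pairs
found = sorted((end - len(word) + 1, word) for end, word in A.iter(haystack))

for start, group in groupby(found, key=lambda m: m[0]):
    print(start, max(word for _, word in group))
# start 0 keeps "hallo" (the maximum of "hall"/"hallo"), but "al" is still
# reported separately at start 1 even though "hallo" covers it completely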
It uses the maximum by string comparison, rather than maximum length
Since all matches start at the same position (by grouping condition), they all share the same prefix. And since they are all matches, they can only differ in length, not in intermediate characters. Thus, comparing the strings is the same as comparing their lengths.
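A tiny illustration of why the two comparisons agree when the strings share a prefix:

candidates = ["hall", "hallo"]              # same start => shared prefix
assert max(candidates) == "hallo"           # maximum by string comparison
assert max(candidates, key=len) == "hallo"  # maximum by length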
For some reason my machine goes to 100% load and never finishes; I honestly don't understand why.
Probably data dependent.
Regarding your proposal, note that "lambda x: len(x)" is just an inefficient spelling for "len". Maybe you meant "x[0]" instead of "x"?
The grouping condition "len(x[0]) + x[1]" seems somewhat arbitrary. It doesn't look like it would fix the case of inner overlapping matches.
I added a comment below the recipe in 640a4f418ecaa4e902d8685b21c2d2c4a0b51030 (and changed the indentation, sorry).
Also note that your (2) is a duplicate of #15. I'll close this ticket for now. If you have a better proposal for a longest_match() recipe (that is actually correct and still does not involve 20+ lines of code), I'd be happy to receive a pull request. Otherwise, I'd just go with "it's just a recipe".
OK, I think I know what the issue is.
There are four or five types of overlaps:
I believe your recipe addresses the first case (string start), whereas mine addresses the second case (string end).
I think it's OK to say the recipe is only a recipe, but to me, finding longest matches is a core requirement for string search, albeit not necessarily something the Aho-Corasick algorithm provides out of the box. Neither your solution nor mine handles all of those cases; one would have to check intervals, or have a state machine consume overlapping matches and pick out only the longest ones...
I think I have a working solution.
For readability, it uses a namedtuple for matches.
def longest_match(matches):
    # matches: an iterator of Match tuples (presumably sorted by start position)
    longest = next(matches, None)
    if not longest:
        return
    for elt in matches:
        # if (a contains b) or (b contains a)
        if ((elt.start >= longest.start and elt.end <= longest.end) or
                (longest.start >= elt.start and longest.end <= elt.end)):
            longest = max(longest, elt, key=lambda x: x.end - x.start)
        else:
            yield longest
            longest = elt
    yield longest
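For example, with a hypothetical Match namedtuple (the field names are just for illustration; only start and end are actually used), fully contained matches collapse into the longest one:

from collections import namedtuple

Match = namedtuple("Match", ["start", "end", "word"])

found = iter([
    Match(0, 5, "hallo"),   # outer match
    Match(1, 4, "all"),     # fully contained in "hallo"
    Match(3, 5, "lo"),      # also contained in "hallo"
])
print(list(longest_match(found)))
# -> [Match(start=0, end=5, word='hallo')]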
Thanks, but the algorithmic complexity of this code suggests that it a) is too much for a simple recipe and b) needs some serious corner case testing. Meaning, it does not belong in the documentation as it is. Not sure where it might fit better, though... it feels like an actual feature, but... It's not even clear that "officially" including a solution as a feature solves everyone's use cases. What if a user looks for "hallo" and "alloa", and the algorithm finds "halloa"? There is no reason to assume that users would uniformly prefer one or the other.
(BTW, "if not longest: return" seems surprising. When would that occur? One more corner case to test...)
I have lots of unit tests for this. The case of "if not longest", which I actually rewrote as "if longest is None", occurs when there are no matches. I'll probably create a new library that returns longest matches, using either acora or pyahocorasick as backend. Regards, Johannes
Hi, I'm having issues with your sample code for longest_match. Below is an implementation that I believe is giving "more correct" results for me...