scoder / acora

Fast multi-keyword search engine for text strings
http://pypi.python.org/pypi/acora
BSD 3-Clause "New" or "Revised" License
247 stars 17 forks source link

longest match #17

Closed codinguncut closed 7 years ago

codinguncut commented 7 years ago

Hi, I'm having issues with your sample code for longest_match.

  1. It doesn't seem to generate correct results for me.
    • It seems to group by string start rather than string end
    • It uses the maximum by string comparison, rather than maximum length
  2. For some reason my machine goes to 100% load and never finishes; I honestly don't understand why.

Below an implementation that I believe is giving "more correct" results for me...

def search_longest(matches):
         for pos, match_set in groupby(matches, lambda x: len(x[0]) + x[1]):
             yield max(match_set, key=lambda x: len(x))
scoder commented 7 years ago

Thanks for the input. I admit that it's a quick and dirty recipe that comes with implicit caveats.

It seems to group by string start rather than string end

Yes, that's what it does. Should imply that shorter matches occur before longer matches. Didn't test inner overlaps, though. The result might look funny in that case, e.g. if you search for "hallo", "hall" and "al" at the same time, you would probably get three matches rather than two.

It uses the maximum by string comparison, rather than maximum length

Since all matches start at the same position (by grouping condition), they all share the same prefix. And since they are all matches, they can only differ in length, not in intermediate characters. Thus, comparing the strings is the same as comparing their lengths.

For some reason my machine goes to 100% load and never finishes; I honestly don't understand why.

Probably data dependent.

Regarding your proposal, note that lambda x: len(x) is just an inefficient spelling for len. Maybe you meant x[0] instead of x? The grouping condition len(x[0]) + x[1] seems somewhat arbitrary. It doesn't look like it would fix the case of inner overlap matches.

scoder commented 7 years ago

I added a comment below the recipe in 640a4f418ecaa4e902d8685b21c2d2c4a0b51030 (and changed the indentation, sorry).

scoder commented 7 years ago

Also note that your (2) is a duplicate of #15. I'll close this ticket for now. If you have a better proposal for a longest_match() recipe (that is actually correct and still does not involve 20+ lines of code), I'd be happy to receive a pull request. Otherwise, i'd just go with "it's just a recipe".

codinguncut commented 7 years ago

ok, I think I know what the issue is.

there are four-five types of overlaps:

I believe your recipe addresses the first case (string start), whereas mine addresses the second case (string end)

I think it's OK to say the recipe is only a recipe, but actually at least to me looking for longest matches is a core requirement for string search, albeit not necessarily something Aho Corasick algorithm provides out of the box. Neither your solution nor mine solves all of those cases, but rather one would have to check intervals or have a state machine consume overlapping matches to pick out only the longest ones...

codinguncut commented 7 years ago

I think I have a working solution. For readability uses namedtuple matches.

def longest_match(matches):
    longest = next(matches)
    if not longest:
        return

    for elt in matches:
        # if (a contains b) or (b contains a)
        if (    (elt.start >= longest.start and elt.end <= longest.end) or
                (longest.start >= elt.start and longest.end <= elt.end)):
            longest = max(longest, elt, key=lambda x: x.end - x.start)
        else:
            yield longest
            longest = elt
    yield longest
scoder commented 7 years ago

Thanks, but the algorithmic complexity of this code suggests that this a) is too much for a simple recipe and b) needs some serious corner case testing. Meaning, it does not belong into the documentation as it is. Not sure where it might fit better, though... it feels like an actual feature, but... It's not even clear that "officially" including a solution as a feature solves everyone's use cases. What if a user looks for "hallo" and "alloa", and the algorithm finds "halloa"? There is no reasons to assume that users would uniformly prefer one or the other.

(BTW, "if not longest: return" seems surprising. When would that occur? One more corner case to test...)

codinguncut commented 7 years ago

I have lots of unit tests for this. the case of "if not longest" which I actually rewrote as "if longest is None" occurs when there are no matches. I'll probably create a new library that returns longest matches, using either acora or pyahocorasick as backend. Regards, Johannes