timbray / quamina

Home of Quamina, a fast pattern-matching library in Go
Apache License 2.0
373 stars 18 forks source link

addresses issue #188 #314

Closed timbray closed 1 month ago

timbray commented 1 month ago

prevent state explosions with epsilon transitions

The introduction of proper NFA epsilon transitions - see https://swtch.com/~rsc/regexp/regexp1.html - totally fixed this problem. The most dramatic illustration is line 96 of shell_style_test.go. Previously, you could only use 20 or so shell_style patterns before an O(2**N) explosion leading to millions of states. Now you can add all 12K in the test file. The addPattern() runs almost instantly, although matching slows down, a fair trade-off.

Once this is landed, we can start implementing all the cool stuff that was-ruler has without worry about exploding NFAs.

codecov-commenter commented 1 month ago

Codecov Report

Attention: Patch coverage is 95.23810% with 9 lines in your changes missing coverage. Please review.

Project coverage is 96.10%. Comparing base (5eee82d) to head (37b33ab).

Files Patch % Lines
prettyprinter.go 70.58% 4 Missing and 1 partial :warning:
nfa.go 94.00% 2 Missing and 1 partial :warning:
core_matcher.go 92.30% 1 Missing :warning:

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #314 +/- ## ========================================== - Coverage 96.39% 96.10% -0.30% ========================================== Files 18 18 Lines 1718 1744 +26 ========================================== + Hits 1656 1676 +20 - Misses 35 40 +5 - Partials 27 28 +1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

timbray commented 1 month ago

Oh drat, I somehow missed go test -v -race -count=1 looks like a real race condition. Please ignore this PR for now.

timbray commented 1 month ago

You may have noticed a bit of thrashing in this PR

There's an important trade-off to consider.

Performance

To summarize, #314 cleanly kills issue #188 which means that you can freely use "*" globs, as many as you want, and you won't get an O(2**N) explosion in your memory size and addPattern() latency. I'll also add the ability to have multiple wildcards in a single pattern and that will be fine too. Now we have a clear path to doing full regular expressions in Quamina.

However, there is a cost. #314 adds "epsilon transitions" and multi-state traversal to our automata, which solves a lot of problems but has a performance cost. The latest versions, compared to the current main branch on GitHub, is 20% slower on exact-match, 15% slower on prefix, and 10% slower on anything-but. (The most recent push is much worse but I've fixed that in my version, back to around 20%). I care about performance a lot, this is bad.

On the other hand:

  1. It's still a lot faster than aws/event-ruler
  2. It still does millions of matches/second
  3. The new epsilon-transition path hasn't received the kind of intense optimization that the existing one has.

Question for the audience: Should we go ahead and accept the performance penalty in exchange for opening the gates to lots of nice features? Is there anyone now using Quamina who's going to be troubled by a ~20% performance hit?

jsmorph commented 1 month ago

That performance hit won't cause me any heartburn.

Does this work help at all with a future numeric implementation?

timbray commented 1 month ago

Does this work help at all with a future numeric implementation?

I'm not sure. Because of the way smallTables work, I don't think there any need for nondeterministic automata to do that. Which in fact we managed to do with Ruler back in the day.

timbray commented 1 month ago

Hearing no objections… Once again, I acknowledge that potential reviewers are probably not eager about digging through finite-automata theory, so unless someone screams STOP I'll push this Monday.