This adds an extra compilation step for PCRE regexes, generating programs for a group capture resolution abstract machine. The implementation builds on the approach described in "Regular Expression Matching: the Virtual Machine Approach", with adaptations to reduce the runtime memory footprint, handle some edge cases the same way PCRE does, explicitly detect other PCRE edge cases and reject them as unsupported, and keep group captures working when the resulting DFAs are combined with other DFAs into a single DFA that matches them all in one pass.
I have fuzzed this code quite a lot, both comparing group capture results to PCRE's and comparing libfsm's handling of individual regexes' DFAs with the DFA produced by combining them. Many of the test cases in tests/capture/capture_test_case_list.c came from fuzzing. My working branch diverged from main for a while during this work, particularly in the metadata associated with end states, but after integrating upstream changes I fuzzed it further. I think I integrated things properly, but it's something to be mindful of during review. Some unrelated issues discovered during fuzzing have already been posted as separate PRs.
Worst-case memory usage is proportional to the capvm program's length (known at compile time) and the input length -- we record a bit for each branch taken while evaluating the regex (e.g. either greedily jumping to repeat a subexpression or non-greedily advancing in the regex), so memory usage grows slowly as more input is consumed. Common path prefixes are shared between threads, and the total thread count (and divergence) is bounded by the size of the opcode program. As a special case, a long path prefix of all zero bits is collapsed down to a counter; this usually happens because of an unanchored start loop, where each zero bit represents a path continuing to advance through the input without starting the match yet.
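To illustrate the zero-prefix special case, here is a minimal sketch of a single path record, not the code in this PR. The names (struct path, path_push, path_bit, PATH_MAX_TAIL_BITS) and the fixed 64-bit tail are assumptions made for the example, and prefix sharing between threads is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit for this sketch only. */
#define PATH_MAX_TAIL_BITS 64

/* Hypothetical path record: the leading run of 0 bits is stored as a
 * count, followed by the explicit branch bits recorded after it. */
struct path {
	size_t zero_prefix;  /* number of collapsed leading 0 bits */
	uint64_t tail;       /* explicit bits, oldest bit in bit 0 */
	unsigned tail_len;   /* number of valid bits in tail */
};

static void
path_init(struct path *p)
{
	p->zero_prefix = 0;
	p->tail = 0;
	p->tail_len = 0;
}

/* Record one branch bit. Returns 1 on success, 0 if the explicit
 * tail is full (a real table would grow or chain here instead). */
static int
path_push(struct path *p, int bit)
{
	assert(bit == 0 || bit == 1);

	/* Still inside the all-zero prefix: bump the counter only. */
	if (bit == 0 && p->tail_len == 0) {
		p->zero_prefix++;
		return 1;
	}

	if (p->tail_len == PATH_MAX_TAIL_BITS) {
		return 0;
	}

	if (bit) {
		p->tail |= (uint64_t)1 << p->tail_len;
	}
	p->tail_len++;
	return 1;
}

/* Total number of branch bits recorded on this path. */
static size_t
path_length(const struct path *p)
{
	return p->zero_prefix + p->tail_len;
}

/* Read back the i-th recorded branch bit (0 is the oldest). */
static int
path_bit(const struct path *p, size_t i)
{
	assert(i < path_length(p));

	if (i < p->zero_prefix) {
		return 0;
	}
	return (int)((p->tail >> (i - p->zero_prefix)) & 1);
}
```

In this sketch, an unanchored start loop that skips 1000 input bytes before the match begins costs one counter increment per byte rather than 1000 stored bits; only the bits recorded after the first 1 bit take up space in the tail.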
This PR targets an integration branch because some work is not yet complete:
Some quick cleanup of CI issues -- the submodule's build config is more strict about variables that are set but unused (because their use is in logging code that compiles away).
A few pathological cases that combine repetition with anchors are rejected as unsupported, such as ^(()($)|x)+$. These are probably not worth the trouble to support. The \z operator is also not supported yet.
Some of the combinations of CLI flags for re aren't enabled yet, such as capture resolution for multiple files, and the re docs need updating for some new flags.
We should be able to calculate likely memory usage info at regex compile time, which could be used by the caller to stack-allocate and reuse a preallocated buffer rather than using dynamic allocation at runtime, but I haven't implemented that yet. The memory usage should be small in practice, but it would be nice to completely eliminate runtime dynamic allocation. The path metadata table currently grows on demand; the other data structures are either fixed-size or the same length as the opcode count. Cases where it would hit the caller's memory limit for growing the path table are likely to hit PCRE's limits much sooner.
Code generation does not output the capture resolution abstract machine's opcodes yet. When using capture handling, libfsm's caller will need to either link a small implementation of the abstract machine or have its code generation produce a standalone implementation, but because that brings some architectural trade-offs I wanted to wait until this round has been reviewed.