katef / libfsm

DFA regular expression library & friends
BSD 2-Clause "Simplified" License
931 stars 52 forks source link

Integrate combinable DFA capture resolution for our PCRE dialect regexes, part 1. #440

Open silentbicycle opened 1 year ago

silentbicycle commented 1 year ago

This adds an extra compilation step for PCRE regexes, generating programs for a group capture resolution abstract machine. The implementation builds on the approach described in "Regular Expression Matching: the Virtual Machine Approach", with adaptations to reduce the runtime memory footprint, handle some edge cases the same way PCRE does, explicitly detect some other PCRE edge cases and reject them as unsupported, and continue supporting group captures as the resulting DFAs are combined with other DFAs into a single DFA that matches them all in one pass

I have fuzzed this code quite a lot, both comparing group capture results to PCRE's and comparing libfsm's handling of individual regexes' DFAs with the DFA produced my combining them. Many of the test cases in tests/capture/capture_test_case_list.c came from fuzzing. My working branch diverged from main for a while during this work, in particular the metadata associated with end states, but after integrating upstream changes I fuzzed it further. I think I integrated things properly, but it's something to be mindful of during review. Some unrelated issues discovered during fuzzing have already been posted as separate PRs.

Worst-case memory usage is proportional to the size of the capvm program's length (known at compile time) and the input length -- we record a bit for each branch taken while evaluating the regex (e.g. either greedily jumping to repeat a subexpression or non-greedily advancing in the regex), so memory usage will slowly increase as inputs are evaluated. Common path prefixes are shared between threads, and the total thread count (and divergence) is bounded by the size of the opcode program. As a special case, a long path prefix of all zero bits is collapsed down to a counter; this usually happens because of an unanchored start loop, where each zero bit represents a path continuing to advance through the input without starting the match yet.

This PR targets an integration branch because some work is not yet complete:

silentbicycle commented 1 year ago

I've merged the current main into this and fuzzed it for another 20 minutes, on 8 CPU cores.