Open woodruffw opened 3 years ago
I don't have much experience in this space. Perhaps we have a macro/configuration to switch between different regex implementations.
However, we need to ensure that the synthesiser/interpreter use the same regex implementation when configured accordingly.
We've currently got a variant of souffle using PCRE2. Might be worth doing a breakdown of alternatives or, as scholz suggested, cutting that corner of the system out to make it easy to swap/customise.
Yeah, here are the big alternatives that I'm aware of:
std::regex
implementation.std::regex
's problems.In general, being able to swap between engines would be nice -- my team's particular use case would benefit heavily from RE2's architecture, but we obviously don't want to hamper other's uses of Souffle by limiting the kinds of pattern expressions that are supported. I don't have a lot of cycles for this at the moment, but it's something I could take a stab at in the medium term via a configuration option/featute macro, as @b-scholz suggested.
Is regexes ever precompiled or are they always compiled at runtime in Souffle? I guess this should also play a role in which regex engine would be suitable. If I recall correctly re2 does not do any NFA -> DFA conversions. I would assume that std::regex does if the string is available at compile time. I'm not sure though.
my team's particular use case would benefit heavily from RE2's architecture
@woodruffw Just out of curiosity, would you be able to explain your use case a bit more?
I would assume that std::regex does if the string is available at compile time. I'm not sure though.
If only we were so lucky :slightly_smiling_face: -- Hana Dusíková has a proposal for compile-time evaluation for std::regex
, but it's not slated until C++23 at the earliest. The language of the proposal suggests that every current implementation of std::regex
is 100% runtime (if only by convenience, not necessarily by spec).
Just out of curiosity, would you be able to explain your use case a bit more?
Some members of my team are working on whole-program pointer analysis, including on programs with pathological search spaces (think hundreds of globally reachable large, multi-dimensional arrays).
Some of our Souffle rules include very basic regular expressions, where (sub)string matching doesn't quite cut it. Some of these pathological inputs actually cause std::regex
to segfault due to a stack overflow, since std::regex
's implementation is naively recursive. RE2 doesn't have this problem thanks to its FSM.
If only we were so lucky -- Hana Dusíková has a proposal for compile-time evaluation for
std::regex
, but it's not slated until C++23 at the earliest. The language of the proposal suggests that every current implementation ofstd::regex
is 100% runtime (if only by convenience, not necessarily by spec).
Yikes. Seems like I really overestimated the state of std::regex
.
Some of our Souffle rules include very basic regular expressions, where (sub)string matching doesn't quite cut it. Some of these pathological inputs actually cause
std::regex
to segfault due to a stack overflow, sincestd::regex
's implementation is naively recursive. RE2 doesn't have this problem thanks to its FSM.
Interesting. Thanks!
Is regexes ever precompiled or are they always compiled at runtime in Souffle?
Ideally we'd have precompiled for literals + an LRU cache for dynamic expressions. Our fork has a hacky wrapper around SymTbl that memoises PCRE2 compiled expressions. Works well enough for our uses (no dynamic regexs) but it'd be nice to be able to evict unused FSMs/JITs.
an LRU cache for dynamic expressions
Yeah, this is another source of performance woes with the current implementation -- std::regex
objects are initialized on demand, which might mean (tens of) millions of identical allocations for a rule that's hit frequently.
Another alternative is to implement a functor library with external functors for regex.
The current functor interface won't really do. We've extended the regex intrinsics full atoms and support iteration over capture groups (e.g. a(x, i) :- regex_capture(" a bc d ", "\\W(\\w+)\\W", i, offset).
-> ("a", 1), ("bc", 2), ("d", 3)
). Are there others who'd see serious use for a full FFI-for-atoms-with-optional-indexing?
Hi there!
I noticed that the C++ synthesized by Souffle uses
std::regex
, which can stack overflow (at least withlibstdc++
's implementation) due to excessive backtracking:https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61582
Here's the relevant synthesis fragment:
https://github.com/souffle-lang/souffle/blob/d9209ddda1b831659494f015638956dfb618ac98/src/synthesiser/Synthesiser.cpp#L2409-L2417
I'm not familiar with Souffle's regex engine requirements off the top of my head but, assuming that they don't include lots of extended (i.e., not purely regular) functionality, it might be possible to swap a different regex engine into
regex_wrapper
based on a build-time option. If so, re2 is a reasonable candidate (it doesn't perform backtracking and is well-tested).