Open masklinn opened 1 day ago
Regexes are their own compilation unit; their DFAs are JIT-compiled and optimized separately, just like Python functions. You have 628 regexes, which take a long time to compile. I got a better time (still worse than CPython) with JIT compilation for regexes turned off (`--experimental-options --engine.CompileOnly='~tregex re'`), so it seems that JIT compilation is too slow for regexes to pay off. I'll ask our regex experts if we can do something about it.
So I talked to one of the devs of our regex engine. The main problem is that your regexes use a lot of `x{n,m}` patterns with large `m`. Those currently don't allow generating a reasonable DFA, so the regex engine has to use a slower fallback execution model. There's currently a person working on supporting these patterns in the faster execution model, so it should get better in the next release.
Ah I see, I didn't know Graal used a DFA internally, that explains things. I've had the same issue in Rust, as `regex` is also automata-based, and the way the regexes are written in the core data file also caused trouble (mostly memory; runtime did suffer a bit, but not to the same extent).
I improved things by transforming the bounded repetitions back into unbounded before compilation, I'll see if I can do that for graal.
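For illustration, that kind of rewrite could look roughly like the sketch below. This is not the code used in the project: the function name `relax_bounded_repetitions` and the `threshold` parameter are hypothetical, and the approach is purely textual, so it would also touch a `{n,m}` that appears literally inside a character class. Note that the rewritten pattern accepts longer inputs than the original, which is the intended trade-off here.

```python
import re

# Matches a bounded quantifier such as {0,100} or {2,300} and captures both bounds.
_BOUNDED = re.compile(r"\{(\d+),(\d+)\}")


def relax_bounded_repetitions(pattern: str, threshold: int = 50) -> str:
    """Replace {n,m} with an unbounded quantifier when m >= threshold.

    Hypothetical helper: name and threshold are assumptions for this sketch.
    """

    def replace(match: re.Match) -> str:
        lower, upper = int(match.group(1)), int(match.group(2))
        if upper < threshold:
            return match.group(0)  # small bound: leave the quantifier alone
        if lower == 0:
            return "*"
        if lower == 1:
            return "+"
        return "{%d,}" % lower  # keep the lower bound, drop the upper one

    return _BOUNDED.sub(replace, pattern)
```

For example, `relax_bounded_repetitions(r"[a-z]{1,200}foo")` gives `[a-z]+foo`, while a pattern such as `x{0,3}` stays unchanged because its upper bound is below the threshold.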
Although from my understanding the source dataset did that to limit risks of catastrophic backtracking in backtracking regex engines (like cpython's own), is there a flag exposed somewhere which indicates whether a regex uses backtracking or finite automata, to ensure I only perform rewriting when using a DFA?
Currently, no. Our regex engine seems to have a property for that, but we currently don't expose it on the Python Pattern object. (`_sre.tregex_compile(pattern, _sre._METHOD_SEARCH, False).isBacktracking` works, but it's a total hack that might break any time.)
OK, I'll go with an implementation check then, at least for the time being (assuming the rewriting plan works out).
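A minimal sketch of such an implementation check, using the standard `sys.implementation.name`. It assumes the name reported under Graal starts with `graal`, which should be verified against the release in use; `is_graal` and `prepare_pattern` are hypothetical names, and `rewrite` stands for whatever pattern transformation is chosen.

```python
import sys
from typing import Callable


def is_graal(impl_name: str = sys.implementation.name) -> bool:
    # Assumption: the Graal Python implementation reports a name starting
    # with "graal"; CPython reports "cpython", PyPy reports "pypy".
    return impl_name.lower().startswith("graal")


def prepare_pattern(
    pattern: str,
    rewrite: Callable[[str], str],
    impl_name: str = sys.implementation.name,
) -> str:
    """Apply `rewrite` only on Graal, leaving the pattern untouched on
    backtracking engines, where the bounded repetitions act as a safeguard."""
    return rewrite(pattern) if is_graal(impl_name) else pattern
```

The `impl_name` parameter exists only so the behaviour can be exercised on any interpreter; normal callers would leave it at its default.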
I've been adding Graal support to a classifier-type project naively based on applying a bunch of regexes to an input, and while Graal works, the regex application is quite slow: it's about 4x slower than CPython, while using 4 times the CPU.
Here's a repro script and attending data (basically a cut down version of the naive classifier implementation): script.zip
timings:
This is on a 10-core M1 Pro. Using cpusampler I confirmed that essentially all the "user" time is in `_sre`: