re is much slower than cpython

oracle / graalpython

A high-performance embeddable Python 3 runtime for Java

https://www.graalvm.org/python/

Other

1.24k stars 107 forks source link

re is much slower than cpython #445

Open masklinn opened 1 day ago

masklinn commented 1 day ago

I've been adding graal support to a classifier type project naively based on applying a bunch of regexes to an input, and while Graal works the regex application is quite slow: it's about 4x slower than cpython, while using 4 times the CPU.

Here's a repro script and attending data (basically a cut down version of the naive classifier implementation): script.zip

timings:

> python3.12 --version
Python 3.12.6
> time python3.12 run.py
75158 lines in 11.7s
    156.1 us/line
python3.12 run.py  11.77s user 0.05s system 97% cpu 12.172 total
> graalpy --version
GraalPy 3.11.7 (Oracle GraalVM Native 24.1.0)
> time graalpy run.py
75158 lines in 48.2s
    640.7 us/line
graalpy run.py  192.00s user 1.00s system 394% cpu 48.976 total

This is on a 10-core M1 Pro. Using cpusampler I confirmed that essentially all the "user" time is in _sre:

 _search   ||  43580ms  98.6% ||   43580ms  98.6% || <frozen graalpy._sre>~1:0

msimacek commented 1 day ago

Regexes are their own compilation unit, their DFAs are JIT compiled and optimized separately just like python functions. You have 628 regexes, which takes a long time to compile. I got a better time (still worse than CPython) with the JIT compilation for regexes turned off (--experimental-options --engine.CompileOnly='~tregex re'), so it seems that the JIT compilation is too slow for regexes to pay off. I'll ask our regex experts if we can do something about it.

msimacek commented 1 day ago

So I talked to one of the devs of our regex engine. The main problem is that your regexes use a lot of x{n,m} patterns, with large m. Those currently don't allow generating a resonable DFA, so the regex engine has to use a slower fallback execution model. There's currently a person working on supporting these patterns in the faster execution model, so it should get better in the next release.

masklinn commented 1 day ago

Ah I see, I didn't know graal used a dfa internally, that explains things. I've had the same issue in rust as regex is also automata-based, and the way the regexes are written in the core data file also caused trouble (mostly memory, runtime did suffer a bit but not to the same extent).

I improved things by transforming the bounded repetitions back into unbounded before compilation, I'll see if I can do that for graal.

Although from my understanding the source dataset did that to limit risks of catastrophic backtracking in backtracking regex engines (like cpython's own), is there a flag exposed somewhere which indicates whether a regex uses backtracking or finite automata, to ensure I only perform rewriting when using a DFA?

msimacek commented 1 day ago

Although from my understanding the source dataset did that to limit risks of catastrophic backtracking in backtracking regex engines (like cpython's own), is there a flag exposed somewhere which indicates whether a regex uses backtracking or finite automata, to ensure I only perform rewriting when using a DFA?

Currently, no. Our regex engine seems to have a property for that, but we currently don't expose it on the Python Pattern object. (_sre.tregex_compile(pattern, _sre._METHOD_SEARCH, False).isBacktracking works, but it's a total hack that might break any time)

masklinn commented 1 day ago

Currently, no. Our regex engine seems to have a property for that, but we currently don't expose it on the Python Pattern object. (_sre.tregex_compile(pattern, _sre._METHOD_SEARCH, False).isBacktracking works, but it's a total hack that might break any time)

OK I'll go with an implementation check then at least for the time being (assuming the rewriting plan does good).