oracle / graalpython

GraalPy – A high-performance embeddable Python 3 runtime for Java
https://www.graalvm.org/python/

[performance] GraalPython is slow when running Cython #411

Open da-woods opened 3 months ago

da-woods commented 3 months ago

I've been working on getting GraalPython tested on the Cython CI. It mostly works but it's really slow.

One aspect of this is the time spent running Cython itself. Note that this is pure Python code, so it doesn't involve any interaction with your C API emulation (which I know isn't considered a fast path). Cython has the option of compiling itself for speed, but I haven't done so here for the sake of the report.

For the demo I just checked out the Cython repository from GitHub and ran:

time python cython.py Cython/Compiler/*.py

That just runs Cython on a bunch of its own files (but only up to the C code generation stage; it doesn't invoke any C compilers).

Some results:

Python 3.11.9
-----------
real    1m3.896s
user    0m55.934s
sys     0m4.580s

GraalPython (from the file "graalpy-24.0.2-linux-amd64.tar.gz" from your releases page)
Python 3.10.13 (Thu Jul 04 12:42:45 UTC 2024)
[Graal, Oracle GraalVM, Java 22.0.2] on linux
--------------
real    8m2.008s
user    21m20.609s
sys     0m19.100s

PyPy (pypy3.10-v7.3.12-linux64)
---------------------------------------------
real    4m18.502s
user    4m10.389s
sys     0m0.938s

The upshot is that GraalPython is about 8 times slower than CPython (and also uses about 3 cores of my CPU most of that time, while CPython is largely single-threaded).
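As a rough check on those figures, the sketch below recomputes the ratios from the `time` output quoted above. The helper names (`to_seconds`, `slowdown`, `cores_used`) are made up for illustration; only the timings come from the report. It gives roughly a 7.5x wall-clock slowdown for GraalPython and roughly 2.7 cores kept busy on average.

```python
# Recompute slowdown and average core usage from the `time` output above.
# All helper names here are illustrative, not part of any tool.

def to_seconds(minutes, seconds):
    """Convert an XmY.YYYs value from `time` into seconds."""
    return minutes * 60 + seconds

timings = {
    "CPython 3.11.9": {
        "real": to_seconds(1, 3.896),
        "user": to_seconds(0, 55.934),
        "sys": to_seconds(0, 4.580),
    },
    "GraalPy 24.0.2": {
        "real": to_seconds(8, 2.008),
        "user": to_seconds(21, 20.609),
        "sys": to_seconds(0, 19.100),
    },
    "PyPy 7.3.12": {
        "real": to_seconds(4, 18.502),
        "user": to_seconds(4, 10.389),
        "sys": to_seconds(0, 0.938),
    },
}

BASELINE = "CPython 3.11.9"

def slowdown(name):
    """Wall-clock time relative to the CPython run."""
    return timings[name]["real"] / timings[BASELINE]["real"]

def cores_used(name):
    """Average number of busy cores: CPU time divided by wall-clock time."""
    t = timings[name]
    return (t["user"] + t["sys"]) / t["real"]

for name in timings:
    print(f"{name}: {slowdown(name):.2f}x wall clock, "
          f"{cores_used(name):.2f} cores on average")
```
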

I've included PyPy just as another data point. It's also slower for this case (although not quite as slow as GraalPython), so we're clearly doing something that isn't JIT-friendly.

I haven't done any profiling beyond this basic measurement (yet).


I do realise this is essentially an enormous code-dump with the complaint "it's slow", which is never a style of bug report that I'm very impressed with when I'm on the receiving end.

da-woods commented 3 months ago

Profiling didn't reveal too much. It's spending a large chunk of time in _visitchildren in TreeVisitor in Visitor.py, but that's not unexpected.

There's one place where we use

child_attrs = property(fget=operator.attrgetter('subexprs'))
# instead of
# @property
# def child_attrs(self):
#     return self.subexprs

Changing that made things a bit faster, but not dramatically so. And that's as far as I got.
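For anyone who wants to compare the two spellings in isolation, here is a minimal sketch. The two node classes are hypothetical stand-ins, not the actual Cython classes; only the `child_attrs` / `subexprs` names come from the snippet above. Both forms return the same value, so any timing difference comes from how the runtime handles the `operator.attrgetter` call versus a plain Python function.

```python
import operator
import timeit

class NodeWithAttrgetter:
    """Hypothetical node using the attrgetter-based property."""
    def __init__(self, subexprs):
        self.subexprs = subexprs

    child_attrs = property(fget=operator.attrgetter("subexprs"))

class NodeWithDecorator:
    """Hypothetical node using a plain @property method."""
    def __init__(self, subexprs):
        self.subexprs = subexprs

    @property
    def child_attrs(self):
        return self.subexprs

def time_access(node, number=100_000):
    """Time repeated reads of node.child_attrs, in seconds."""
    return timeit.timeit(lambda: node.child_attrs, number=number)

for cls in (NodeWithAttrgetter, NodeWithDecorator):
    node = cls(["left", "right"])
    print(f"{cls.__name__}: {time_access(node):.4f}s")
```

A microbenchmark like this says little about a warmed-up JIT on a large program, so treat the numbers as a hint at best.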

scoder commented 3 months ago

GraalVM seems to have an option --cpusampler to produce profiles, including flame graphs. Maybe that can bring up some hints? https://www.graalvm.org/latest/tools/profiling/

da-woods commented 3 months ago

> GraalVM seems to have an option --cpusampler to produce profiles, including flame graphs. Maybe that can bring up some hints?

Yes I gave those a quick go - they were what pointed out operator.attrgetter. That was the only thing that really stood out as unexpected.

I've attached some example output though

graalcpusample.txt flamegraph.svg

da-woods commented 3 months ago

I've improved things on our CI by turning off the JIT with the options --experimental-options --engine.Compilation=false, which seems to make things both faster and single-core.

But we're clearly doing something that doesn't agree with how GraalPython optimizes things.
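To reproduce the comparison with and without the JIT, something like the sketch below could work. It assumes the launcher is called `graalpy` and is on the PATH, and that it is run from a Cython checkout; `build_command` and `time_command` are hypothetical helpers. The two flags are the ones quoted above.

```python
import subprocess
import time

# Flags quoted in the comment above; they disable JIT compilation.
JIT_OFF_FLAGS = ["--experimental-options", "--engine.Compilation=false"]

def build_command(interpreter, sources, runtime_flags=()):
    """Build the argument list for running cython.py on the given sources."""
    return [interpreter, *runtime_flags, "cython.py", *sources]

def time_command(command, cwd=None):
    """Run a command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(command, cwd=cwd, check=True)
    return time.perf_counter() - start

# Example (not executed here; "graalpy" as the launcher name is an assumption):
#   import glob
#   sources = sorted(glob.glob("Cython/Compiler/*.py"))
#   with_jit = time_command(build_command("graalpy", sources))
#   without_jit = time_command(build_command("graalpy", sources, JIT_OFF_FLAGS))
```

The file list is expanded with `glob` because `subprocess.run` with an argument list does not go through a shell, so `*.py` would not be expanded otherwise.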

msimacek commented 3 months ago

If turning off the JIT helps, then it sounds like a deoptimization loop bug (in graalpy). You're most likely doing nothing wrong (unless you're constantly generating new code and evaling it). I'll try to investigate.

da-woods commented 3 months ago

Thanks. I don't think it's eval/exec - we use them but very infrequently and the parts they're in don't show up on the profile.

Quick warning - if you do pip install cython I think it will compile itself. This report is just about running it without compiling it. That's easiest to get just by cloning the git repo but NO_CYTHON_COMPILE=true pip install cython also works.

scoder commented 3 months ago

> if you do pip install cython I think it will compile itself

It should actually use the Python-any wheel that we distribute on PyPI, i.e. not try to build anything locally.