Open da-woods opened 3 months ago
Profiling didn't reveal too much. It's spending a large chunk of time in _visitchildren in TreeVisitor in Visitor.py, but that's not unexpected.
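As a rough illustration of why a child-walking method would dominate a profile, here is a minimal sketch of a tree visitor. The Node and Visitor classes are hypothetical and far simpler than Cython's actual TreeVisitor; the only point is that the child walk runs once per node, so it is expected to be hot.

```python
class Node:
    """Hypothetical tree node; child_attrs names the attributes holding children."""
    child_attrs = ["children"]

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


class Visitor:
    """Simplified visitor for illustration, not Cython's implementation."""

    def __init__(self):
        self.visited = []

    def visit(self, node):
        self.visited.append(node.name)
        self.visitchildren(node)

    def visitchildren(self, parent):
        # Runs once per node in the tree, so any per-call overhead here
        # (such as a slow child_attrs lookup) is multiplied by the tree size.
        for attr in parent.child_attrs:
            for child in getattr(parent, attr):
                self.visit(child)


tree = Node("root", [Node("a", [Node("a1")]), Node("b")])
visitor = Visitor()
visitor.visit(tree)
print(visitor.visited)
```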
There's one place where we use
child_attrs = property(fget=operator.attrgetter('subexprs'))
# instead of
# @property
# def child_attrs(self):
#     return self.subexprs
Changing that made things a bit faster, but not dramatically so. And that's as far as I got.
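For anyone who wants to compare the two spellings on their own interpreter, here is a small sketch using timeit. The WithAttrgetter and WithDef classes are hypothetical stand-ins rather than Cython's node classes, and the timings will differ between CPython, PyPy and GraalPython, so no expected numbers are given.

```python
import operator
import timeit


class WithAttrgetter:
    """Hypothetical class using the attrgetter-based property."""
    child_attrs = property(fget=operator.attrgetter("subexprs"))

    def __init__(self):
        self.subexprs = ["arg"]


class WithDef:
    """Same property written as a plain Python function."""

    def __init__(self):
        self.subexprs = ["arg"]

    @property
    def child_attrs(self):
        return self.subexprs


a, b = WithAttrgetter(), WithDef()
assert a.child_attrs == b.child_attrs  # both spellings return the same list

# Time repeated attribute access for each spelling.
for label, obj in (("attrgetter", a), ("def", b)):
    seconds = timeit.timeit(lambda: obj.child_attrs, number=100_000)
    print(f"{label}: {seconds:.4f}s")
```

Running the same script under each interpreter gives a like-for-like comparison of just this one lookup.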
GraalVM seems to have an option --cpusampler to produce profiles, including flame graphs. Maybe that can bring up some hints?
https://www.graalvm.org/latest/tools/profiling/
Yes, I gave those a quick go - they were what pointed out operator.attrgetter. That was the only thing that really stood out as unexpected.
I've attached some example output though.
I've improved things on our CI by turning off the JIT with the options --experimental-options --engine.Compilation=false, which seems to make things both faster and single-core.
But we're clearly doing something that doesn't agree with how GraalPython optimizes things.
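For reference, a sketch of how those options could be assembled when launching Cython from a script. The two flags are the ones quoted above; the graalpy launcher name, the cython.py entry point in a repository checkout, and the example.pyx input are assumptions for illustration only.

```python
# Flags quoted in the comment above; everything else here is an assumption.
NO_JIT_FLAGS = ["--experimental-options", "--engine.Compilation=false"]


def build_command(source_file, launcher="graalpy", disable_jit=True):
    """Return the argv list for running Cython's pure-Python entry point."""
    command = [launcher]
    if disable_jit:
        # Interpreter options go before the script name.
        command += NO_JIT_FLAGS
    # cython.py is assumed to sit in the current directory (a repo checkout).
    command += ["cython.py", source_file]
    return command


command = build_command("example.pyx")
print(" ".join(command))
```

The resulting list can be handed to subprocess.run on a machine where the launcher is installed.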
If turning off the JIT helps, then it sounds like a deoptimization loop bug (in graalpy). You're most likely doing nothing wrong (unless you're constantly generating new code and evaling it). I'll try to investigate.
Thanks. I don't think it's eval/exec - we use them but very infrequently and the parts they're in don't show up on the profile.
Quick warning - if you do pip install cython I think it will compile itself. This report is just about running it without compiling it. That's easiest to get just by cloning the git repo but NO_CYTHON_COMPILE=true pip install cython also works.
> if you do pip install cython I think it will compile itself
It should actually use the Python-any wheel that we distribute on PyPI, i.e. not try to build anything locally.
I've been working on getting GraalPython tested on the Cython CI. It mostly works but it's really slow.
One aspect of this is the time spent running Cython itself. Note that this is pure Python code (so it doesn't involve any interaction with your C API emulation, which I know isn't considered a fast path) - while Cython has the option of compiling itself for speed I haven't done so here for the sake of the report.
For the sake of a demo I've just checked out the cython repository from GitHub and done
that just runs cython on a bunch of its own files (but only to the c code generation stage, it doesn't invoke any C compilers).
Some results:
The upshot is that GraalPython is about 8 times slower than CPython (and also uses 3 cores of my CPU most of that time, while CPython is largely single-threaded).
I've included PyPy just as another data point. It's also slower for this case (although not quite as slow as GraalPython), so we're clearly doing something that isn't JIT-friendly.
I haven't done any profiling beyond this basic measurement (yet).
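A basic measurement of this kind can be sketched as follows: compare wall-clock time with the CPU time consumed by the child process, where a CPU-to-wall ratio well above 1 suggests more than one core was busy. The measure helper and the placeholder workload are assumptions for illustration and not the command used in this report.

```python
import os
import subprocess
import sys
import time


def measure(command):
    """Run command and return (wall_seconds, cpu_seconds) for the child process."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(command, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # children_user/children_system accumulate CPU time of waited-for children
    # (always 0 on Windows).
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, cpu


# Placeholder workload; substitute the interpreter and script being compared.
wall, cpu = measure([sys.executable, "-c", "sum(range(1_000_000))"])
print(f"wall={wall:.2f}s cpu={cpu:.2f}s ratio={cpu / wall:.2f}")
```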
I do realise this is essentially an enormous code-dump with the complaint "it's slow", which is never a style of bug report that I'm very impressed with when I'm on the receiving end.