lefticus / cpp_weekly

The official C++ Weekly repository. Code samples and notes for future/past episodes will land here at various times. PRs will be accepted in some cases.

C++ vs Python vs JIT'd Python vs Python + C++ #223

Closed: lefticus closed this issue 1 year ago

lefticus commented 1 year ago

Examples: https://github.com/lefticus/cpp_weekly/tree/master/python/conway_game_of_life

jpivarski commented 1 year ago

There were some comments on today's video about Numba, since Numba fills the JIT space between PyPy and cppyy: it uses Python syntax yet is statically typed.

I'm a big Numba fan, so I was nerd-sniped into providing a demonstration for the same problem that you used to demonstrate C++, cppyy, CPython, and PyPy. It's also what I'd consider more idiomatic Python code. (The approaches based on lookup tables and on manually indexing a 2D array are not things I'd consider "normal" in Python.)

My Numba version is: https://gist.github.com/jpivarski/63ef88be8bd41d9ed24971b86824d189 (Short story: 115× speedup by adding a one-line decorator!)
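For anyone skimming, here's a rough sketch of what that one-line change looks like. This is not the gist's code; the step function below (a toroidal-neighborhood Game of Life update) is my own minimal stand-in:

```python
import numba as nb
import numpy as np

@nb.njit  # the one-line change: compile this function with Numba
def life_step(grid):
    h, w = grid.shape
    out = np.zeros_like(grid)
    for i in range(h):
        for j in range(w):
            # count live neighbors, wrapping around the edges
            n = 0
            for di in range(-1, 2):
                for dj in range(-1, 2):
                    if di == 0 and dj == 0:
                        continue
                    n += grid[(i + di) % h, (j + dj) % w]
            out[i, j] = 1 if n == 3 or (n == 2 and grid[i, j] == 1) else 0
    return out
```

Remove the decorator and the same function runs unchanged in plain CPython, just much more slowly.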

But the comment keeps getting deleted from YouTube. I don't think I'm breaking any rules—even a version of the comment without a URL gets deleted. Is that on purpose? Do you know about that?

Arguably, I should consider the time I spent on it (an hour) a sunk cost and not keep checking back, but this is a form of procrastination. I'm supposed to be doing some administrative work right now, but writing and rewriting Conway's Game of Life is much more fun.

lefticus commented 1 year ago

I definitely do not delete comments on YouTube unless they are clearly abuse or spam (and sometimes I leave the more innocent spam, like links to other channels). So that is all YouTube's doing.

I'll add a link back to your Numba version. It's interesting that it would be that much faster than PyPy as well. I'm clearly just learning much of this myself.

wlav commented 1 year ago

@jpivarski, in your example you take the JITing out of the main timed loop, which is different from how the scripts are run in the video. Doing so shows a greater speedup for all JITs.

And yes, I need to get C++ classes in Numba working. ;)

As for PyPy/cppyy failing to compile the C++ code: @lefticus, can you post which backend was used? If using pip, what does `python -m pip show cppyy-cling` give you? (Unfortunately, I've been lacking the time to keep PyPy/_cppyy up to date, and it could very well be that you have LLVM5 (!) there.)

lefticus commented 1 year ago

`python -m pip show cppyy-cling`:

[screenshot of the command's output]

jpivarski commented 1 year ago

> I definitely do not delete comments on YouTube unless they are clearly abuse or spam

In that case, it must have been an automated YouTube algorithm. Maybe it thought I edited my comment too many times, and then I got on its bad list or something.

> It's interesting that it would be that much faster than PyPy as well.

There's a 3-way trade-off among JIT compilers in the scientific Python world; that's how I like to describe it. You can get 2 out of 3 of:

- writing the hot code in Python,
- using the whole Python language, and
- running at the speed of C.

> @jpivarski, in your example you take the JITing out of the main timed loop, which is different from how the scripts are run in the video. Doing so shows a greater speedup for all JITs.

In that case, the Numba version is 2.4 seconds = 2.2 seconds of compilation + 0.2 seconds of running.
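A quick way to see that split, using the `life_step` sketch from earlier (the timing harness here is my own illustration): the first call pays for compilation, the warm calls don't.

```python
import time
import numpy as np

grid = np.random.randint(0, 2, (512, 512)).astype(np.int64)

t0 = time.perf_counter()
grid = life_step(grid)        # first call: JIT compilation + one step
t1 = time.perf_counter()
for _ in range(100):
    grid = life_step(grid)    # warm calls: compiled code only
t2 = time.perf_counter()

print(f"first call (compile + run): {t1 - t0:.3f} s")
print(f"100 warm steps:             {t2 - t1:.3f} s")
```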

wlav commented 1 year ago

Hmm, that's LLVM9 in both cases.

I just tried and I do not see the C++ error, so it's a bit of a puzzle then. :)

I do get the same Python error, which is because the CPython version of cppyy automatically cooks up constructors for aggregates, whereas the PyPy version does not. So I modified the code to use a little helper:

```python
def create_point(x, y):
    # PyPy's cppyy doesn't synthesize aggregate constructors,
    # so build the Point and assign its fields by hand
    p = cppyy.gbl.Automata.Point()
    p.x = x
    p.y = y
    return p
```

and then it ran fine for me.
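For a self-contained picture of the difference (the `Automata::Point` definition below is my own guess at the aggregate's shape, just so the snippet runs stand-alone):

```python
import cppyy

# a plain C++ aggregate, like the one in the video's code (definition guessed)
cppyy.cppdef("namespace Automata { struct Point { int x; int y; }; }")

p = cppyy.gbl.Automata.Point(3, 4)  # CPython's cppyy cooks up this constructor;
                                    # PyPy's _cppyy does not, and errors out

q = create_point(3, 4)              # the helper above works on both
print(q.x, q.y)                     # 3 4
```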

The improvement is marginal, however: only 2%. This should be no surprise: if you see a significant speedup in CPython, then the Python code is not the bottleneck, and that is the only part that PyPy can improve on (well, that and the call overhead of crossing the language barrier).

@jpivarski PyPy does not compile the whole language; it can fall back to interpretation for anything for which it does not have specializations. Unfortunately, this also means that unless you're dealing with numeric code (which is well understood and thus easy to JIT), even small modifications can seriously slow down a PyPy run.

jpivarski commented 1 year ago

What I meant by "whole language" is that PyPy doesn't give up/raise an error when you give it valid Python that it can't optimize—it implements the whole language.

Maybe in some tight loop, PyPy can prove that x will always be an integer, and it will optimize that path like a hotspot JIT-compiler, but—exactly what you're saying—if obj is an object and it can't always prove that obj.x will be an integer, it has to fall back to a less optimized path.

This is a trade-off that the Numba group had to make (between @nb.jit(nopython=False) and @nb.jit(nopython=True)): either raise errors, thereby informing users that they're not getting what they want, or try to run anyway, but more slowly. They're deciding more in favor of giving up early (@nb.jit(nopython=False) is being deprecated).
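To make the trade-off concrete, here's a minimal sketch of the two modes (the `re` call is just a stand-in for Python that nopython mode can't type):

```python
import re
import numba as nb

@nb.njit                  # nopython mode: give up early
def strict(s):
    return len(re.findall(r"\w+", s))

# strict("a b c")  -> raises numba.core.errors.TypingError at compile time,
#                     because the re module isn't supported in nopython mode

@nb.jit(forceobj=True)    # object mode: run anyway, but at interpreter speed
def loose(s):
    return len(re.findall(r"\w+", s))

print(loose("a b c"))  # 3
```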

wlav commented 1 year ago

One minor point there: PyPy does not need to prove everything in a trace; it has guards at the start of a trace and on each predicate. So, e.g., if the trace depends on obj.x being an integer type, that will be checked going into the trace, then assumed inside it. If it isn't an integer, normal interpretation resumes until the new trace for the other type turns hot. This allows PyPy to cover a larger set of cases than Numba. E.g., a trace will remove all branches that are not taken, and those branches can be full of the most dynamic of Python (e.g. for error handling/recovery).

In the specific case of cppyy, it's even better, as obj.x will have a fixed type and offset if obj is a C++ instance, in which case PyPy can drop the guards: it only needs to verify that obj is still the expected C++ type when the trace is re-entered.

And yes, your third point is what I was referring to earlier. To expand on my example: if code from a branch that is not taken in a trace is moved into a branch that is taken (not a crazy thing to do), you can all of a sudden get a slowdown, without any warning.
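A sketch of the kind of change meant here (both functions and all names are made-up illustrations, not code from the video):

```python
def recover(c):
    # stand-in for highly dynamic Python (string handling, exceptions, ...)
    return int(str(abs(c)))

def step_fast(cells):
    total = 0
    for c in cells:
        if c >= 0:
            total += c           # hot, numeric: traces and JITs well
        else:
            total += recover(c)  # branch never taken: dropped from the trace
    return total

def step_slow(cells):
    total = 0
    for c in cells:
        # the dynamic call is now on the taken path, so every iteration
        # of the trace must handle it: the loop quietly slows down
        total += recover(c) if c >= 0 else -c
    return total
```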

ebarlas commented 1 year ago

Was Cython considered for the comparison? It seems like it would fit right in with the spectrum highlighted in the video.

https://cython.org/

lefticus commented 1 year ago

> Was Cython considered for the comparison? It seems like it would fit right in with the spectrum highlighted in the video.
>
> https://cython.org/

It was not - largely because I'm a noob to this.

lefticus commented 1 year ago

Closing since the episode has aired, but feel free to still discuss the topic here.

wlav commented 1 year ago

Cython is a pseudo-Python language for writing C extension modules; as part of the transliteration down to C (plus a smidgen of C++ code), it generates Python bindings, too. But the result is then compiled and linked normally; it is not a JIT, and as such it serves a different purpose. Since it can also export the generated C code as functions with external linkage, Cython modules can be used quite cleanly with Numba.

Aside: there's a discussion in the YouTube comments that went a bit off the rails (which is why I have no interest in wading into it), but here's a paper that shows the value of a JIT for performance reasons, which should settle the argument: https://arxiv.org/pdf/1904.08555.pdf. In short, there are highly optimized templated C++ libraries (Eigen, CUTLASS, MatX, etc.) that work best when input types are known, and their fixed-size implementations easily outperform the generic versions. If the data sizes aren't known until run time, the JIT will smoke the AoT compiler. The results of that paper (which tweaked Clang) are easily reproduced with cppyy/Cling. Note also that for many HPC codes it is common to do a bit of profiling on startup to pick optimal algorithms (e.g. for all-to-all communication, which is strongly affected by the architecture), so a bit of JITing at startup is a perfectly acceptable approach in such an environment.
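As a tiny cppyy/Cling illustration of that point (my own sketch, not the paper's benchmark): the template below is instantiated at run time for a size that only becomes known then, so the compiler sees N as a constant and can fully unroll/vectorize the loop.

```python
import array
import cppyy

cppyy.cppdef(r"""
template <int N>
double dot(const double* a, const double* b) {
    double s = 0.0;
    for (int i = 0; i < N; ++i)   // N is a compile-time constant here
        s += a[i] * b[i];
    return s;
}
""")

n = 4  # pretend this only arrives at run time (user input, data file, ...)
a = array.array("d", [1.0, 2.0, 3.0, 4.0])
b = array.array("d", [5.0, 6.0, 7.0, 8.0])

# Cling instantiates and JIT-compiles dot<4> on first use
print(cppyy.gbl.dot[n](a, b))  # 70.0
```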

The alternative is to compile all likely sizes and data types offline into a fat library. This is done e.g. in cuBLAS, but can lead to serious performance bugs if a type/size combo happens to not be available.