
Improving JIT code quality #115802

Open brandtbucher opened 4 months ago

brandtbucher commented 4 months ago

We have a JIT compiler, but the quality of the generated templates needs improvement. Here are the areas I plan to work on (note that these strictly pertain to machine code quality, not to improvements to tier two in general):

- Stop using the "large" code model.
- Remove zero-length jumps when tail-calling between templates.
- Use LLVM's ghccc/preserve_none calling convention when tail-calling between templates.
- Split hot and cold parts of the templates.
- Cache top-of-stack values in registers when tail-calling between templates.

The nice thing about these improvements is that they compound. For example, getting rid of 64-bit register-indirect jumps (the "large" code model) in the first step makes all of the following improvements even more potent.

Stop using the "large" code model.

Since our jitted code is mapped into a 64-bit address space, we're not able to compile our templates using the default "small" code model, which assumes a 31-bit address space (and can thus use 32-bit relative addressing). This forces the templates into convoluted code patterns that treat every relocation as a full 64-bit value, typically using two large instructions (load the wide value into a register, then actually use it) where only one should be necessary:

Example of current code.
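
The example image from the original post isn't reproduced here, so as a hedged illustration (these are illustrative encodings, not CPython's actual stencil bytes), the pattern for a patched jump looks roughly like this:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only: under the "large" code model, jumping to a
 * patched 64-bit continuation address takes a 10-byte movabs plus a
 * 2-byte register-indirect jump -- 12 bytes where a single 5-byte
 * "jmp rel32" would otherwise do. */
static size_t
emit_large_model_jump(uint8_t *p, uint64_t target)
{
    p[0] = 0x48;                             /* REX.W prefix */
    p[1] = 0xB8;                             /* movabs rax, imm64 */
    memcpy(p + 2, &target, sizeof(target));  /* 64-bit relocation */
    p[10] = 0xFF;                            /* jmp rax */
    p[11] = 0xE0;
    return 12;
}
```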

Unfortunately, many of the values we patch fit comfortably in a signed 32-bit immediate, meaning that this code bloat is pure, unnecessary overhead in critical parts of the code.

What we can do instead is generate position-independent code, which is already the default for most platforms. On the surface, this seems less efficient, since it means that every value we patch will be loaded indirectly from memory (even though the code itself is smaller):

Example of position-independent code.

However, notice that the emitted relocations are relaxable. This means that if the oparg value and continuation address are small enough, we can (quite easily) relax mov eax, dword ptr [rip + AAA] into lea eax, [rip + XXX] and jmp qword ptr [rip + BBB] into jmp YYY; nop. This turns both memory loads into signed 32-bit PC-relative immediates and gives us good-quality code:

8D 05 XX XX XX XX lea eax, [rip + XXXXXXXX]
01 07             add dword ptr [rdi], eax
E9 YY YY YY YY    jmp YYYYYYYY
90                nop
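
As a minimal sketch of what that patch-time relaxation could look like (hypothetical helper names, not the JIT's actual patching code): the mov-to-lea rewrite is a one-byte opcode change, and because the destination is the 32-bit eax, the rip-relative computation wraps modulo 2^32, so any 32-bit value is reachable; the jump rewrite needs a range check.

```c
#include <stdint.h>
#include <string.h>

/* Relax "mov eax, dword ptr [rip + got]" (8B 05 disp32) into
 * "lea eax, [rip + disp32]" (8D 05 disp32): same length, no load. */
static void
relax_mov_to_lea(uint8_t *insn, uint32_t value)
{
    uint32_t next = (uint32_t)(uintptr_t)(insn + 6);  /* rip after insn */
    int32_t disp = (int32_t)(value - next);           /* wraps mod 2^32 */
    insn[0] = 0x8D;                                   /* mov -> lea */
    memcpy(insn + 2, &disp, sizeof(disp));
}

/* Relax "jmp qword ptr [rip + got]" (FF 25 disp32) into
 * "jmp rel32; nop" (E9 rel32 90) when the target is within +/-2 GiB. */
static int
relax_jmp(uint8_t *insn, uint64_t target)
{
    int64_t rel = (int64_t)(target - ((uint64_t)(uintptr_t)insn + 5));
    if (rel < INT32_MIN || rel > INT32_MAX) {
        return 0;                    /* too far: keep the indirect jump */
    }
    int32_t rel32 = (int32_t)rel;
    insn[0] = 0xE9;                  /* direct rel32 jump */
    memcpy(insn + 1, &rel32, sizeof(rel32));
    insn[5] = 0x90;                  /* pad the leftover byte with nop */
    return 1;
}
```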

Since macOS JIT builds already use position-independent code, this isn't very difficult to implement. The two main "new" bits are enabling position-independent code on the remaining platforms and performing the instruction relaxations when patching.

(As a bonus, we can probably use something similar to PLT stubs to relax all jumps and calls without too much fuss, since we already emit these sorts of trampolines for aarch64-unknown-linux-gnu builds. I haven't looked into this too closely to see what it will win us performance-wise.)

Benchmarking runs suggest that this change will give us a ~1-3% performance improvement, depending on platform.

Remove zero-length jumps when tail-calling between templates.

This is easier once we've performed the above step, since we no longer need to worry about two-instruction register-indirect jumps (which are tricky to elide correctly). In the above code, we can trivially remove the jmp and nop instructions, giving us:

8D 05 XX XX XX XX lea eax, [rip + XXXXXXXX]
01 07             add dword ptr [rdi], eax
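
A sketch of how the elision itself could work at copy time, with hypothetical names (not the JIT's real stencil-copying code): when a template's continuation is the template laid out immediately after it, the trailing jmp/nop pair is simply not copied.

```c
#include <stdint.h>
#include <string.h>

/* Drop a trailing "jmp rel32; nop" (6 bytes) when the patched jump
 * target would be the address immediately after the template, i.e.
 * the jump would have zero length. */
static size_t
copy_template(uint8_t *dst, const uint8_t *body, size_t size,
              const uint8_t *continuation)
{
    const size_t TAIL = 6;              /* E9 rel32 + 90 */
    if (continuation == dst + size) {   /* execution falls through anyway */
        size -= TAIL;                   /* elide the zero-length jump */
    }
    memcpy(dst, body, size);
    return size;
}
```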

Benchmarking suggests that this change will give us an additional ~1-3% performance improvement.

Use LLVM's ghccc/preserve_none calling convention when tail-calling between templates.

LLVM has cool calling conventions that have no callee-saved registers, and pass as many arguments in registers as possible. This is super efficient for our tail calls, and is a big part of the original copy-and-patch paper.

Clang 19 will ship with support for the __attribute__((preserve_none)) calling convention, but not until much later this year. In the meantime, hacks exist: we can drop briefly into LLVM IR at build time and change the convention manually.
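
For illustration, this is roughly what the templates could look like at the C level once the attribute ships (a sketch using hypothetical, CPython-flavored names and signatures, combined with Clang's musttail attribute so the continuation compiles down to a bare jump):

```c
typedef struct _PyInterpreterFrame _PyInterpreterFrame;
typedef struct _object PyObject;
typedef struct _ts PyThreadState;

/* With no callee-saved registers and all arguments in registers, the
 * tail call between templates needs no prologue, epilogue, or spills. */
__attribute__((preserve_none)) _PyInterpreterFrame *
_JIT_CONTINUE(_PyInterpreterFrame *frame, PyObject **stack_pointer,
              PyThreadState *tstate);

__attribute__((preserve_none)) _PyInterpreterFrame *
_JIT_ENTRY(_PyInterpreterFrame *frame, PyObject **stack_pointer,
           PyThreadState *tstate)
{
    /* ...the instruction body goes here... */
    __attribute__((musttail))
    return _JIT_CONTINUE(frame, stack_pointer, tstate);
}
```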

Benchmarking suggests that this change will give us an additional ~0-6% performance improvement (no wins on aarch64, moderate wins on x86_64, and big wins on i686).

Split hot and cold parts of the templates.

Our stencils contain a lot of error-handling and deoptimization code. We can make the hot code more compact and linear by "jumping away" to shared cold stubs for these paths (and these jumps will be more efficient with the above improvements).

This isn't too difficult, since we already have two "segments" of jitted code (executable code and read-only data). This just adds a third.
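
As a rough sketch of the shape (hypothetical names, not the real templates): each guard becomes a single forward branch into an out-of-line stub, so the hot path stays compact and straight-line.

```c
#include <stdint.h>

/* Hypothetical shared stub; in the JIT this would live in the new
 * cold segment and be shared by many templates. */
__attribute__((cold, noinline)) static int64_t
deopt_stub(void)
{
    /* ...rebuild interpreter state and fall back to tier one... */
    return -1;
}

/* Hot template body: the common case runs straight through; only the
 * unlikely overflow takes the branch into cold code. */
static int64_t
guarded_add(int64_t a, int64_t b)
{
    int64_t result;
    if (__builtin_expect(__builtin_add_overflow(a, b, &result), 0)) {
        return deopt_stub();   /* cold path, out of line */
    }
    return result;             /* compact, linear hot path */
}
```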

Early experiments suggest a ~1-2% performance improvement with this approach.

Cache top-of-stack values in registers when tail-calling between templates.

See this paper for inspiration.

This is a bigger project, but it will likely be one of the larger wins here. It should be made much easier by our fancy interpreter generator, which gives us a decent framework for generating variants of instructions whose inputs and outputs live outside the value stack, and for generating lookup tables that replace them automatically during trace formation.

The above step of changing the calling convention will be helpful here, since we'll get a lot more registers to pin stack items in across tail calls. We can probably do this in a way that works for the tier two interpreter as well.

No idea what this will win us, but it should be significant, given the amount of memory traffic to and from interpreter frames that it would eliminate.
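
To make the idea concrete, a hedged sketch with hypothetical names: the generator would emit register-cached variants of instructions, and trace formation would consult a lookup table to swap them in whenever the previous instruction left its result in a register.

```c
#include <stdint.h>

typedef intptr_t StackRef;    /* stand-in for a value-stack item */

/* Baseline variant: the operand round-trips through the in-memory
 * value stack on every instruction. */
static StackRef *
increment(StackRef *stack_pointer)
{
    stack_pointer[-1] += 1;   /* load + store through memory */
    return stack_pointer;
}

/* Register-cached variant: the top-of-stack value arrives in a
 * register argument and the result leaves the same way, so the
 * value stack is never touched. */
static StackRef
increment_r0(StackRef tos)
{
    return tos + 1;           /* no memory traffic at all */
}
```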

Linked PRs

brandtbucher commented 4 months ago

Upon discussion with @markshannon, it seems that both hot-cold splitting and top-of-stack caching can be done in the tier two IR, with minimal JIT surgery. That seems preferable, since it keeps the JIT simpler, the optimizer may be able to do something with it, and the same schemes can be used in the tier two interpreter. We'll probably need similar integration for frame reconstruction anyway.