diegorusso commented 1 month ago

Feature or enhancement

Proposal:

This is really a follow up of https://github.com/python/cpython/issues/115802 and more focused on the AArch64 improvements of the code generated for the JIT. This has been discussed with @brandtbucher during PyCon 2024.

There are a series of incremental improvements that we could implement when generating AArch64 code:

[x] Remove duplication of trampoline section (movk) at the end of every micro op assembly code.

    // 0000000000000140:  R_AARCH64_MOVW_UABS_G0_NC    PyObject_Free
    // 144: f2a00008      movk    x8, #0x0, lsl #16
    // 0000000000000144:  R_AARCH64_MOVW_UABS_G1_NC    PyObject_Free
    // 148: f2c00008      movk    x8, #0x0, lsl #32
    // 0000000000000148:  R_AARCH64_MOVW_UABS_G2_NC    PyObject_Free
    // 14c: f2e00008      movk    x8, #0x0, lsl #48
    // 000000000000014c:  R_AARCH64_MOVW_UABS_G3       PyObject_Free
    // 150: d61f0100      br      x8
    // 154: 00 00 00 00
    // 158: d2800008      mov     x8, #0x0
    // 0000000000000158:  R_AARCH64_MOVW_UABS_G0_NC    PyObject_Free
    // 15c: f2a00008      movk    x8, #0x0, lsl #16
    // 000000000000015c:  R_AARCH64_MOVW_UABS_G1_NC    PyObject_Free
    // 160: f2c00008      movk    x8, #0x0, lsl #32
    // 0000000000000160:  R_AARCH64_MOVW_UABS_G2_NC    PyObject_Free
    // 164: f2e00008      movk    x8, #0x0, lsl #48
    // 0000000000000164:  R_AARCH64_MOVW_UABS_G3       PyObject_Free
    // 168: d61f0100      br      x8

[ ] Implement trampoline with LDR of a PC relative literal (instead of movk). It saves 8bytes in code size.
[ ] Move the trampolines from the "code" section of a micro-op to the "data" section, so it's out-of-line.
[ ] Emit all of the trampolines at the end of every trace, so that each opcode doesn't need its own copy of the trampolines it uses. Also write a function to generate the trampoline.
[ ] Once we have a slab allocator from https://github.com/python/cpython/issues/119730, a PR use one set of trampolines per-slab rather than per-trace.

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

This has been discussed broadly at PyCon 2024 in person.

Linked PRs

gh-120250
gh-121001

brandtbucher commented 1 month ago

Thanks for organizing our thoughts on this. Okay if I assign you, since you expressed interest in working on it?

Implement trampoline with LDR of a PC relative literal (instead of movk). It saves

Interesting! Mind elaborating on this a bit more? I get that it saves memory, but I'm curious if it's expected to be faster too.

Generate trampoline at the end of the trace instead of at the end of every micro op and write a function to generate the trampoline.

I'd break this up into a couple of phases:

A PR to move the trampolines from the "code" section of a micro-op to the "data" section, so it's out-of-line.
A PR to emit all of the trampolines at the end of every trace, so that each opcode doesn't need its own copy of the trampolines it uses (this could waste some memory initially, but is a nice intermediate step).
Once we have a slab allocator from GH-119730, a PR use one set of trampolines per-slab rather than per-trace.

Also worth mentioning: we'll want to move to short jumps with trampolines on all platforms, not just AArch64 (AArch64 just sort of forces our hand right now since it only lets us use short jumps). So this work should also benefit other platforms too, which is nice.

diegorusso commented 4 weeks ago

Interesting! Mind elaborating on this a bit more? I get that it saves memory, but I'm curious if it's expected to be faster too.

I've updated the original comment saying that it saves 8 bytes. About the speed, I think we need to measure it somehow but I would think it would be the same. The other saving is that we will do only one relocation instead of four.

The code will be something like that:

ldr x8, [PC+8]
br x8
&_Py_Dealloc

So this work should also benefit other platforms too, which is nice.

Of course :)

python / cpython

JIT: improve AArch64 code generation #119726

Feature or enhancement

Proposal:

Has this already been discussed elsewhere?

Links to previous discussion of this feature:

Linked PRs