Open diegorusso opened 1 month ago
Thanks for organizing our thoughts on this. Okay if I assign you, since you expressed interest in working on it?
Implement trampoline with LDR of a PC relative literal (instead of movk). It saves
Interesting! Mind elaborating on this a bit more? I get that it saves memory, but I'm curious if it's expected to be faster too.
Generate trampoline at the end of the trace instead of at the end of every micro op and write a function to generate the trampoline.
I'd break this up into a couple of phases:
Also worth mentioning: we'll want to move to short jumps with trampolines on all platforms, not just AArch64 (AArch64 just sort of forces our hand right now since it only lets us use short jumps). So this work should also benefit other platforms too, which is nice.
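To make the "trampolines at the end of the trace" idea concrete, here is a rough sketch of the layout bookkeeping it implies: each distinct call target gets one trampoline slot after the trace body, rather than one per micro-op. Everything in it is hypothetical (the `layout_trace` helper, the per-micro-op size and call lists, and the 16-byte slot size, which assumes the LDR-literal form); it is not the JIT's actual code.

```python
# Assumed slot size: ldr (4 bytes) + br (4 bytes) + 64-bit literal (8 bytes).
TRAMPOLINE_SIZE = 16


def layout_trace(uop_sizes, uop_calls):
    """Hypothetical helper: place one trampoline per unique symbol after the trace body.

    uop_sizes: code size in bytes of each micro-op in the trace.
    uop_calls: for each micro-op, the external symbols it calls.
    Returns (total_size, {symbol: offset of its trampoline from the trace start}).
    """
    body_size = sum(uop_sizes)
    offsets = {}
    for calls in uop_calls:
        for symbol in calls:
            if symbol not in offsets:
                # The first use of a symbol claims the next free slot; later uses share it.
                offsets[symbol] = body_size + len(offsets) * TRAMPOLINE_SIZE
    total_size = body_size + len(offsets) * TRAMPOLINE_SIZE
    return total_size, offsets


total, slots = layout_trace(
    [32, 48, 16],
    [["_Py_Dealloc"], ["_Py_Dealloc", "PyObject_Free"], []],
)
```

In this made-up trace, `_Py_Dealloc` is called from two micro-ops but gets only one slot, so the trace carries two trampolines instead of the three it would need with one per call site.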
Interesting! Mind elaborating on this a bit more? I get that it saves memory, but I'm curious if it's expected to be faster too.
I've updated the original comment to say that it saves 8 bytes. As for speed, we would need to measure it, but I would expect it to be the same. The other saving is that we would need only one relocation instead of four.
The code would look something like this:
ldr x8, 1f            // PC-relative literal load of the address stored at PC+8
br x8
1: .quad _Py_Dealloc  // 64-bit address of the target
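As a rough illustration of the bytes that sequence produces, the sketch below packs the two instructions and the 64-bit literal into one 16-byte blob. The two opcode constants are the standard A64 encodings of `ldr x8, #8` (LDR literal, offset 8) and `br x8`; the `encode_trampoline` helper and the example address are hypothetical and not taken from the JIT sources.

```python
import struct

LDR_X8_PC_PLUS_8 = 0x58000048  # ldr x8, #8: load the 64-bit literal 8 bytes past this instruction
BR_X8 = 0xD61F0100             # br x8: branch to the address now held in x8


def encode_trampoline(target: int) -> bytes:
    """Hypothetical helper: return the 16-byte trampoline that jumps to `target`."""
    if not 0 <= target < 1 << 64:
        raise ValueError("target must fit in 64 bits")
    # Two little-endian 32-bit instructions followed by the 64-bit address literal.
    return struct.pack("<IIQ", LDR_X8_PC_PLUS_8, BR_X8, target)


# Made-up address standing in for &_Py_Dealloc.
blob = encode_trampoline(0x0000FFFF12345678)
```

Only the last 8 bytes depend on the target, which matches the point above about needing a single relocation.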
So this work should also benefit other platforms too, which is nice.
Of course :)
Feature or enhancement
Proposal:
This is a follow-up to https://github.com/python/cpython/issues/115802, focused more on AArch64 improvements to the code generated for the JIT. This was discussed with @brandtbucher during PyCon 2024.
There is a series of incremental improvements that we could implement when generating AArch64 code:
Has this already been discussed elsewhere?
I have already discussed this feature proposal on Discourse
Links to previous discussion of this feature:
This has been discussed broadly at PyCon 2024 in person.