Open diegorusso opened 1 month ago
This seems to be the Discourse discussion https://discuss.python.org/t/jit-mapping-bytecode-instructions-and-assembly/50809
@terryjreedy that Discourse discussion relates more to this issue: https://github.com/python/cpython/issues/118467. @tonybaloney did an initial implementation to dump the JIT code of an executor, and that discussion is about a proposal to dump the JIT code associated with micro ops.
This issue is instead about how the JIT allocates memory at runtime. At the moment every object is allocated on a new page, so a lot of memory goes to padding for page alignment.
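To make the padding cost concrete, here is a minimal sketch of the round-up arithmetic involved when each allocation gets its own page. The helper names and the example sizes are hypothetical and are not taken from jit.c; only the 16K page size comes from this issue.

```python
def round_up_to_page(size: int, page_size: int) -> int:
    """Round size up to the next multiple of page_size (must be a power of two)."""
    return (size + page_size - 1) & ~(page_size - 1)


def padding_for(size: int, page_size: int) -> int:
    """Bytes of padding added when an object of `size` bytes gets its own page(s)."""
    return round_up_to_page(size, page_size) - size


PAGE_SIZE = 16 * 1024  # 16K pages, as reported for macOS in this issue

# Hypothetical object sizes, for illustration only.
for code_size in (512, 4000, 16384, 16385):
    print(code_size, round_up_to_page(code_size, PAGE_SIZE), padding_for(code_size, PAGE_SIZE))
```

Note that an object just one byte larger than a page pays for almost a whole extra page of padding.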
Thanks for this great summary and issue! Yeah, I think this can progress in a few stages:
I can get the ball rolling on step one, and then we can iterate from there.
Hello, thanks for laying out an implementation plan. I was discussing it with a colleague and he raised a couple of observations about the last point.
An alternative to batching compiles might be to take advantage of hardware features on recent Intel and Apple CPUs that allow multiple threads to have different permissions for the same page. Intel has memory protection keys and Apple has pthread_jit_write_protect_np.
With this approach jit.c would set its thread permissions for that page range to RW before emitting the code and then toggle them back to RX afterwards, and this wouldn't affect another thread that might be concurrently executing another trace on that page. This also avoids the overhead of calling mprotect for each compile, which can be significant if there are many running threads.
For systems without these hardware features we could either fall back to allocating JIT memory at page granularity or perhaps multi-map the JIT pages, with separate RW and RX mappings of the same physical pages. The RW mapping would be unmapped after the JIT has finished writing to that page.
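As a rough illustration of the multi-mapping fallback, the sketch below maps the same backing pages twice on a Unix system: once writable, once read-only. It is only a model of the idea. A real JIT would map the second view as PROT_READ | PROT_EXEC and would do this in C; the function name and the use of a temporary file as backing store are assumptions made for the example.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def emit_via_dual_mapping(code: bytes) -> mmap.mmap:
    """Write `code` through a temporary RW view and return a read-only view
    of the same pages (a stand-in for the RX mapping a JIT would keep)."""
    # Round the size up to whole pages (at least one page).
    size = max(1, -(-len(code) // PAGE)) * PAGE
    with tempfile.TemporaryFile() as backing:
        backing.truncate(size)
        fd = backing.fileno()
        # Two shared mappings of the same file share the same physical pages.
        rw = mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
        ro = mmap.mmap(fd, size, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
        rw[:len(code)] = code  # "emit" through the writable view
        rw.close()             # drop the writable view once emission is done
    return ro


view = emit_via_dual_mapping(b"\x90\x90\xc3")  # hypothetical machine-code bytes
print(view[:3])
```

After the writable view is closed, the bytes stay visible through the remaining mapping, and no mprotect call was needed to change permissions on pages that other threads might be using.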
Thoughts?
Ah, neat, I didn't know Intel/AMD had hardware protection keys! Sounds like that's a good plan then. I agree that falling back to one trace per page on other platforms makes the most sense.
> take advantage of hardware features on recent Intel and Apple CPUs that allow multiple threads to have different permissions for the same page.
How "recent" are we talking? We should be aware of the additional cost of two behaviors / code paths for this, especially in terms of testing.
Feature or enhancement
Proposal:
The issue https://github.com/python/cpython/issues/116017 already explains the problem with the memory allocation used by the JIT.
To provide more data points, I debugged this a little further: I added some debugging output to _PyJIT_Compile and then ran pyperformance. The output tracks the memory allocated and the padding used to align it to the page size. The function was called 1288249 times, and this is the ratio between the memory actually used and the padding due to the 16K page size (on macOS): 71% of the allocated memory is wasted on padding, whilst only 29% is used by data. This indicates that the memory needed for these objects is usually much smaller than the page size.
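For a sense of scale, here is a back-of-the-envelope estimate from the figures above. It assumes that every call allocated exactly one 16K page, which this issue does not state, so the results are indicative only. The totals are cumulative over the whole run, not live memory at any one time.

```python
CALLS = 1_288_249        # number of _PyJIT_Compile calls reported above
PAGE_SIZE = 16 * 1024    # 16K page size (macOS)
USED_FRACTION = 0.29     # share of allocated memory holding data


def estimate(calls: int, page_size: int, used_fraction: float, pages_per_call: int = 1) -> dict:
    """Estimate totals, assuming each call allocates `pages_per_call` pages
    (an assumption made for this sketch, not a measured value)."""
    total = calls * pages_per_call * page_size
    used = total * used_fraction
    return {
        "total_bytes": total,
        "wasted_bytes": total - used,
        "avg_payload_bytes": used / calls,
    }


stats = estimate(CALLS, PAGE_SIZE, USED_FRACTION)
print(f"average payload per call: {stats['avg_payload_bytes']:.0f} bytes")
print(f"cumulative padding: {stats['wasted_bytes'] / 2**30:.1f} GiB")
```

Under that assumption the average object is well under a third of a page, so several of them would fit in one page if they were packed together.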
This is a brain dump from @brandtbucher to help out with the implementation:
Has this already been discussed elsewhere?
I have already discussed this feature proposal on Discourse
Links to previous discussion of this feature:
This has been discussed with Brandt via email and in person at PyCon 2024.