diegorusso commented 1 month ago

Feature or enhancement

Proposal:

The issue https://github.com/python/cpython/issues/116017 already explains the problem with the memory allocation used by the JIT.

To provide more data points, I dug into this a bit further: I added some debugging output to _PyJIT_Compile and then ran pyperformance. The debugging output covers the memory allocated and the padding used to align it to the page size. The function was called 1,288,249 times, and these are the totals for the memory actually needed versus the padding due to the 16K page size (on macOS):

Total Padding size: 16,490,764,792
Total Code/Data size: 6,737,241,608

71% of the memory allocated is wasted on padding, whilst only 29% is used by code and data. This indicates that the memory needed for these objects is usually much smaller than the page size.
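For anyone who wants to re-derive the percentages, here is a minimal sketch of the arithmetic in Python. The constant names are mine; the three figures are the ones reported above, and the 16 KiB page size is the macOS value mentioned in the text.

```python
# Figures copied from the measurements above; the names are illustrative only.
TOTAL_PADDING = 16_490_764_792    # bytes of padding across all compiles
TOTAL_CODE_DATA = 6_737_241_608   # bytes of actual code and data
CALLS = 1_288_249                 # number of _PyJIT_Compile calls
PAGE_SIZE = 16 * 1024             # 16K pages, as on macOS

total_mapped = TOTAL_PADDING + TOTAL_CODE_DATA
padding_share = TOTAL_PADDING / total_mapped
data_share = TOTAL_CODE_DATA / total_mapped
avg_code_data = TOTAL_CODE_DATA / CALLS   # average useful bytes per compile

print(f"padding: {padding_share:.0%}, code/data: {data_share:.0%}")
print(f"average code/data per compile: {avg_code_data:.0f} bytes")
print(f"whole pages mapped: {total_mapped // PAGE_SIZE}")
```

The average code/data size per compile comes out at a little over 5,000 bytes, well under one 16K page, which is consistent with the observation above.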

This is a brain dump from @brandtbucher to help out with the implementation:

for 3.14 we'll probably need to look into some sort of slab allocator that will let us share pages between executors. We can allocate by either batching the compiles or stopping the world to flip the permission bits, and then deallocate by maintaining refcounts of each page or something. [...] One benefit that could come with an arena allocator is the ability to JIT a bunch of guaranteed-in-range trampolines for long jumps to library/C-API calls, rather than needing to create a ton of redundant in-line trampolines inline in the trace (or using global offset table hacks). That should save us memory and speed things up, I think.

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

This has been discussed with Brandt via email and in person at PyCon 2024.

terryjreedy commented 1 month ago

This seems to be the Discourse discussion https://discuss.python.org/t/jit-mapping-bytecode-instructions-and-assembly/50809

diegorusso commented 1 month ago

This seems to be the Discourse discussion https://discuss.python.org/t/jit-mapping-bytecode-instructions-and-assembly/50809

@terryjreedy that Discourse discussion is more relevant to this issue: https://github.com/python/cpython/issues/118467. @tonybaloney did an initial implementation to dump the JIT code of an executor, and that discussion is about a proposal to dump the JIT code associated with micro-ops.

This issue, instead, targets how the JIT allocates memory at runtime. At the moment every object is allocated on a new page, so there is a lot of padding for page alignment.
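To illustrate where the padding comes from, here is a minimal sketch of round-up-to-page arithmetic in Python. It is not the code in jit.c; `padded_size` is a hypothetical helper and the 16 KiB default is just the macOS page size mentioned earlier.

```python
def padded_size(size: int, page_size: int = 16 * 1024) -> tuple[int, int]:
    """Return (bytes actually mapped, bytes of padding) for one allocation.

    Hypothetical helper: rounds `size` up to the next multiple of `page_size`.
    """
    padding = -size % page_size   # distance to the next page boundary
    return size + padding, padding


# A 5,000-byte trace still occupies a whole 16K page.
mapped, wasted = padded_size(5000)
print(mapped, wasted)
```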

brandtbucher commented 1 month ago

Thanks for this great summary and issue! Yeah, I think this can progress in a few stages:

1. Carve out pages from a single large slab of memory (for free-threading-safety reasons we'll probably want to future-proof this by giving each thread its own slab), but still keeping executors on their own pages. The executors free their own pages when they are deallocated, as they do now (this can happen safely in any thread, not just the allocating thread).
2. Rather than freeing the pages, we may want to reuse them. Could be worth exploring once the allocator exists.
3. Then, start using the beginning of these slabs for things like trampolines, to reduce duplication amongst traces. We'll probably want some sort of (thread-safe!) refcount on the whole slab to keep from leaking this memory if every trace on the page is freed.
4. Finally, the hard part: put several executors on the same page. I think batching the compiles makes the most sense, since it means we don't have to stop the world for every compilation, but there might be other schemes that make sense. Then we need to maintain (thread-safe!) refcounts for each page, to make sure the memory is reclaimed once all traces on a page die.

I can get the ball rolling on step one, and then we can iterate from there.
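As a rough illustration of the last stage (several traces sharing a page, with per-page refcounts), here is a toy model in Python. Everything in it is hypothetical: `ToySlab`, its methods, and the bump-pointer policy are assumptions made for the sketch, it tracks offsets rather than real memory, and it is single-threaded, whereas the real refcounts would need to be thread-safe.

```python
class ToySlab:
    """Toy model of a slab whose pages are shared by several traces."""

    def __init__(self, num_pages: int, page_size: int = 16 * 1024):
        self.page_size = page_size
        self.refcounts = [0] * num_pages          # live traces per page
        self.free_pages = list(range(num_pages))  # pages with no live traces
        self.open_page = None                     # page currently being filled
        self.open_offset = 0                      # next free byte in that page

    def allocate(self, size: int) -> tuple[int, int]:
        """Reserve `size` bytes; return (page index, offset within page)."""
        if size > self.page_size:
            raise ValueError("toy model: a trace must fit in one page")
        if self.open_page is None or self.open_offset + size > self.page_size:
            if not self.free_pages:
                raise MemoryError("slab exhausted")
            self.open_page = self.free_pages.pop(0)
            self.open_offset = 0
        page, offset = self.open_page, self.open_offset
        self.open_offset += size
        self.refcounts[page] += 1
        return page, offset

    def release(self, page: int) -> None:
        """Drop one trace's reference; reclaim the page when none remain."""
        if self.refcounts[page] <= 0:
            raise ValueError("page has no live traces")
        self.refcounts[page] -= 1
        if self.refcounts[page] == 0:
            if page == self.open_page:
                self.open_offset = 0           # reuse the open page from the start
            else:
                self.free_pages.append(page)   # page can be handed out again


slab = ToySlab(num_pages=2)
placements = [slab.allocate(5000) for _ in range(4)]
print(placements)       # three traces share page 0, the fourth spills to page 1
for _ in range(3):
    slab.release(0)     # once all three die, page 0 becomes reusable
print(slab.free_pages)
```

In this model, four 5,000-byte traces occupy two pages instead of four, and a page is only reclaimed once every trace on it has been released.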

diegorusso commented 1 month ago

Hello, thanks for laying out an implementation plan. I was discussing it with a colleague and he raised a couple of observations about the last point. An alternative to batching compiles might be to take advantage of hardware features on recent Intel and Apple CPUs that allow multiple threads to have different permissions for the same page. Intel has memory protection keys and Apple has pthread_jit_write_protect_np. With this approach, jit.c would set its thread's permissions for that page range to RW before emitting the code and then toggle them back to RX afterwards, and this wouldn't affect another thread that might be concurrently executing another trace on that page. This also avoids the overhead of calling mprotect for each compile, which can be significant if there are many running threads. For systems without these hardware features we could either fall back to allocating JIT memory at page granularity, or perhaps multi-map the JIT pages with separate RW and RX mappings of the same physical pages. The RW mapping would be unmapped after the JIT has finished writing to that page.

Thoughts?
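To make the multi-mapping fallback a bit more concrete, here is a small Unix-only sketch in Python of two views onto the same physical pages. It is only a stand-in for what jit.c would do in C: the temporary backing file and the `demo_dual_mapping` name are assumptions made for the demo, and the second view is read-only here where a real JIT would map it read+execute.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def demo_dual_mapping(payload: bytes) -> bytes:
    """Write through an RW view, then read back through a separate RO view."""
    with tempfile.TemporaryFile() as backing:
        backing.truncate(PAGE)   # one page of shared backing storage
        fd = backing.fileno()
        # Two independent mappings of the same underlying pages.
        rw_view = mmap.mmap(fd, PAGE, flags=mmap.MAP_SHARED,
                            prot=mmap.PROT_READ | mmap.PROT_WRITE)
        ro_view = mmap.mmap(fd, PAGE, flags=mmap.MAP_SHARED,
                            prot=mmap.PROT_READ)  # stand-in for the RX mapping
        try:
            rw_view[:len(payload)] = payload      # "emit" code via the RW view
            rw_view.close()                       # drop the writable mapping
            return bytes(ro_view[:len(payload)])  # still visible via the RO view
        finally:
            rw_view.close()                       # closing twice is harmless
            ro_view.close()


print(demo_dual_mapping(b"\x90\x90\xc3"))
```

The point of the design is that the view other threads use never becomes writable, so no permission flip is needed on it while code is being emitted.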

brandtbucher commented 1 month ago

Ah, neat, I didn't know Intel/AMD had hardware protection keys! Sounds like that's a good plan then. I agree that falling back to one trace per page on other platforms makes the most sense.

mdboom commented 1 month ago

take advantage of hardware features on recent Intel and Apple CPUs that allow multiple threads to have different permissions for the same page.

How "recent" are we talking? We should be aware of the additional cost of two behaviors / code paths for this, especially in terms of testing.

JIT: improve memory allocation #119730
