python / cpython

The Python programming language
https://www.python.org/
Other
60.78k stars 29.34k forks source link

Compiling tiny traces wastes lots of memory #116017

Open brandtbucher opened 4 months ago

brandtbucher commented 4 months ago

The JIT has a minimum allocation granularity of 1 page, with code and data on separate pages. Reducing this minimum would mean having pages of memory shared between executors, which is much harder to manage (and isn't thread-safe).

So the minimum amount of memory owned by an executor is 2 pages. This equals 8KB of memory on most platforms (the notable exception being newer Macs which have 16KB pages, and thus a whopping 32KB minimum allocation per trace). And all of that memory is paged in, since there's at least some memory being used on every page.

We should tune the trace collection machinery to find a sensible minimum trace length for tier two. We might also consider compiling fewer traces.

A separate, but hairier, issue is what to do about the array of cold exit executors. We have 512 of them, and they're one instruction each, adding up to 4MB of mostly empty memory (16MB on macOS). For reference, if optimally packed (i.e. compiled as a single trace of length 512), they would take up about 200KB.

@mdboom has been working on measuring the memory overhead of tier two on our benchmark suite. Early results show that:

(Keep in mind that most of the benchmarks have a tiny memory footprint to begin with, so moderate increases can manifest as large percentages.)

CC: @gvanrossum, @markshannon

Linked PRs

brandtbucher commented 4 months ago

We may also want to consider a scheme for garbage-collecting cold traces. We already have a linked list of executors and a way of invalidating/removing them, so it probably wouldn't be too big of a project.

gvanrossum commented 4 months ago

Yeah, for a hot executor the memory waste feels less of an issue than for a cold executor. Without measuring what's hot, we could at least have separate min-trace-size limits if the executor has a loop in it and if it doesn't. Mark probably has more interesting things to say. Meeting agenda for tomorrow?

brandtbucher commented 4 months ago

Action item for me here is to investigate a more efficient allocation scheme (with multiple executors per page).

savannahostrowski commented 2 months ago

I'll take a look at this ahead of 3.14!

markshannon commented 5 days ago

We will need lots of "tiny traces" (short executors with only a few uops), to link longer traces, form loops, handled specialized entries and exits, etc. So avoiding "tiny" traces is not an option.

The solution to this seems to be compacting traces together during GC. It should give us high density and allow as many "tiny traces" as we need. (This is not my idea, but I don't see it written down anywhere).

@brandtbucher or @savannahostrowski could you write up an issue for compaction (or link to an existing one)?

brandtbucher commented 5 days ago

Yep, I’ll write one up when I get back on Monday.