HugePageFiller, how does it do what it does?

google / tcmalloc

Apache License 2.0

4.39k stars 476 forks source link

After a quick email exchange with @ckennelly , he suggested I open an issue about this, so that there is a public trail of the conversation (good call).

So I've been wondering about the implementation of the huge page aware allocator . While the paper explains really well what the code does and why, the how is a bit more mysterious.

So let me ask a few specific question that would help going through the code to understand it.

The allocator organize itself around huge page. When it needs memory from these pages it prioritize using the existing huge pages that already have a lot of allocation, as statistically, they are less likely to to be returned to the OS, and therefore we want to cram as much as possible in there. Another critical parameter is the longest range of unallocated slots withing the huge page as we want to find a huge page with a range of unallocated memory that is big enough, but also generally want to avoid splitting large range of unallocated memory.

It is unclear to me how the selection process works in practice. As far as i can tell, HugePageFiller is responsible for this. It seems to manage an ungodly amount of freelists (512 possible free range length, 64 priority classes). Is it really that simple? Considering there are lists for partially released and releases huge pages too, as well as donated ones, this makes for a ungodly amount of freelists, no?

Isn't it a problem that the huge page are only selected based on their longest run? A page with a high score (as in, unlikely to become empty) and a large free range (plus other smaller free ranges) will not see its smaller free range being allocated before lower priority huge block run out of small free ranges (if they don't have large one). Isn't it a problem? Or is the correlation between "no large free range" and "unlikely to deallocate" good enough so that it works in practice?

Last, but probably not least, I don't understand the return path. When some allocation is freed, it goes through this magical box that is the middle end is completely opaque to me and i can't find a good explanation of what it does and why.

HugePageFiller consulting a bunch of freelists is (approximately) all it does. Few notes, though:

With HintedTrackerList, the prioritization ends up working through the score, so it's easy to look at the minimum scoring, non-empty list of hugepages with score >= N. The available hugepages (represented by PageTracker) are stored on an intrusive linked-list, but the bitmask we maintain means we can go straight to it if there's a match.
We end up having several distinct HintedTrackerList instances, to distinctly prefer a hugepage that has never been split up (into small pages) over one that has been, etc.

In terms of longest free run, this is discussed a bit in our paper, section 4.4:

There end up being 8 priorities ("chunk index"), since we're taking the logarithm.
Longest free run is a hard requirement for an allocation. We have to have at least N contiguous pages to allocate N pages, so it narrows the search dramatically.
For an incoming allocation where we have a choice between two satisifactory hugepages--such as the one with a large LFR (but many allocations increasing its chunk) and one with a smaller LFR (but lower score)--we're aiming for two things
- Minimize fragmentation. We'd like to keep the larger LFR intact for future, large(r) allocations than the one we're making now. We otherwise need another huge page for such an allocation, so fragmenting the larger LFR might be worse in the long run.
- Trying to steer towards hugepages less likely to become free (by avoiding ones more likely to be)
This is just a heuristic though, but it seems to work reasonably well.

In terms of the return path: We're primarily using the address-indexed radix tree (pagemap) to look up information about what huge page an allocation came from. As we finished up the allocation during HugePageAwareAllocator::LockAndAlloc, we updated the radix tree with the info we need for deallocation.

Most allocations are small (Figure 5) for the page heap, so we're going from HugePageAwareAllocator::Delete, obtaining a non-nullptr pt, and then going to HugePageFiller::Put.
We remove it (RemoveFromFillerList) from its current list (with its current LFR/chunk index),
Marking the pages as free (PageTracker::Put), and then
Recompute what priority score it should have to re-index it and add it (AddToFillerList) to the new list for that score.

google / tcmalloc

HugePageFiller, how does it do what it does? #172