@SchrodingerZhu Thank you for opening this issue. I'd like to mention some details about my situation. I'm doing multi-threaded video processing, and typical allocations are 1080p 16-bit frame buffers (or more specifically, buffers of frame planes, 4,147,200 bytes each). There are also frame buffers of other sizes. In my application, these frame buffers are allocated in arbitrary worker threads and deallocated in one GC thread. Considering the performance of returning freed space from the GC thread back to the worker threads on deallocation, I adopted snmalloc as the allocator for frame buffers.
I first tried snmalloc1. Because snmalloc's default chunk size is not big enough for frame buffers, I configured it to use 16MB large chunks. Under this configuration, one mediumslab can hold four 16-bit 1080p frame buffers; it holds fewer frame buffers of larger sizes. In the current implementation of snmalloc, a mediumslab is deallocated as soon as it becomes empty, and it goes onto the global large-object stack, which requires synchronization. Given that the GC deallocates frame buffers in batches and one mediumslab can only hold four of them, mediumslabs are expected to become empty frequently and trigger large deallocations. This scenario is not unique to my application: compared with superslabs, mediumslabs hold fewer (larger) objects, so they naturally become empty more easily when medium objects are allocated and deallocated frequently.
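For concreteness, a quick back-of-the-envelope check of these numbers (assuming one 1920×1080 plane at 2 bytes per sample, and ignoring slab headers and size-class rounding):

```cpp
#include <cstddef>

// Illustrative arithmetic only; the exact per-slab capacity also depends on
// slab metadata, which is ignored here.
constexpr std::size_t plane_bytes = 1920 * 1080 * 2;                // 4,147,200
constexpr std::size_t chunk_bytes = 16 * 1024 * 1024;               // 16 MiB
constexpr std::size_t planes_per_chunk = chunk_bytes / plane_bytes; // 4
```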
I also tried snmalloc2, configured to use a 16MB large buddy. It decommits a 16MB block as soon as the block becomes empty, resulting in worse performance.
My idea is that maybe we could add a freelist of mediumslabs (or large buddy blocks) with limited capacity. When a mediumslab becomes empty, it would first go into this local freelist instead of immediately triggering a global large deallocation/decommit.
I think (but @mjp41 is actually the authority here) that, for snmalloc2, there are several things conspiring to hurt performance here: not only does it `notify_not_using` everything exiting the per-thread large buddy allocator and entering the backend global (synchronized) large buddy allocator, it also disassociates the per-thread allocator from that address space at the same time, which here requires updating 16M/MIN_CHUNK_SIZE = 16M/16K = 1024 pagemap entries. That's a fairly sizeable critical section, so it is likely to increase contention on the `GlobalRange`'s flag lock, which, being just a spinlock, probably isn't coping particularly well.
I can think of a few options that might be worth pursuing:

- A `LargeMagazine` range type that just holds on to a few (16M) regions before forwarding them to its parent range; use this in the frontend allocators, between their `SmallBuddyRange` and `LargeBuddyRange` (a rough sketch follows below).
- A `LargeBuddyRange` variant where it can (optionally) hold on to ranges larger than what it would usually push up to its parent.
- Not storing the `RemoteAllocator*` directly in the pagemap and instead having it inside the associated `Metaslab` structure. This would serve to reduce the O(chunks) work done under the `GlobalRange` flag lock to O(1), but it comes at the cost of an additional pointer indirection on the `dealloc` fast path.

This is not meant to be an exclusive list, just food for thought.
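To make the first option concrete, here is a minimal sketch of what a magazine-style layer could look like. This is not snmalloc's actual Range interface; the class name, the parent interface (`alloc_region`/`dealloc_region`), and the capacity of 4 are all hypothetical choices for illustration only.

```cpp
#include <cstddef>

// Hypothetical sketch of a "magazine" layer, NOT snmalloc's real Range
// concept. A small fixed-capacity stack of uniformly sized regions is kept
// locally; only when it overflows do regions get pushed to the
// (synchronized) parent, and only when it is empty do allocations fall
// through to the parent.
template<typename Parent, std::size_t RegionSize, std::size_t Capacity = 4>
class LargeMagazineSketch
{
  Parent parent;              // e.g. the next range in the chain (assumed API)
  void* slots[Capacity];      // cached empty regions, all RegionSize bytes
  std::size_t count = 0;

public:
  void* alloc_region()
  {
    if (count > 0)
      return slots[--count];                 // reuse a cached region locally
    return parent.alloc_region(RegionSize);  // fall through to the parent
  }

  void dealloc_region(void* p)
  {
    if (count < Capacity)
    {
      slots[count++] = p;                    // keep it local; no decommit/lock
      return;
    }
    parent.dealloc_region(p, RegionSize);    // overflow: hand back to parent
  }
};
```

The point is just that an empty region only crosses into synchronized, decommit-prone territory once the small local cache overflows.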
@nwf the pagemap is updated by the `BackendAllocator`, which does not hold the global lock.
My plan was to use the code in `ChunkAllocator`. This currently performs time-based scrubbing of chunks. That code was not refactored with the move to `Range`s, but it needs to be for precisely this case. It would keep a local cache of the larger sizes and only push them centrally once a certain amount of time has passed. I think this would fit this use case perfectly.
BTW, I'm measuring the cost of decommits on Windows and trying to find alternatives. I tested snmalloc2 with four different PAL variants:
**Decommit**

```cpp
static void notify_not_using(void* p, size_t size) noexcept
{
  SNMALLOC_ASSERT(is_aligned_block<page_size>(p, size));
  BOOL ok = VirtualFree(p, size, MEM_DECOMMIT);
  if (!ok)
    error("VirtualFree failed");
}
```

**No-op**

```cpp
static void notify_not_using(void* p, size_t size) noexcept {}
```

**Reset**

```cpp
static void notify_not_using(void* p, size_t size) noexcept
{
  SNMALLOC_ASSERT(is_aligned_block<page_size>(p, size));
  void* r = VirtualAlloc(p, size, MEM_RESET, PAGE_READWRITE);
  if (r == nullptr)
    error("VirtualAlloc failed");
}
```

**Offer & Reclaim**

```cpp
static void notify_not_using(void* p, size_t size) noexcept
{
  SNMALLOC_ASSERT(is_aligned_block<page_size>(p, size));
  auto rv = OfferVirtualMemory(p, size, VmOfferPriorityNormal);
  if (rv != ERROR_SUCCESS)
    error("OfferVirtualMemory failed");
}

template<ZeroMem zero_mem>
static void notify_using(void* p, size_t size) noexcept
{
  SNMALLOC_ASSERT(
    is_aligned_block<page_size>(p, size) || (zero_mem == NoZero));
  auto rv = ReclaimVirtualMemory(p, size);
  if (rv == ERROR_SUCCESS || rv == ERROR_BUSY)
    return;
  void* r = VirtualAlloc(p, size, MEM_COMMIT, PAGE_READWRITE);
  if (r == nullptr)
    report_fatal_error(
      "out of memory: {} ({}) could not be committed", p, size);
}
```
I measured performance (in my application, frames processed per second) and page faults over 10,000 frames. The results are interesting:

| PAL | FPS | Page faults (10,000 frames) |
|---|---|---|
| Decommit | 340.21 | 12,557,692 |
| No-op | 382.05 | 141,524 |
| Reset | 365.97 | 151,490 |
| Offer & Reclaim | 246.59 | 17,417,188 |
I'm surprised that Offer & Reclaim performs so badly.
This is reported by one of my friends (@misakikasumi). In his application, he used snmalloc to manage 16-bit 1080p frame buffers, each of which takes 1/4 of a medium-sized slab. Based on his observation, this layout causes a very frequent de-committing pattern and introduces a slight slowdown on Windows compared with the system allocator or mimalloc.
By design, `LocalCache` only stores free lists for small size classes. Would it be possible to introduce some additional bounded free lists for relatively large objects, so that we can avoid eager de-committing? @misakikasumi, would you like to share more details?