intel / GPGMM

GPGMM, a General-Purpose GPU Memory Management Library.
Apache License 2.0

Support pooled dynamic memory allocation for CPU allocations. #321

Open bbernhar opened 2 years ago

bbernhar commented 2 years ago

Replace new and delete to allocate and de-allocate, respectively, through an allocator that recycles CPU heaps.

bjjones commented 2 years ago

@bbernhar So I'm confused - the prompt you emailed me said we needed to implement a dynamic allocator, but after looking at the code, we already have the SlabMemoryAllocator, which appears to meet the requirements. Is there something I'm missing here? Why wouldn't we just use this?

bbernhar commented 2 years ago

Sounds like it.

This issue wants to amortize dynamic allocations from the C++ runtime, created by GPGMM itself. The SlabMemoryAllocator targets GPU device memory and cannot malloc/free from C++ runtime heaps.

In theory it could, but a GPU allocator operates under such unique constraints (ex. cannot embed metadata freely) that the overlap here doesn't extend very far past the allocation technique itself.

Hopefully that makes it more clear.

bjjones commented 2 years ago

So in general for this project we need to 1) improve allocation locality to decrease the amount of pointer chasing and 2) reuse memory to reduce the cost of allocation. Here's a list of all the problem areas I'm aware of, categorized by the container used:

Chromium LinkedList/LinkNode

  • ResidencyManager: Heap Type
  • SlabMemoryAllocator: Slab Type
  • AllocateNode: AllocatorNode Type
  • SegmentedMemoryAllocator: MemorySegment Type

std::set

  • ResidencySet: UnderlyingType Type

std::unordered_map

  • EventTraceWriter: std::unordered_map<std::thread::id, ScopedTraceBufferInTLS*> // maybe not that important

std::unordered_set

  • MemoryCache: Cache Type

It's been suggested that I try std::pmr as a simple way to improve allocation locality; however, std::pmr only works with STL containers. That's fine for the STL containers we're using, but we also use Chromium's LinkedList implementation, which is heavily integrated into a few complex areas of GPGMM. After researching the prospect of migrating LinkedList/LinkNode to an STL container, I feel a non-PMR-based custom allocator is the best option for improving the allocation behavior. The primary motivation for using std::pmr, as I understand it, is ease of use, but when it requires changing the data structure that 4 complex systems are built around, that benefit isn't very persuasive.

I think we should just go ahead with using the slab allocator from Dawn.

bbernhar commented 2 years ago

When you say "use std::pmr", would we literally just replace both std:: containers and non-std:: containers, no matter how they're used?

Also, you say (1) is important (I agree) but how does that get addressed?

bjjones commented 2 years ago

When you say "use std::pmr", would we literally just replace both std:: containers and non-std:: containers, no matter how they're used?

To "use std::pmr" I mean to replace a data structure with the pmr version of the appropriate STL container (example: std::pmr::set), and define how the container is allocates/deallocates with the facilities provided by the language. We shouldn't just blindly change all of these over, but any containers creating persistent objects should be considered. If we are using a non-STL container, like our LinkedList/LinkNode, we do not have the option to use the pmr allocate/deallocate facilities, so we must port the custom container to an STL container and use pmr, or we can implement our own allocator.

Also, you say (1) is important (I agree) but how does that get addressed?

Using any kind of pooled memory with memory reuse should adequately accomplish this. Since I'm suggesting we continue to use a linked list, the best we can reasonably do is ensure all nodes are allocated contiguously without fragmentation. We could probably do better if the nodes were sorted (possibly by using a vector instead), but I don't think the implementation cost is worthwhile, and I suspect the additional CPU overhead is non-negligible.

bjjones commented 2 years ago

There's another solution that I might be getting to - but my inexperience with pmr is making me unsure how feasible it is: allocating LinkNodes in a std::pmr::vector and implementing a custom std::pmr::memory_resource that fixes the memory locations (keeping our vector element pointers stable) and instead allocates another block of memory when the vector needs to be resized. I'm unsure how possible this is, but it may be what Udaya is alluding to.

bbernhar commented 2 years ago

What happens if nodes are removed/inserted in the middle somewhere, are we just going to keep re-sorting?

ResidencyManager doesn't need random insertion across the whole list, only by set. SlabMemoryAllocator doesn't need to maintain any order of slabs, so why keep using a list? AllocateNode doesn't need random insertion/removal either; it's used like a stack. Finally, SegmentedMemoryAllocator does need order, but O(N) random insertion per segment isn't a big deal if N stays small. Truth be told, there is very little reason to use a list. It is mostly there for convenience.

I don't think it would be very hard to change. I'm already familiar with the residency LRU and (soon) slab allocation; the others seem trivial. We could split this work: create a custom allocator first, then migrate LinkedList to the STL, where the allocator could be reused as a std::allocator or std::pmr resource (your choice). This could address issue (1). See the stack-shaped sketch below for the AllocateNode case.
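
As an illustration of the AllocateNode point above, a hedged sketch of a stack-shaped replacement (AllocatorNode here is a stand-in, not the actual GPGMM type):

#include <vector>

struct AllocatorNode { int data; };  // stand-in for the GPGMM type

// If AllocateNode is used like a stack, a std::vector gives contiguous
// node storage with O(1) push/pop at the end - no pointer chasing.
class AllocatorStack {
    public:
        void Push(AllocatorNode node) { mNodes.push_back(node); }
        AllocatorNode Pop() {
            AllocatorNode top = mNodes.back();
            mNodes.pop_back();
            return top;
        }
    private:
        std::vector<AllocatorNode> mNodes;  // contiguous, cache-friendly
};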

bjjones commented 2 years ago

What happens if nodes are removed/inserted in the middle somewhere, are we just going to keep re-sorting?

That's the point of keeping this a list - you don't have to keep it sorted. This is the reason I don't want to use a vector.

ResidencyManager doesn't need random insertion across the whole list

But it does need random deletion. Upon every command list submission we have to remove every resource-to-be-used that exists in the list and move it to the front. With a vector that means reallocation and copying for every resource update.

bbernhar commented 2 years ago

The key part was "across the whole list". You could remove any heap within the set by swapping the last heap with the removed one - that requires no reallocation. Reallocation only occurs per set, which could be stored contiguously.
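
A minimal sketch of that swap-with-last removal (a generic helper, not existing GPGMM code):

#include <cstddef>
#include <utility>
#include <vector>

// O(1) removal with no reallocation, at the cost of element order:
// the last element is moved into the removed slot, then popped.
template <typename T>
void SwapRemove(std::vector<T>& v, std::size_t index) {
    if (index + 1 < v.size()) {
        v[index] = std::move(v.back());
    }
    v.pop_back();
}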

bjjones commented 2 years ago

I don't understand. The operation I'm referring to is moving a heap from the middle of the LRU cache to the front of the LRU when it is used. What you've described does not seem to do that when the LRU is represented by a std::vector, as swapping anything into the spot of the removed heap would cause the LRU to lose order. It would be acceptable to shift all elements by one from the front to the removed element, though.

If we're reallocating the entire LRU per residency set, is that still an acceptable design? The initial requirements were to reuse memory.

Udaya has not commented at all upon the idea of using a custom allocator even though I've brought it up multiple times. I'm not sure why - it seems like the easiest solution to me.

bjjones commented 2 years ago

I'm starting to understand how to use PMR to make my own allocator - this seems like what Udaya was hinting at, and keeping LinkedList/LinkNode untouched would be possible if I allocate everything this way. The main trick is that using a monotonic_buffer_resource with a pmr::vector will ensure pointers to vector elements remain stable until the allocator is destroyed.

Here's some very rough code I wrote up to demonstrate how I think it would work:

#include <cstddef>
#include <cstdint>
#include <memory>
#include <memory_resource>
#include <utility>
#include <vector>

template <typename T>
class VectorBasedPoolAllocator {
    public:
        explicit VectorBasedPoolAllocator(size_t blockSize) {
            mBlockSize = blockSize;
            mObjectsPerBlock = blockSize / sizeof(T);
        }

        T* Allocate(T object) {
            // If a node exists on the free list, reuse the free spot.
            if (!mFreeList.empty()) {
                std::pair<uint64_t, uint64_t> freeSpot = mFreeList.back();
                mFreeList.pop_back();
                mAllocations[freeSpot.first][freeSpot.second] = object;
                return &mAllocations[freeSpot.first][freeSpot.second];
            }

            // If our next allocation offset is zero, we haven't allocated the
            // current block yet. Allocate it.
            if (mNextAllocation.second == 0) {
                mMemoryResources.push_back(
                    std::make_unique<std::pmr::monotonic_buffer_resource>(mBlockSize));
                // Create a vector backed by the fixed buffer, reserving the
                // whole block up-front so push_back never reallocates. This
                // keeps element pointers stable until the allocator is destroyed.
                mAllocations.emplace_back(mMemoryResources.back().get());
                mAllocations.back().reserve(mObjectsPerBlock);
            }

            mAllocations[mNextAllocation.first].push_back(object);

            T* retPtr = &mAllocations[mNextAllocation.first][mNextAllocation.second];

            // Increment the next allocation position, moving to a fresh block
            // once this one is full.
            mNextAllocation.second++;
            if (mNextAllocation.second == mObjectsPerBlock) {
                mNextAllocation.first++;
                mNextAllocation.second = 0;
            }

            return retPtr;
        }

        void Deallocate(T* objectPointer) {
            // Translate T* pointer to [pool][offset] somehow -
            // Zero the position
            // Insert [pool][offset] into mFreeList
        }

    private:
        std::vector<std::pmr::vector<T>> mAllocations;
        std::vector<std::unique_ptr<std::pmr::monotonic_buffer_resource>> mMemoryResources;
        size_t mBlockSize;
        uint64_t mObjectsPerBlock;
        std::pair<uint64_t, uint64_t> mNextAllocation = {0, 0};
        std::vector<std::pair<uint64_t, uint64_t>> mFreeList;
};
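
A quick usage sketch of the above (the block size and element type are just for illustration):

int main() {
    // 4 KiB blocks; each block holds blockSize / sizeof(T) objects.
    VectorBasedPoolAllocator<int> allocator(4096);

    int* a = allocator.Allocate(1);
    allocator.Allocate(2);

    // `a` stays valid across later Allocate() calls because each block's
    // vector reserved its full capacity up-front and never reallocates.
    allocator.Deallocate(a);  // currently a stub; would recycle the slot
    return 0;
}
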
bbernhar commented 2 years ago

You can move any heap within the same set and the LRU would not lose order. If you move a heap anywhere from the LRU, the LRU needs to maintain set-order. This means you chase 1/N pointers around (N = set size) every time you insert or remove a heap into the LRU. In practice, the residency LRU usually only accesses the current working set (always in the front) or removes a bunch (always from the end), not so much randomly in the middle. A deque of sets would probably be much faster.
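
A hedged sketch of that deque-of-sets shape (Heap is a stand-in for the GPGMM heap type; the operations mirror the access pattern described above, not an actual API):

#include <deque>
#include <memory_resource>
#include <set>
#include <utility>

struct Heap;  // stand-in for the GPGMM heap type

using ResidencySet = std::pmr::set<Heap*>;
using ResidencyLRU = std::deque<ResidencySet>;

void Touch(ResidencyLRU& lru, ResidencySet set) {
    // The current working set moves to the front on command list submission.
    lru.push_front(std::move(set));
}

void Evict(ResidencyLRU& lru) {
    // Least-recently-used sets leave from the end under memory pressure;
    // no random middle access is needed.
    if (!lru.empty()) {
        lru.pop_back();
    }
}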

bjjones commented 2 years ago

There are quite a few aspects to this proposed implementation that I don't understand. It's probably best to have a call about it. At the moment I'm going to focus on the custom allocator described above in https://github.com/intel/GPGMM/issues/321#issuecomment-1188425008

bbernhar commented 2 years ago

Unless your free-list maintains order, randomly de-allocated nodes will not be stored contiguously - not unless the data structure allocating them is also contiguous (ex. array, vector, deque) and not linked. This remains an issue. I'm fine with just assuming a contiguous data structure is used, so we could just pool or pre-allocate (slab) from it (don't worry about replacing LinkedList atm).

bjjones commented 2 years ago

Ran into an issue with my allocator implementation. To use std::pmr I need to #include <memory_resource>. libc++ doesn't have a released implementation of this library. It does have experimental/memory_resource, which would work - but I need to modify buildtools/third_party/libc++/BUILD.gn to include it.

1) Is there any issue using an experimental library? It is part of the C++17 standard, and every other STL implementation includes it.

2) Is this something I can write a build_override for?

bbernhar commented 2 years ago

I don't think we can modify/override buildtools because then third_party/gpgmm (or non-standalone) would not build, unless DEPS gets rolled with your change (and build_override is modified in parents, if needed). That's fine for standalone GPGMM, though.

Is there any reason the allocator couldn't work with any (STL or non-STL) allocator type?

bjjones commented 2 years ago

Is there any reason the allocator couldn't work with any (STL or non-STL) allocator type?

I guess I can't think of one. I was using pmr to ensure vectors weren't reallocated upon push_back()...but if we just use reserve() (or just use std::array) it shouldn't matter.
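
For example, a small sketch of the reserve() approach (the sizes are illustrative):

#include <vector>

int main() {
    std::vector<int> nodes;
    nodes.reserve(1024);  // fix capacity up-front

    nodes.push_back(1);
    int* stable = &nodes[0];

    // push_back never reallocates while size <= capacity, so `stable`
    // remains valid throughout this loop.
    for (int i = 2; i <= 1024; ++i) {
        nodes.push_back(i);
    }
    return *stable;
}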

I do think this interferes with my plans to use std::pmr versions of set and unordered_map. With pmr we would be able to implement custom allocators for the STL containers that would allocate in a fixed block of contiguous memory to reduce paging during iteration and enable memory reuse. I don't think we can change how STL containers are allocated without pmr.

bbernhar commented 2 years ago

You can specify a custom allocator with non-std::pmr STL containers too, but it would be statically typed (replaces default std::allocator). There would just be no dynamic dispatch or common allocator type. Does that matter to you?
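
A hedged sketch of that statically-typed alternative (PoolAllocator is hypothetical; a real one would recycle memory rather than forward to operator new):

#include <cstddef>
#include <functional>
#include <new>
#include <set>

// The allocator replaces std::allocator in the container's type, so dispatch
// is resolved at compile time and there is no common allocator type.
template <typename T>
struct PoolAllocator {
    using value_type = T;

    PoolAllocator() = default;
    template <typename U>
    PoolAllocator(const PoolAllocator<U>&) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};

template <typename T, typename U>
bool operator==(const PoolAllocator<T>&, const PoolAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PoolAllocator<T>&, const PoolAllocator<U>&) { return false; }

// The allocator becomes part of the container's static type:
using PooledSet = std::set<int, std::less<int>, PoolAllocator<int>>;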

bjjones commented 2 years ago

That's probably fine since there are only a couple of instances where the STL containers are being used.