Motivation
When doing data processing and machine learning on GPUs with large datasets, we often run into out-of-memory errors. Traditionally, there have been two solutions:
Use unified memory (cudaMallocManaged/cudaFree), which provides a single memory address space accessible from any processor in the system. However, since data migration is triggered by page faults and performed at page granularity, performance is usually poor when GPU memory is oversubscribed.
Manually implement a custom spilling solution. This can be complicated and has to be done for every library.
CUDA 12.2 introduced Heterogeneous Memory Management (HMM) for x86 systems, which extends the unified memory model to include system allocated memory (SAM) obtained via malloc/free. On the Grace Hopper Superchip, SAM support is further enhanced by the fast NVLink-C2C interconnect with Address Translation Services (ATS). Our initial benchmarks show that SAM on Grace Hopper can provide substantial performance benefits when GPU memory is oversubscribed. If we add SAM support to RMM, libraries that already use RMM would need only minimal changes to leverage it.
Goals
Implement an RMM device memory resource that uses system allocated memory.
Work around existing limitations of SAM (see below in design).
Non-Goals
This doesn’t have to be a permanent solution. In the future when SAM improves, we may be able to leverage it directly.
Assumptions
HMM requires the following:
NVIDIA CUDA 12.2 with the open-source r535_00 driver or newer.
A sufficiently recent Linux kernel: 6.1.24+, 6.2.11+, or 6.3+.
A GPU with one of the following supported architectures: NVIDIA Turing, NVIDIA Ampere, NVIDIA Ada Lovelace, NVIDIA Hopper, or newer.
A 64-bit x86 CPU.
Query the Addressing Mode property to verify that HMM is enabled:
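On a supported system, this can be done from the command line with nvidia-smi (assuming a sufficiently recent driver):

```shell
# Query the Addressing Mode property; on an HMM-enabled x86 system
# this is expected to report "HMM".
nvidia-smi -q | grep "Addressing Mode"
```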
ATS requires the Grace Hopper Superchip.
Risks
In order to test the new memory resource, we need to update the CI/CD pipeline to use at least a Turing GPU and a newer open-source driver.
Design
There are two issues with using SAM directly when GPU memory is oversubscribed:
Currently a SAM buffer can only migrate one way: from the CPU to the GPU. If we allocate a SAM buffer larger than the remaining free GPU memory and use it in a kernel, it will take up all the free memory. Subsequently, if we make CUDA calls that require some amount of GPU memory (e.g. manipulating CUDA events/streams, or initializing cuBLAS), these calls will fail.
Migration is still triggered by page faults and can be slow if compute density is not very high.
To work around these issues, we add two initialization parameters to the memory resource:
Headroom: the amount of GPU memory to leave for other CUDA calls. When allocating large buffers, we can check how much free memory is left and make sure the given headroom amount is reserved for other CUDA calls.
Threshold: the size of the requested buffer above which to check for headroom. Since checking for free memory can be expensive, we can skip it when allocating small buffers below the threshold.
To maintain the headroom, we can call cudaMemAdvise with cudaMemAdviseSetPreferredLocation to “pin” the buffer across the GPU/CPU boundary:

GPU portion = free GPU memory - headroom
CPU portion = buffer size - GPU portion

This also solves the problem of page-level migration, as the system can allocate the GPU memory directly.
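Putting the pieces together, a hedged sketch of what the allocation path might look like (the function name and structure are illustrative assumptions, not the actual RMM implementation; requires an HMM- or ATS-enabled system and is not runnable without a GPU):

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdlib>

// Allocate `bytes` of system memory and advise the driver to prefer
// keeping the leading portion on the GPU and the remainder on the CPU,
// preserving `headroom` bytes of free GPU memory for other CUDA calls.
// Illustrative sketch only; error handling omitted for brevity.
void* allocate_sam(std::size_t bytes, std::size_t headroom,
                   std::size_t threshold, int device) {
    // Plain malloc: under HMM/ATS this memory is GPU-accessible.
    void* ptr = std::malloc(bytes);
    if (ptr == nullptr || bytes <= threshold) { return ptr; }

    std::size_t free_mem = 0, total_mem = 0;
    cudaMemGetInfo(&free_mem, &total_mem);

    // GPU portion = free GPU memory - headroom; CPU portion = the rest.
    std::size_t avail     = free_mem > headroom ? free_mem - headroom : 0;
    std::size_t gpu_bytes = std::min(bytes, avail);

    if (gpu_bytes > 0) {
        cudaMemAdvise(ptr, gpu_bytes,
                      cudaMemAdviseSetPreferredLocation, device);
    }
    if (gpu_bytes < bytes) {
        cudaMemAdvise(static_cast<char*>(ptr) + gpu_bytes, bytes - gpu_bytes,
                      cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    }
    return ptr;
}
```

Since the preferred-location advice covers the whole buffer, the driver can place each portion directly rather than migrating pages on fault.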
Alternatives Considered
Another way to work around the issue of SAM taking up all of GPU memory is to add a swap space, which allows the system to swap out GPU memory pages. Since swapping to disk is very slow, we can create the swap file on a ramdisk. This might be a viable solution in certain cases.
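As a concrete illustration of the swap-space approach, RAM-backed swap could be set up with the brd ramdisk block device (sizes are arbitrary examples; requires root). Note that Linux cannot swap to a file on tmpfs, so a ramdisk block device is used here instead:

```shell
# Create a 32 GiB ramdisk block device (rd_size is in KiB).
sudo modprobe brd rd_nr=1 rd_size=33554432
# Format and enable it as swap space.
sudo mkswap /dev/ram0
sudo swapon /dev/ram0
# Verify the new swap device is active.
swapon --show
```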