Motivation
When doing data processing and machine learning on GPUs with large datasets, we often run into out-of-memory errors. Traditionally, there have been two solutions:
Use unified memory (cudaMallocManaged/cudaFree), which provides a single memory address space accessible from any processor in the system. However, since data migration is triggered by page faults and performed at page granularity, performance is usually poor when GPU memory is oversubscribed.
Manually implement a custom spilling solution. This can be complicated and has to be done for every library.
CUDA 12.2 introduced Heterogeneous Memory Management (HMM) for x86 systems, which extends the unified memory model to include system allocated memory (SAM) obtained via malloc/free. On the Grace Hopper Superchip, SAM support is further enhanced by the fast NVLink-C2C interconnect with Address Translation Services (ATS). Our initial benchmarks show that SAM on Grace Hopper can provide substantial performance benefits when GPU memory is oversubscribed. If we add SAM support to RMM, libraries that already use RMM would need only minimal changes to leverage it.
Goals
Implement an RMM device memory resource that uses system allocated memory.
Work around existing limitations of SAM (see below in design).
Non-Goals
This doesn’t have to be a permanent solution. In the future when SAM improves, we may be able to leverage it directly.
Assumptions
HMM requires the following:
NVIDIA CUDA 12.2 with the open-source r535_00 driver or newer.
A sufficiently recent Linux kernel: 6.1.24+, 6.2.11+, or 6.3+.
A GPU with one of the following supported architectures: NVIDIA Turing, NVIDIA Ampere, NVIDIA Ada Lovelace, NVIDIA Hopper, or newer.
A 64-bit x86 CPU.
Query the Addressing Mode property to verify that HMM is enabled:
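On a supported system, this can be done from the command line with nvidia-smi (assuming a sufficiently recent driver):

```shell
# Query the Addressing Mode property; on an HMM-enabled x86 system
# this is expected to report "HMM".
nvidia-smi -q | grep "Addressing Mode"
```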
ATS requires the Grace Hopper Superchip.
Risks
In order to test the new memory resource, we need to update the CI/CD pipeline to use at least a Turing GPU and a newer open-source driver.
Design
There are two issues with using SAM directly when GPU memory is oversubscribed:
Currently a SAM buffer can only migrate one way: from the CPU to the GPU. If we allocate a SAM buffer larger than the remaining free GPU memory and use it in a kernel, it will take up all the free memory. Subsequently, if we make CUDA calls that require some amount of GPU memory (e.g. manipulating CUDA events/streams, or initializing cuBLAS), these calls will fail.
Migration is still triggered by page faults and can be slow if compute density is not very high.
To work around these issues, we add two initialization parameters to the memory resource:
Headroom: the amount of GPU memory to leave for other CUDA calls. When allocating large buffers, we can check how much free memory is left and make sure the given headroom amount is reserved for other CUDA calls.
Threshold: the size of the requested buffer above which to check for headroom. Since checking for free memory can be expensive, we can skip it when allocating small buffers below the threshold.
To maintain the headroom, we can call cudaMemAdvise with cudaMemAdviseSetPreferredLocation to “pin” the buffer across the GPU/CPU boundary:

GPU portion = free GPU memory - headroom
CPU portion = buffer size - GPU portion

This also solves the problem of page-level migration, as the system can allocate the GPU memory directly.
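Putting the pieces together, a hedged sketch of what the allocation path might look like (the function name and structure are illustrative assumptions, not the actual RMM implementation; requires an HMM- or ATS-enabled system and is not runnable without a GPU):

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdlib>

// Allocate `bytes` of system memory and advise the driver to prefer
// keeping the leading portion on the GPU and the remainder on the CPU,
// preserving `headroom` bytes of free GPU memory for other CUDA calls.
// Illustrative sketch only; error handling omitted for brevity.
void* allocate_sam(std::size_t bytes, std::size_t headroom,
                   std::size_t threshold, int device) {
    // Plain malloc: under HMM/ATS this memory is GPU-accessible.
    void* ptr = std::malloc(bytes);
    if (ptr == nullptr || bytes <= threshold) { return ptr; }

    std::size_t free_mem = 0, total_mem = 0;
    cudaMemGetInfo(&free_mem, &total_mem);

    // GPU portion = free GPU memory - headroom; CPU portion = the rest.
    std::size_t avail     = free_mem > headroom ? free_mem - headroom : 0;
    std::size_t gpu_bytes = std::min(bytes, avail);

    if (gpu_bytes > 0) {
        cudaMemAdvise(ptr, gpu_bytes,
                      cudaMemAdviseSetPreferredLocation, device);
    }
    if (gpu_bytes < bytes) {
        cudaMemAdvise(static_cast<char*>(ptr) + gpu_bytes, bytes - gpu_bytes,
                      cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    }
    return ptr;
}
```

Since the preferred-location advice covers the whole buffer, the driver can place each portion directly rather than migrating pages on fault.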
Alternatives Considered
Another way to work around the issue of SAM taking up all of GPU memory is to add a swap space, which allows the system to swap out GPU memory pages. Since swapping to disk is very slow, we can create the swap file on a ramdisk. This might be a viable solution in certain cases.
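As a concrete illustration of the swap-space approach, RAM-backed swap could be set up with the brd ramdisk block device (sizes are arbitrary examples; requires root). Note that Linux cannot swap to a file on tmpfs, so a ramdisk block device is used here instead:

```shell
# Create a 32 GiB ramdisk block device (rd_size is in KiB).
sudo modprobe brd rd_nr=1 rd_size=33554432
# Format and enable it as swap space.
sudo mkswap /dev/ram0
sudo swapon /dev/ram0
# Verify the new swap device is active.
swapon --show
```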