rapidsai / rmm

RAPIDS Memory Manager
https://docs.rapids.ai/api/rmm/stable/
Apache License 2.0

[FEA] System Memory Resource #1580

Closed rongou closed 1 week ago

rongou commented 4 weeks ago

Motivation

When doing data processing and machine learning on GPUs with large datasets, we often run into out-of-memory errors. Previously there were two solutions:

CUDA 12.2 introduced Heterogeneous Memory Management (HMM) for x86 systems, which extends the unified memory model to include system allocated memory (SAM) using malloc/free. In the Grace Hopper Superchip, SAM support is further enhanced by a fast NVLink-C2C interconnect with Address Translation Services (ATS). Our initial benchmarks show that SAM on Grace Hopper can provide substantial performance benefits when GPU memory is oversubscribed. If we add SAM support to RMM, there would be minimal changes required for libraries that already use RMM to leverage it.
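To illustrate why RMM integration would require minimal changes: under HMM/ATS, memory from plain malloc/free is directly accessible to GPU kernels, so a SAM-backed memory resource only has to route allocations through the system allocator. The sketch below is illustrative only; the class and method names are assumptions, not RMM's actual API.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical sketch of a SAM-backed memory resource. Under HMM (x86) or
// ATS (Grace Hopper), the pointer returned by std::malloc is directly
// usable from GPU kernels, so no cudaMalloc/cudaMemcpy is needed.
// Names are illustrative; RMM's real resources derive from
// device_memory_resource and take a CUDA stream.
class system_memory_resource_sketch {
 public:
  void* allocate(std::size_t bytes) {
    // System allocation; GPU-accessible when Addressing Mode is HMM or ATS.
    void* p = std::malloc(bytes);
    if (p == nullptr) throw std::bad_alloc{};
    return p;
  }

  void deallocate(void* p, std::size_t /*bytes*/) noexcept { std::free(p); }
};
```

Because the interface mirrors the allocate/deallocate shape RMM already exposes, libraries that obtain memory through an RMM resource would pick up SAM support by simply selecting a resource like this one.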

Goals

Non-Goals

This doesn’t have to be a permanent solution. In the future when SAM improves, we may be able to leverage it directly.

Assumptions

HMM requires the following:

Query the Addressing Mode property to verify that HMM is enabled:

$ nvidia-smi -q | grep Addressing
    Addressing Mode                       : HMM

ATS requires the Grace Hopper Superchip:

$ nvidia-smi -q | grep Addressing
    Addressing Mode                       : ATS

Risks

To test the new memory resource, we need to update the CI/CD pipeline to use GPUs at least as new as Turing, along with a newer open-source driver.

Design

There are two issues with using SAM directly when GPU memory is oversubscribed: system allocations can consume all of GPU memory, leaving no room for other (non-SAM) allocations, and page-level migration between CPU and GPU adds overhead.

To work around these issues, we add two initialization parameters to the memory resource:

To maintain the headroom, we can call cudaMemAdvise with cudaMemAdviseSetPreferredLocation to “pin” the buffer across the GPU/CPU boundary:

GPU portion = free GPU memory - headroom
CPU portion = buffer size - GPU portion

[Figure: RMM System Memory Resource]

This also solves the problem of page level migration as the system can allocate the GPU memory directly.
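The split above can be sketched as plain arithmetic; the resource would then apply cudaMemAdvise with cudaMemAdviseSetPreferredLocation to each portion. The helper name and the clamping behavior below are assumptions for illustration (the CUDA calls are noted in comments so the sketch compiles without a GPU).

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: decide how much of a buffer to prefer-locate on the
// GPU vs. the CPU, given current free GPU memory and the configured headroom.
// The real resource would follow this with cudaMemAdvise(ptr, gpu_bytes,
// cudaMemAdviseSetPreferredLocation, device) on the GPU portion and the
// analogous call targeting the CPU for the remainder.
struct buffer_split {
  std::size_t gpu_bytes;  // portion "pinned" to the GPU
  std::size_t cpu_bytes;  // remainder "pinned" to the CPU
};

buffer_split split_buffer(std::size_t buffer_size,
                          std::size_t free_gpu_memory,
                          std::size_t headroom) {
  // GPU portion = free GPU memory - headroom, clamped to [0, buffer_size]
  std::size_t available =
      free_gpu_memory > headroom ? free_gpu_memory - headroom : 0;
  std::size_t gpu = std::min(buffer_size, available);
  // CPU portion = buffer size - GPU portion
  return {gpu, buffer_size - gpu};
}
```

For example, with 800 MiB of free GPU memory, a 300 MiB headroom, and a 1000 MiB buffer, the first 500 MiB would prefer the GPU and the remaining 500 MiB the CPU.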

Alternatives Considered

Another way to work around the issue of SAM taking up all of GPU memory is to add a swap space, which allows the system to swap out GPU memory pages. Since swapping to disk is very slow, we can create the swap file on a ramdisk. This might be a viable solution in certain cases.