openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

[CUDA] Support IPC for allocations created by `cuMemCreate` and `cudaMallocAsync` #7110

Open vchuravy opened 3 years ago

vchuravy commented 3 years ago

Describe the bug

CUDA 10.2 introduced a new set of memory allocation routines (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html#group__CUDA__VA) which allow for pooled and stream-based allocation.

These allocations do not support cuIpcGetMemHandle, as noted in https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/:

> The new CUDA virtual memory management functions do not support the legacy cuIpc* functions with their memory. Instead, they expose a new mechanism for interprocess communication that works better with each supported platform. This new mechanism is based on manipulating system–specific handles. On Windows, these are of type HANDLE or D3DKMT_HANDLE, while on Linux-based platforms, these are file descriptors.
>
> To get one of these operating system–specific handles, the new function cuMemExportToShareableHandle is introduced. The appropriate request handle types must be passed to cuMemCreate. By default, memory is not exportable, so shareable handles are not available with the default properties.
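
For concreteness, a minimal sketch of the mechanism described above, assuming a Linux platform (POSIX file-descriptor handles) and omitting error checking; the function name is just for illustration:

```c
#include <cuda.h>

/* Allocate IPC-shareable device memory with the CUDA VMM API and export an
 * OS-specific handle (a file descriptor on Linux) that another process can
 * import with cuMemImportFromShareableHandle. */
static int alloc_shareable(size_t size, CUdeviceptr *dptr, int *shareable_fd)
{
    CUmemAllocationProp prop = {0};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = 0; /* device ordinal */
    /* Request an exportable handle type up front; with the default
     * properties the allocation cannot be exported at all. */
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t padded = ((size + gran - 1) / gran) * gran;

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, padded, &prop, 0);
    cuMemExportToShareableHandle(shareable_fd, handle,
                                 CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0);

    /* Map the allocation into this process's address space. */
    cuMemAddressReserve(dptr, padded, 0, 0, 0);
    cuMemMap(*dptr, padded, 0, handle, 0);

    CUmemAccessDesc access;
    access.location = prop.location;
    access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(*dptr, padded, &access, 1);
    return 0;
}
```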

It seems that cudaMallocAsync, introduced in CUDA 11.2, uses this new interface under the hood, since https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html#group__CUDART__MEMORY__POOLS_1g8158cc4b2c0d2c2c771f9d1af3cf386e takes a HandleType (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1gabde707dfb8a602b917e0b177f77f365).

Steps to Reproduce

See https://github.com/JuliaGPU/CUDA.jl/issues/1053 for an application failure caused by this.

The error encountered is:

```
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
```
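
A minimal standalone reproducer along these lines (my sketch, not the code from the linked CUDA.jl issue):

```c
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    void        *p = NULL;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocAsync(&p, 1 << 20, stream);   /* stream-ordered allocation */
    cudaStreamSynchronize(stream);

    /* The legacy IPC path fails for this pointer; UCX reports the same
     * failure as "cuIpcGetMemHandle return value: 1". */
    CUipcMemHandle handle;
    CUresult st = cuIpcGetMemHandle(&handle, (CUdeviceptr)(uintptr_t)p);
    printf("cuIpcGetMemHandle returned %d\n", (int)st);

    cudaFreeAsync(p, stream);
    cudaStreamSynchronize(stream);
    return 0;
}
```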
Akshay-Venkatesh commented 3 years ago

@vchuravy Is staging cuMemCreate/cudaMallocAsync allocations through cuMemAlloc/cudaMalloc memory not an option? Does https://github.com/JuliaGPU/CUDA.jl/issues/1053 strictly need to use cuMemCreate/cudaMallocAsync?

vchuravy commented 3 years ago

Two notes:

  1. I am only 90% sure that cudaMallocAsync uses cuMemCreate and had to infer that from the surrounding documentation.
  2. That kinda highlights the point: the user doesn't necessarily know, or need to know, which allocation method was used.

From the perspective of CUDA.jl, we currently do not expose the different allocators to the user; the only option is whether the memory pool is managed by CUDA.jl or by the driver via cudaMallocAsync.

Our current workaround for users who want to use UCX or MPI is to disable the use of cudaMallocAsync. At the application level, staging through cudaMalloc might be a possibility as well, but it introduces additional complexities (dealing with provenance, e.g. who allocated the buffer and with which method, plus allocating unnecessary temporary memory).
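
For illustration, a rough sketch of what such staging would look like (an assumed flow, not CUDA.jl's actual implementation):

```c
#include <cuda_runtime.h>

/* Copy a stream-ordered (cudaMallocAsync) buffer into a plain cudaMalloc
 * buffer, which the legacy cuIpc* path does support. The costs mentioned
 * above are visible here: an extra temporary allocation, an extra copy,
 * and the bookkeeping of which buffer came from which allocator. */
static void *stage_for_ipc(const void *async_ptr, size_t size, cudaStream_t stream)
{
    void *staging = NULL;
    cudaMalloc(&staging, size);
    cudaMemcpyAsync(staging, async_ptr, size, cudaMemcpyDeviceToDevice, stream);
    cudaStreamSynchronize(stream);
    return staging;
}
```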

From my perspective as a user of MPI or UCX, I would like to see support for cudaMallocAsync, since those allocations can be IPC-capable.

There seem to be two relevant pointer attributes:

- `CU_POINTER_ATTRIBUTE_IS_LEGACY_CUDA_IPC_CAPABLE`
- `CU_POINTER_ATTRIBUTE_ALLOWED_HANDLE_TYPES`
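
For illustration, a hedged sketch of how a runtime could query these attributes with cuPointerGetAttribute to pick an IPC path (the helper name is mine):

```c
#include <cuda.h>
#include <stdio.h>

/* Query whether a pointer supports the legacy cuIpc* path and which
 * shareable handle types its allocation allows (documented as a bitmask
 * of CUmemAllocationHandleType values). */
static void query_ipc_capability(CUdeviceptr ptr)
{
    int legacy_capable = 0;
    cuPointerGetAttribute(&legacy_capable,
                          CU_POINTER_ATTRIBUTE_IS_LEGACY_CUDA_IPC_CAPABLE, ptr);

    unsigned long long allowed_types = 0;
    cuPointerGetAttribute(&allowed_types,
                          CU_POINTER_ATTRIBUTE_ALLOWED_HANDLE_TYPES, ptr);

    printf("legacy cuIpc* capable: %d, allowed handle types: 0x%llx\n",
           legacy_capable, allowed_types);
    /* A transport could branch here: use cuIpcGetMemHandle when the legacy
     * path is available, otherwise fall back to cuMemExportToShareableHandle
     * if a matching handle type is allowed. */
}
```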
vchuravy commented 2 years ago

This remains an issue (https://discourse.julialang.org/t/cuda-aware-mpi-works-on-system-but-not-for-julia/75060/20?u=vchuravy) and we have to tell users to explicitly disable CUDA memory pool support.

jrhemstad commented 2 years ago

cudaMallocAsync supports CUDA IPC, but requires configuring an explicit pool handle.

See the "Interprocess communication support" section here: https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-2/
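
For reference, a sketch of the explicit-pool flow that blog post describes, assuming Linux (POSIX file-descriptor handles) and omitting error checking; this is not what UCX currently does:

```c
#include <cuda_runtime.h>

/* Create an explicit memory pool whose allocations can be exported for IPC
 * and allocate from it on a stream. The default pool is not exportable. */
static int make_ipc_pool(cudaMemPool_t *pool, void **ptr, size_t size,
                         cudaStream_t stream)
{
    cudaMemPoolProps props = {0};
    props.allocType     = cudaMemAllocationTypePinned;
    props.location.type = cudaMemLocationTypeDevice;
    props.location.id   = 0; /* device ordinal */
    props.handleTypes   = cudaMemHandleTypePosixFileDescriptor;
    cudaMemPoolCreate(pool, &props);

    cudaMallocFromPoolAsync(ptr, size, *pool, stream);

    /* The pool is exported once and shared with peer processes; individual
     * allocations are then exported with cudaMemPoolExportPointer and
     * imported with cudaMemPoolImportPointer. */
    int fd = -1;
    cudaMemPoolExportToShareableHandle(&fd, *pool,
                                       cudaMemHandleTypePosixFileDescriptor, 0);
    return fd;
}
```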

pentschev commented 2 years ago

From the discussions I had with @Akshay-Venkatesh, it seems that using an explicit pool handle for CUDA IPC may not be possible in UCX at the moment, but it will probably become possible in protov2. Meanwhile, support for cudaMallocAsync has been added in https://github.com/openucx/ucx/pull/8623. Given the lack of direct support for CUDA IPC, one intermediate solution is to use staging buffers by setting UCX_RNDV_FRAG_MEM_TYPE=cuda; in our preliminary performance tests in UCX-Py we were able to reach about 90% of the CUDA IPC performance obtained with default CUDA pinned memory, with the advantage of being able to prevent fragmentation. We still have some open issues, though: https://github.com/openucx/ucx/issues/8639 and https://github.com/openucx/ucx/issues/8669 still prevent us from using async memory allocations for specific use cases.