openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

OpenMPI+UCX with multiple GPUs error: "named symbol not found" #10304

pascal-boeschoten-hapteon commented 1 week ago

I'm trying to use OpenMPI+UCX with multiple CUDA devices within the same rank, but I quickly ran into a "named symbol not found" error:

cuda_copy_md.c:375  UCX  ERROR cuMemGetAddressRange(0x7f7553400000) error: named symbol not found
cuda_copy_md.c:375  UCX  ERROR cuMemGetAddressRange(0x7f7553400000) error: named symbol not found
               ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x7f7553400000, length=33554432, access=0xf) failed: Bad address
              ucp_mm.c:70   UCX  ERROR failed to register address 0x7f7553400000 (host) length 33554432 on md[6]=mlx5_bond_0: Input/output error (md supports: host)

This was with OpenMPI 5.0.5 and UCX 1.17. Could this be because, while a transfer is progressing, the associated CUDA device must be the current one (as set with cudaSetDevice())? And if so, is there any way to make this work with multiple devices doing transfers in parallel? I also came across a PR that looks like it may fix the issue I'm having: https://github.com/openucx/ucx/pull/9645

yosefe commented 1 week ago

This error could be asynchronous, coming from a previous failure. Can you please provide more details on the test case and the UCX/CUDA versions?

judicaelclair commented 1 week ago

@yosefe - We (@pascal-boeschoten-hapteon and I) are using UCX 1.17.0 (built from source using the tagged release) alongside CUDA 12.1.105. We encounter the above issue when using MPI_Isend/MPI_Irecv such that, within the same rank, some in-flight requests point to buffers on one GPU while others point to buffers on another GPU. Pseudo-code below:

auto buf_on_cuda_dev_0;
auto buf_on_cuda_dev_1;
cudaSetDevice(0);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_0);
cudaSetDevice(1);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_1);
MPI_Waitall();

If instead, for a given rank, we only use one device at any given time, then the CUDA error disappears and everything works correctly. I.e., the previous pseudo-code would be changed to:

auto buf_on_cuda_dev_0;
auto buf_on_cuda_dev_1;
cudaSetDevice(0);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_0);
MPI_Waitall();
cudaSetDevice(1);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_1);
MPI_Waitall();
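
For completeness, here is a rough but self-contained version of the first (failing) pattern, in case it helps with reproducing. This is only a sketch of what our application does: it assumes exactly two ranks and at least two visible GPUs per node, uses an arbitrary 32 MB buffer size to roughly match the registration length in the log above, and omits error checking. Build with an MPI C compiler and link against the CUDA runtime (e.g. mpicc repro.c -o repro -lcudart).

/* Sketch of the failing pattern: two concurrent non-blocking transfers,
 * each on a buffer that lives on a different CUDA device of the same rank. */
#include <mpi.h>
#include <cuda_runtime.h>

#define COUNT 8388608  /* 8M floats = 32 MB, matching the length in the log */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = rank ^ 1;  /* assumes exactly two ranks */

    /* One buffer on each CUDA device */
    float *buf_dev0, *buf_dev1;
    cudaSetDevice(0);
    cudaMalloc((void **)&buf_dev0, COUNT * sizeof(float));
    cudaSetDevice(1);
    cudaMalloc((void **)&buf_dev1, COUNT * sizeof(float));

    MPI_Request reqs[2];

    /* Post both transfers before waiting, so requests touching different
     * devices are in flight at the same time. */
    cudaSetDevice(0);
    if (rank == 0)
        MPI_Isend(buf_dev0, COUNT, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    else
        MPI_Irecv(buf_dev0, COUNT, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[0]);

    cudaSetDevice(1);
    if (rank == 0)
        MPI_Isend(buf_dev1, COUNT, MPI_FLOAT, peer, 1, MPI_COMM_WORLD, &reqs[1]);
    else
        MPI_Irecv(buf_dev1, COUNT, MPI_FLOAT, peer, 1, MPI_COMM_WORLD, &reqs[1]);

    /* Waiting on both requests together is where the errors show up for us;
     * waiting after each post (one device active at a time) works. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    cudaSetDevice(0);
    cudaFree(buf_dev0);
    cudaSetDevice(1);
    cudaFree(buf_dev1);
    MPI_Finalize();
    return 0;
}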