pascal-boeschoten-hapteon opened this issue 1 week ago (status: Open)
This error could be asynchronous, i.e. coming from a previous failure. Could you please provide more details on the test case and the UCX/CUDA versions?
@yosefe - We (@pascal-boeschoten-hapteon and I) are using UCX 1.17.0 (built from source from the tagged release) alongside CUDA 12.1.105. We encounter the above issue when using MPI_Isend/MPI_Irecv such that, from the same rank, some in-flight requests point to buffers located on one GPU while other requests point to buffers on another GPU. Pseudo-code below:
auto buf_on_cuda_dev_0;                    // buffer allocated on CUDA device 0
auto buf_on_cuda_dev_1;                    // buffer allocated on CUDA device 1
cudaSetDevice(0);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_0);    // request for device 0 in flight
cudaSetDevice(1);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_1);    // request for device 1 in flight
MPI_Waitall();                             // requests for both devices progressed together -> CUDA error
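For concreteness, here is a fuller sketch of the failing pattern. The peer rank, tag values, message size, and the use of MPI_Isend (a matching MPI_Irecv on the peer would be analogous) are illustrative assumptions, not details from our actual test case:

#include <mpi.h>
#include <cuda_runtime.h>

// Two buffers on two different CUDA devices, with both non-blocking
// transfers in flight at the same time when MPI_Waitall() progresses them.
void two_devices_in_flight(int peer, int count) {
    float *buf_dev0 = nullptr, *buf_dev1 = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&buf_dev0, count * sizeof(float));   // allocated on device 0
    cudaSetDevice(1);
    cudaMalloc(&buf_dev1, count * sizeof(float));   // allocated on device 1

    MPI_Request reqs[2];

    cudaSetDevice(0);
    MPI_Isend(buf_dev0, count, MPI_FLOAT, peer, /*tag=*/0, MPI_COMM_WORLD, &reqs[0]);
    cudaSetDevice(1);
    MPI_Isend(buf_dev1, count, MPI_FLOAT, peer, /*tag=*/1, MPI_COMM_WORLD, &reqs[1]);

    // Both requests, pointing at buffers on different devices, are progressed
    // here together -- the scenario that triggers the error for us.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    cudaSetDevice(0); cudaFree(buf_dev0);
    cudaSetDevice(1); cudaFree(buf_dev1);
}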
If instead, for a given rank, we only use one device at any given time, then the CUDA error disappears and everything works correctly. I.e., the previous pseudo-code would be changed to:
auto buf_on_cuda_dev_0;                    // buffer allocated on CUDA device 0
auto buf_on_cuda_dev_1;                    // buffer allocated on CUDA device 1
cudaSetDevice(0);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_0);
MPI_Waitall();                             // device 0 request completed before touching device 1
cudaSetDevice(1);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_1);
MPI_Waitall();                             // no error: only one device's requests in flight at a time
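Under the same illustrative assumptions as the sketch above, the working variant only differs in where the waits happen: each device's request is completed before a request for the other device is posted.

// Working variant: never have requests targeting two different devices
// in flight at the same time (buffers allocated as in the sketch above).
void two_devices_serialized(float *buf_dev0, float *buf_dev1, int peer, int count) {
    MPI_Request req;

    cudaSetDevice(0);
    MPI_Isend(buf_dev0, count, MPI_FLOAT, peer, /*tag=*/0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);   // device 0 transfer fully completed here

    cudaSetDevice(1);
    MPI_Isend(buf_dev1, count, MPI_FLOAT, peer, /*tag=*/1, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);   // device 1 transfer only progressed after that
}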
I'm trying to use OpenMPI+UCX with multiple CUDA devices within the same rank, but quickly ran into a "named symbol not found" CUDA error. This was with OpenMPI 5.0.5 and UCX 1.17. Could this be because, while a transfer is being progressed, the associated CUDA device must be the current one (set with cudaSetDevice())? And if so, is there any way to make this work with multiple devices doing transfers in parallel? I also came across a PR that looks like it may fix the issue I'm having: https://github.com/openucx/ucx/pull/9645
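For reference, the device a buffer is associated with can be queried independently of the device currently selected with cudaSetDevice(), e.g. via cudaPointerGetAttributes; a minimal sketch, not code taken from UCX or Open MPI:

#include <cuda_runtime.h>
#include <cstdio>

// Report which CUDA device owns a device pointer, regardless of which
// device is currently active.
void print_owning_device(const void *ptr) {
    cudaPointerAttributes attr{};
    if (cudaPointerGetAttributes(&attr, ptr) == cudaSuccess &&
        attr.type == cudaMemoryTypeDevice) {
        std::printf("%p is device memory on CUDA device %d\n", ptr, attr.device);
    }
}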