cadedaniel opened this issue 3 years ago
I don't think that Open MPI would make any CUDA call if there are no active MPI requests, i.e. waiting for all MPI requests to complete should be sufficient. @Akshay-Venkatesh to confirm.
If I'm not mistaken, "epoch" here is used in the sense of a sequential ordering of switches between libraries; in other terms, when your application switches from an NCCL time-based context to an OMPI time-based context (or vice versa), no lingering communications should remain. @sjeaugey is correct, OMPI will not issue any GPU kernel without active communications to or from GPU memory, so a safe epoch is something you can implement at the user level by completing all GPU-based communications (some form of MPI_Wait) before switching from OMPI to NCCL.
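A minimal sketch of that user-level epoch boundary, assuming placeholder names supplied by the application (`reqs`/`nreqs` for its outstanding CUDA-aware MPI requests, `d_send`/`d_recv` for device buffers, `nccl_comm` and `stream` for an existing NCCL communicator and CUDA stream):

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

/* Sketch: close the CUDA-aware MPI epoch, then run NCCL. All names are
 * placeholders supplied by the application. */
void mpi_epoch_to_nccl_epoch(MPI_Request *reqs, int nreqs,
                             const float *d_send, float *d_recv, size_t count,
                             ncclComm_t nccl_comm, cudaStream_t stream)
{
    /* No CUDA-aware MPI request may still be in flight when NCCL starts,
     * otherwise Open MPI could issue GPU work that races with NCCL's
     * blocking kernels. */
    MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);

    /* NCCL epoch: only now is it safe to launch collectives. */
    ncclAllReduce(d_send, d_recv, count, ncclFloat, ncclSum, nccl_comm, stream);

    /* Drain the NCCL work before any further CUDA-aware MPI call is made,
     * closing the NCCL epoch. */
    cudaStreamSynchronize(stream);
}
```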
@cadedaniel Do you see hangs when you use non-overlapping NCCL and CUDA-aware MPI epochs?
Background information
Hi team, thanks for your work on OpenMPI.
I am trying to use NCCL concurrently with CUDA-aware Open MPI. NCCL's documentation makes a careful note that this scenario will cause hangs unless global synchronization is introduced.
Context on why NCCL+CUDA-aware OpenMPI hangs
The global synchronization is required because NCCL launches blocking CUDA kernels that wait until every other participating GPU has launched its matching kernel. This causes hangs if, for example, another NCCL communication occurs concurrently on the same set of GPUs with a different launch order across ranks. See the following example:
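As a rough illustration of such a hang (a sketch only, with placeholder names): assume two NCCL communicators `commA` and `commB` over the same pair of GPUs, device buffers `d_a`/`d_b` of `n` floats, and one CUDA stream per rank.

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* Illustrative only: commA and commB are two NCCL communicators spanning
 * the same two GPUs; d_a and d_b are device buffers of n floats. */
void mismatched_order_deadlock(int rank, float *d_a, float *d_b, size_t n,
                               ncclComm_t commA, ncclComm_t commB,
                               cudaStream_t stream)
{
    /* Each ncclAllReduce enqueues a kernel that blocks until every rank has
     * launched the matching kernel; with one in-order stream per rank the
     * second kernel cannot start before the first finishes. */
    if (rank == 0) {
        ncclAllReduce(d_a, d_a, n, ncclFloat, ncclSum, commA, stream); /* waits for rank 1 on commA */
        ncclAllReduce(d_b, d_b, n, ncclFloat, ncclSum, commB, stream);
    } else {
        ncclAllReduce(d_b, d_b, n, ncclFloat, ncclSum, commB, stream); /* waits for rank 0 on commB */
        ncclAllReduce(d_a, d_a, n, ncclFloat, ncclSum, commA, stream);
    }
    /* Rank 0's commA kernel waits on rank 1, whose commA launch is queued
     * behind its commB kernel, which waits on rank 0's commB launch, which
     * is queued behind rank 0's commA kernel -- a cycle, so nothing completes. */
    cudaStreamSynchronize(stream); /* never returns */
}
```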
For this same reason, global synchronization is required when using NCCL with CUDA-aware Open MPI; my understanding is not as solid here, but I believe Open MPI may introduce similar inter-device dependencies concurrently with NCCL, causing a form of the deadlock described above. One way to solve this problem is with "communication epochs", i.e. mutually exclusive periodic windows in which NCCL and CUDA-aware Open MPI each execute.
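A rough sketch of what such epochs could look like at the application level, with hypothetical helpers `do_nccl_collectives()` and `do_cuda_aware_mpi()` standing in for the application's actual communication calls:

```c
#include <mpi.h>
#include <cuda_runtime.h>

#define MAX_REQS 64

/* Hypothetical application helpers (not real library APIs). */
void do_nccl_collectives(cudaStream_t stream);
int  do_cuda_aware_mpi(MPI_Request *reqs, int max_reqs);

/* Per-step loop: one NCCL window, fully drained, then one CUDA-aware MPI
 * window, fully drained, so the two libraries never overlap on the GPU. */
void communication_epochs(int nsteps, cudaStream_t stream)
{
    MPI_Request reqs[MAX_REQS];
    int nreqs;

    for (int step = 0; step < nsteps; ++step) {
        /* NCCL epoch */
        do_nccl_collectives(stream);
        cudaStreamSynchronize(stream);                  /* no NCCL kernel left on the GPU */

        /* CUDA-aware MPI epoch */
        nreqs = do_cuda_aware_mpi(reqs, MAX_REQS);
        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);  /* no MPI GPU traffic left */
    }
}
```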
My question
With this context, my question is: is it possible to implement a CUDA "communication epoch" with CUDA-aware Open MPI, so that all CUDA-related calls made by Open MPI are confined to that window?
Naively, one could simply avoid making any MPI calls outside the designated communication epoch. But this ignores the possibility of MPI background threads making CUDA calls, which could violate the mutual exclusion and cause hangs in NCCL. Is this a real concern, or is the naive solution actually viable?
Is there any other documentation or information that would help me better understand the limitations of CUDA-aware Open MPI + NCCL? I've read issues like https://github.com/open-mpi/ompi/issues/7733, which was useful for understanding the fundamentals.
What version of Open MPI are you using?
OpenMPI v4.0.1, NCCL 2.7
Describe how Open MPI was installed
From source, currently without CUDA-aware flags.
Please describe the system on which you are running