wenduwan opened 12 months ago
I'm not sure if this is a bug, since technically we don't claim CUDA support in `coll`, according to https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html#what-kind-of-cuda-support-exists-in-open-mpi
> Let me correct what I said yesterday on Slack. All blocking collectives have accelerator support (not the nonblocking versions). If people are interested, the CUDA coll can be extended to provide support for the nonblocking collectives.
Thanks @bosilca for the discussion on Slack.

In my understanding, HAN builds its own blocking collectives on top of non-blocking collectives from other `coll` components, so does that mean HAN in general does not guarantee CUDA support?
So far I've been focusing on reduction collectives, e.g. `MPI_Reduce`, `MPI_Ireduce`, `MPI_Allreduce`, `MPI_Iallreduce`. I observe a common failure mode with the corresponding OMB benchmarks, e.g. `osu_reduce -d cuda`.
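For context, this is roughly the pattern the benchmark exercises: the source buffer lives in CUDA device memory and is handed directly to the reduction. A minimal standalone sketch (illustrative only, not the OMB source):

```c
/* Minimal sketch of the failing pattern: a reduction whose source buffer
 * lives in CUDA device memory. Roughly what osu_reduce -d cuda exercises;
 * this is an illustrative standalone program, not the OMB code. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int count = 1 << 20;
    int *d_src = NULL;
    int *h_dst = malloc(count * sizeof(int));

    /* Source buffer on the GPU -- the op layer later dereferences it
     * directly on the host and segfaults if it is not accelerator-aware. */
    cudaMalloc((void **)&d_src, count * sizeof(int));
    cudaMemset(d_src, 0, count * sizeof(int));

    MPI_Reduce(d_src, h_dst, count, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    cudaFree(d_src);
    free(h_dst);
    MPI_Finalize();
    return 0;
}
```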
I confirmed that both blocking and non-blocking versions have this problem, depending on the `coll` module.
Both `adapt` and `libnbc` provide `ireduce`, and both produce segfaults similar to the one in the original post.
Both `tuned` and `adapt` produce segfaults similar to the one in the original post.
The `ompi_op_reduce` function is not accelerator-aware and assumes both source and target are system (host) buffers. For certain reduce operations, e.g. SUM as used in OMB, it calls into subroutines such as `ompi_op_avx_2buff_sum_int8_t_avx512`. In the example above, this caused a segfault because the source buffer is allocated on a CUDA device.
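For illustration, the 2buff op subroutines are essentially plain host-side loops over the two buffers. The sketch below shows the memory-access pattern only (simplified, not the actual AVX-512 implementation), which is why handing them a device pointer faults on the CPU:

```c
/* Simplified sketch of the "2buff" reduction pattern. The real
 * ompi_op_avx_* functions use AVX/AVX-512 intrinsics, but the memory
 * accesses are the same: plain host-side loads and stores. */
#include <stdint.h>

static void sketch_2buff_sum_int8(const int8_t *in, int8_t *inout, int count)
{
    for (int i = 0; i < count; ++i) {
        /* If 'in' or 'inout' points to CUDA device memory, these
         * dereferences happen on the CPU and segfault. */
        inout[i] += in[i];
    }
}
```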
Based on the above findings, it is not straightforward to declare CUDA support in `coll`, due to implementation differences between the collective modules. Depending on the user's tuning, e.g. which module/algorithm is used, an application might get away with most collectives on CUDA device buffers, except for non-blocking reductions; however, a change in the tuning could break the application just as easily.
As far as `ompi_op_reduce` is concerned, we could possibly introduce accelerator awareness to detect heterogeneous source and target buffers. This might involve additional memory copies between device and host, or some smart on-device reduction tricks. We should be wary of performance impacts, especially for the non-accelerator happy path.
I have only observed the reduction issue so far. I'm not sure what else could cause collectives to fail on CUDA.
An example of protecting `ompi_op_reduce` from illegal device memory access: https://github.com/open-mpi/ompi/blob/76b91ce820dd00b017408a3320bea5c76b78af85/ompi/mca/osc/rdma/osc_rdma_accumulate.c#L496-L514
Background information
While testing Open MPI 5 with OMB, I observed segfaults when running some collective benchmarks with CUDA buffers.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI 5.0.0: https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.0.tar.bz2
OMB 7.3: http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.3.tar.gz
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Configure Open MPI
Configure OMB
If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`.

Please describe the system on which you are running
pml ob1
Details of the problem
Here is an example with `osu_ireduce` on 4 ranks on a single node.

Backtrace:

It appears to be an invalid temp buf in libnbc; note the address `target=0x254fbf0`.