wenduwan opened 12 months ago
I'm not sure if this is a bug, since technically we don't claim CUDA support in `coll`, according to https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html#what-kind-of-cuda-support-exists-in-open-mpi
> Let me correct what I said yesterday on Slack. All blocking collectives have accelerator support (not the nonblocking versions). If people are interested, the CUDA coll can be extended to provide support for the nonblocking collectives.
Thanks @bosilca for the discussion on Slack.

In my understanding, HAN builds its own blocking collectives on top of non-blocking collectives from other `coll` components, so does that mean HAN in general does not guarantee CUDA support?
So far I've been focusing on reduction collectives, e.g. `MPI_Reduce`, `MPI_Ireduce`, `MPI_Allreduce`, `MPI_Iallreduce`. I observe a common failure mode with the corresponding OMB benchmarks, e.g. `osu_reduce -d cuda`.
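For context, this is roughly the pattern the benchmark exercises: the source buffer lives in CUDA device memory and is handed directly to the reduction. A minimal standalone sketch (illustrative only, not the OMB source):

```c
/* Minimal sketch of the failing pattern: a reduction whose source buffer
 * lives in CUDA device memory. Roughly what osu_reduce -d cuda exercises;
 * this is an illustrative standalone program, not the OMB code. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int count = 1 << 20;
    int *d_src = NULL;
    int *h_dst = malloc(count * sizeof(int));

    /* Source buffer on the GPU -- the op layer later dereferences it
     * directly on the host and segfaults if it is not accelerator-aware. */
    cudaMalloc((void **)&d_src, count * sizeof(int));
    cudaMemset(d_src, 0, count * sizeof(int));

    MPI_Reduce(d_src, h_dst, count, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    cudaFree(d_src);
    free(h_dst);
    MPI_Finalize();
    return 0;
}
```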
I confirmed that both blocking and non-blocking versions have this problem, depending on the `coll` module.
Both `adapt` and `libnbc` provide `ireduce`, and both produce segfaults similar to the one in the original post.
Both `tuned` and `adapt` produce segfaults similar to the one in the original post.
The `ompi_op_reduce` function is not accelerator-aware and assumes both source and target are system (host) buffers. For certain reduce operations, e.g. SUM as used in OMB, it calls into subroutines such as `ompi_op_avx_2buff_sum_int8_t_avx512`. In the example above, this caused a segfault because the source buffer is allocated on a CUDA device.
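For illustration, the 2buff op subroutines are essentially plain host-side loops over the two buffers. The sketch below shows the memory-access pattern only (simplified, not the actual AVX-512 implementation), which is why handing them a device pointer faults on the CPU:

```c
/* Simplified sketch of the "2buff" reduction pattern. The real
 * ompi_op_avx_* functions use AVX/AVX-512 intrinsics, but the memory
 * accesses are the same: plain host-side loads and stores. */
#include <stdint.h>

static void sketch_2buff_sum_int8(const int8_t *in, int8_t *inout, int count)
{
    for (int i = 0; i < count; ++i) {
        /* If 'in' or 'inout' points to CUDA device memory, these
         * dereferences happen on the CPU and segfault. */
        inout[i] += in[i];
    }
}
```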
Based on the above findings, it is not straightforward to declare CUDA support in `coll`, due to implementation differences between the collective modules. Depending on the user's tuning, e.g. which module/algorithm is used, an application might get away with most collectives on CUDA device buffers, except for non-blocking reductions; however, a change in the tuning could break the application just as easily.
As far as `ompi_op_reduce` is concerned, we could possibly introduce accelerator awareness to detect heterogeneous source and target buffers. This might involve additional memory copies between device and host, or some smart on-device reduction tricks. We should be wary of performance impacts, especially for the non-accelerator happy path.
I have only observed the reduction issue so far. I'm not sure what else could cause collectives to fail on CUDA.
An example of protecting `ompi_op_reduce` from illegal device memory access: https://github.com/open-mpi/ompi/blob/76b91ce820dd00b017408a3320bea5c76b78af85/ompi/mca/osc/rdma/osc_rdma_accumulate.c#L496-L514
Background information
While testing Open MPI 5 with OMB, I observed segfaults when running some collective benchmarks with CUDA buffers.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI 5.0.0: https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.0.tar.bz2
OMB 7.3: http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.3.tar.gz
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Configure Open MPI
Configure OMB
If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`.

Please describe the system on which you are running
pml ob1
Details of the problem
Here is an example with `osu_ireduce` on 4 ranks on a single node.

Backtrace:

It appears to be an invalid temp buf in libnbc; note the address `target=0x254fbf0`.