Closed Akshay-Venkatesh closed 4 years ago
When rc is removed from UCX_TLS to move data between GPUs that are not cuda-ipc accessible, the assertion failure shown in the output below occurs. I hit this while checking whether https://github.com/openucx/ucx/pull/5473 addresses https://github.com/openucx/ucx/issues/3249.
$ mpirun -mca btl ^openib -mca pml ucx -np 2 --map-by ppr:1:socket -x UCX_TLS=mm,cuda_copy,cuda_ipc,gdr_copy -x UCX_MEMTYPE_CACHE=n -x MUCX_MAX_RNDV_RAILS=1 -x CUDA_VISIBLE_DEVICES=0,5 -x LD_LIBRARY_PATH ./get_local_ompi_rank_hca mpi/pt2pt/osu_bw -m 1:$((2 ** 22)) D D
local rank 0: using hca mlx5_0:1
openib using mlx5_0
local rank 1: using hca mlx5_2:1
openib using mlx5_2
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1           0.44
2           0.62
4           1.23
8           2.46
16          7.19
32          9.88
64          19.00
128         37.75
256         74.91
512         111.66
1024        195.81
2048        282.60
4096        339.61
8192        388.52
[prm-dgx-16:37750:0:37750] rndv.c:1689 Assertion `!(rreq->flags & UCP_REQUEST_FLAG_RNDV_FRAG)' failed
==== backtrace (tid: 37750) ====
 0 $UCX_HOME/lib/libucs.so.0(ucs_handle_error+0x73) [0x7fcf78182f1f]
 1 $UCX_HOME/lib/libucs.so.0(ucs_fatal_error_message+0xdf) [0x7fcf7818030c]
 2 $UCX_HOME/lib/libucs.so.0(+0x2b49a) [0x7fcf7818049a]
 3 $UCX_HOME/lib/libucp.so.0(ucp_rndv_data_handler+0x10b) [0x7fcf7888947e]
 4 $UCX_HOME/lib/libuct.so.0(+0x175ea) [0x7fcf785fa5ea]
 5 $UCX_HOME/lib/libuct.so.0(+0x18024) [0x7fcf785fb024]
 6 $UCX_HOME/lib/libucp.so.0(+0x3753f) [0x7fcf7886553f]
 7 $UCX_HOME/lib/libucp.so.0(ucp_worker_progress+0x137) [0x7fcf7886d96f]
 8 $MPI_HOME/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17) [0x7fcf78b3d027]
 9 $MPI_HOME/lib/libopen-pal.so.40(opal_progress+0x48) [0x7fcfb7c1d1e8]
10 $MPI_HOME/lib/libmpi.so.40(ompi_request_default_wait_all+0x4c9) [0x7fcfbb81efc9]
11 $MPI_HOME/lib/libmpi.so.40(PMPI_Waitall+0x337) [0x7fcfbb9242f7]
12 mpi/pt2pt/osu_bw() [0x4027b7]
=================================
cc @bureddy
@Akshay-Venkatesh will check.
@bureddy thanks for the fix.