Open jinz2014 opened 2 months ago
@jinz2014 this is most likely a system setup / permission issue on your side, since UCX 1.15 has been used extensively with numerous applications on MI100.
Can you please check the following things: is your user part of the video and render groups on the node that you are using? Can you for example execute rocminfo and get the correct output? (This is the most likely issue that I can think of.)

The answers are yes to both questions. I didn't paste the result completely. The program starts to produce error messages after an initial successful execution:
Verified allreduce for size 0 (19.865 us per iteration)
Verified allreduce for size 32 (52.7884 us per iteration)
Verified allreduce for size 256 (94.3108 us per iteration)
Verified allreduce for size 1024 (73.2143 us per iteration)
Verified allreduce for size 4096 (88.3691 us per iteration)
[1724605863.595828] [cousteau:2757379:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7f8d0fa10000/8000
[1724605863.595828] [cousteau:2757380:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7effaba18000/8000
[cousteau:2757380:0:2757380] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
[cousteau:2757379:0:2757379] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
==== backtrace (tid:2757380) ====
 0 /home/user/ompi_for_gpu/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f01d67bbd84]
 1 /home/user/ompi_for_gpu/ucx/lib/libucs.so.0(ucs_fatal_error_message+0xc2) [0x7f01d67b8dc2]
 2 /home/user/ompi_for_gpu/ucx/lib/libucs.so.0(ucs_fatal_error_format+0x11a) [0x7f01d67b8eea]
 3 /home/user/ompi_for_gpu/ucx/lib/libucp.so.0(ucp_rndv_progress_rma_put_zcopy+0x1b8) [0x7f01d68a8a08]
 4 /home/user/ompi_for_gpu/ucx/lib/libucp.so.0(ucp_rndv_atp_handler+0x217) [0x7f01d68a9ac7]
 5 /home/user/ompi_for_gpu/ucx/lib/libuct.so.0(+0x1c6ad) [0x7f01cd7916ad]
 6 /home/user/ompi_for_gpu/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a) [0x7f01d6859e3a]
 7 /home/user/ompi_for_gpu/ompi/lib/libmpi.so.40(mca_pml_ucx_send+0x1bf) [0x7f01d8bd21df]
 8 /home/user/ompi_for_gpu/ompi/lib/libmpi.so.40(MPI_Send+0x183) [0x7f01d8a59b63]
 9 ./main() [0x206a9f]
10 ./main() [0x205b6a]
11 ./main() [0x2060cd]
12 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f01d698ed90]
13 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f01d698ee40]
14 ./main() [0x205835]
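As a complementary check to rocminfo, a minimal HIP program along the following lines (a hypothetical helper, not part of the benchmark) can confirm that the GPUs are actually accessible to the user running the job; it typically fails or reports zero devices when video/render group membership is missing:

// list_gpus.cpp (illustrative); build with: hipcc list_gpus.cpp -o list_gpus
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipError_t err = hipGetDeviceCount(&count);
    if (err != hipSuccess) {
        // Fails here if the calling user cannot open the GPU devices.
        std::printf("hipGetDeviceCount failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    std::printf("Visible HIP devices: %d\n", count);
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        if (hipGetDeviceProperties(&prop, i) == hipSuccess) {
            std::printf("  device %d: %s\n", i, prop.name);
        }
    }
    return 0;
}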
Could you please provide the full command line that you used? I see that the put_zcopy protocol is being used, which is not the default with 1.15; it should be the get_zcopy protocol.
Sorry, I don't know the two protocols.
"make run" shows the full command:
$HOME/ompi_for_gpu/ompi/bin/mpirun -n 2 ./main
Thank you for the instructions.
So just for a test, could you change the command line to the following:
$HOME/ompi_for_gpu/ompi/bin/mpirun -x UCX_RNDV_SCHEME=get_zcopy -n 2 ./main
to see whether it makes a difference?
Ok.
$HOME/ompi_for_gpu/ompi/bin/mpirun -x UCX_RNDV_SCHEME=get_zcopy -n 2 ./main
[1724695907.879444] [cousteau:3183448:0] parser.c:2036 UCX WARN unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1724695907.879444] [cousteau:3183448:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1724695907.897366] [cousteau:3183447:0] parser.c:2036 UCX WARN unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1724695907.897366] [cousteau:3183447:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
Verified allreduce for size 0 (20.0202 us per iteration)
Verified allreduce for size 32 (52.4041 us per iteration)
Verified allreduce for size 256 (91.5858 us per iteration)
Verified allreduce for size 1024 (67.5217 us per iteration)
Verified allreduce for size 4096 (79.6616 us per iteration)
[1724695938.071148] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f810000/8000
[1724695938.071145] [cousteau:3183448:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7faefc618000/8000
[1724695938.071299] [cousteau:3183448:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7faefc620000/8000
[1724695938.071304] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f820000/8000
[1724695938.071585] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f818000/8000
[1724695938.071597] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f810000/8000
Hm. Ok, I will see whether I can reproduce the issue locally. Are there instructions on how to compile the test code in the GitHub repo?
export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR

export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install

export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make install

export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH
The example is in https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-hip
make run
will build and run the program.
The HIP example was migrated from the CUDA example. I didn't observe errors when running the CUDA code, so I am not sure where the issue in the HIP example lies. https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-cuda
Thanks
Ok, just to clarify: compiling the example is simply make run?
(I am compiling UCX and Open MPI on a daily basis, that is not the challenge :-) )
make run
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 -c main.cu -o main.o
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 -c collectives.cu -o collectives.o
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 -c timer.cu -o timer.o
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 main.o collectives.o timer.o -o main -L$HOME/ompi_for_gpu/ompi/lib -lmpi -DOMPI_SKIP_MPICXX=
$HOME/ompi_for_gpu/ompi/bin/mpirun -x UCX_RNDV_SCHEME=get_zcopy -n 2 ./main
The original CUDA code is https://github.com/baidu-research/baidu-allreduce
I can confirm that I can reproduce the issue. In my case it is an MI250X system with ROCm 6.2 and UCX 1.16 (my default development platform at the moment), but the same error occurs. I will put it on my list of items to work on, but it might be towards the end of the week before I get to it.
Okay.
I think I know what the issue is, but I do not know yet whether it's something that we are doing wrong in the rocm components of UCX or whether it's a bug in the ROCm runtime layer.
I do, however, have a quick workaround for your code (since a proper fix might take a while):
If you allocate the output buffer outside of the RingAllreduce test and pass it in as an argument to RingAllreduce (e.g. allocate it just before the for(size_t iter = 0; iter < iters; iter++) loop and perform a hipMemset(output, 0, size * sizeof(float)) in the loop body before calling RingAllreduce), you avoid the hipMalloc() + hipFree() of the buffer on every iteration (and do it just once per message size). With this modification, the test passes for me.
Let me emphasize, however, that your code is correct and should work.
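A minimal sketch of that restructuring (illustrative only: the actual RingAllreduce signature and loop in the benchmark may differ, and error checking is omitted):

#include <hip/hip_runtime.h>

// Assumed variant of RingAllreduce that writes into a caller-provided buffer.
void RingAllreduce(float* input, size_t length, float* output);

void TestAllreduceForSize(float* input, size_t size, size_t iters) {
    float* output = nullptr;
    // Allocate the output buffer once per message size, before the iteration
    // loop, instead of allocating and freeing it inside every call.
    hipMalloc(&output, size * sizeof(float));

    for (size_t iter = 0; iter < iters; iter++) {
        // Reset the reused buffer rather than reallocating it each iteration.
        hipMemset(output, 0, size * sizeof(float));
        RingAllreduce(input, size, output);
    }

    hipFree(output);   // free once, after all iterations
}

The point of the change is simply that the same device allocation is reused across iterations instead of being created and destroyed each time.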
I added another example: https://github.com/zjin-lcf/HeCBench/blob/master/src/pingpong-hip/main.cu. Does running the example cause similar errors?
Thank you for the workaround.
Describe the issue
[1724610589.249079] [cousteau:2779987:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7f9030c18000/8000
[1724610589.249092] [cousteau:2779986:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fd7af610000/8000
[cousteau:2779987:0:2779987] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
[cousteau:2779986:0:2779986] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed

Steps to Reproduce
export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR

export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install

export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make install

export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH
The example is in https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-hip
make run
Setup and versions