openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7f9030c18000/8000 #10087

Open jinz2014 opened 2 months ago

jinz2014 commented 2 months ago

Describe the issue

[1724610589.249079] [cousteau:2779987:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7f9030c18000/8000
[1724610589.249092] [cousteau:2779986:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fd7af610000/8000
[cousteau:2779987:0:2779987] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
[cousteau:2779986:0:2779986] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed

Steps to Reproduce

export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR

export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install

export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make install

export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH

The example is in https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-hip

make run

Setup and versions

edgargabriel commented 2 months ago

@jinz2014 this is most likely a system setup / permission issue on your side, since UCX 1.15 has been used extensively with numerous applications on MI100.

Can you please check the following things:

jinz2014 commented 2 months ago

The answers are yes to both questions. I didn't paste the output completely; the program starts to produce error messages after an initial successful execution:

Verified allreduce for size 0 (19.865 us per iteration)
Verified allreduce for size 32 (52.7884 us per iteration)
Verified allreduce for size 256 (94.3108 us per iteration)
Verified allreduce for size 1024 (73.2143 us per iteration)
Verified allreduce for size 4096 (88.3691 us per iteration)
[1724605863.595828] [cousteau:2757379:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7f8d0fa10000/8000
[1724605863.595828] [cousteau:2757380:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7effaba18000/8000
[cousteau:2757380:0:2757380] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
[cousteau:2757379:0:2757379] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
==== backtrace (tid:2757380) ====
 0 /home/user/ompi_for_gpu/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f01d67bbd84]
 1 /home/user/ompi_for_gpu/ucx/lib/libucs.so.0(ucs_fatal_error_message+0xc2) [0x7f01d67b8dc2]
 2 /home/user/ompi_for_gpu/ucx/lib/libucs.so.0(ucs_fatal_error_format+0x11a) [0x7f01d67b8eea]
 3 /home/user/ompi_for_gpu/ucx/lib/libucp.so.0(ucp_rndv_progress_rma_put_zcopy+0x1b8) [0x7f01d68a8a08]
 4 /home/user/ompi_for_gpu/ucx/lib/libucp.so.0(ucp_rndv_atp_handler+0x217) [0x7f01d68a9ac7]
 5 /home/user/ompi_for_gpu/ucx/lib/libuct.so.0(+0x1c6ad) [0x7f01cd7916ad]
 6 /home/user/ompi_for_gpu/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a) [0x7f01d6859e3a]
 7 /home/user/ompi_for_gpu/ompi/lib/libmpi.so.40(mca_pml_ucx_send+0x1bf) [0x7f01d8bd21df]
 8 /home/user/ompi_for_gpu/ompi/lib/libmpi.so.40(MPI_Send+0x183) [0x7f01d8a59b63]
 9 ./main() [0x206a9f]
10 ./main() [0x205b6a]
11 ./main() [0x2060cd]
12 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f01d698ed90]
13 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f01d698ee40]
14 ./main() [0x205835]

edgargabriel commented 2 months ago

Could you please provide the full command line that you used? I see that the put_zcopy protocol is being utilized, which is not the default with 1.15; it should be the get_zcopy protocol.

jinz2014 commented 2 months ago

Sorry, I'm not familiar with the two protocols.

"make run" shows the full command:

$HOME/ompi_for_gpu/ompi/bin/mpirun -n 2 ./main

Thank you for the instructions.

edgargabriel commented 2 months ago

So just for a test, could you change the command line to the following:

$HOME/ompi_for_gpu/ompi/bin/mpirun -x UCX_RNDV_SCHEME=get_zcopy -n 2 ./main

to see whether it makes a difference?

jinz2014 commented 2 months ago

Ok.

$HOME/ompi_for_gpu/ompi/bin/mpirun -x UCX_RNDV_SCHEME=get_zcopy -n 2 ./main
[1724695907.879444] [cousteau:3183448:0] parser.c:2036 UCX WARN unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1724695907.879444] [cousteau:3183448:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1724695907.897366] [cousteau:3183447:0] parser.c:2036 UCX WARN unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1724695907.897366] [cousteau:3183447:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
Verified allreduce for size 0 (20.0202 us per iteration)
Verified allreduce for size 32 (52.4041 us per iteration)
Verified allreduce for size 256 (91.5858 us per iteration)
Verified allreduce for size 1024 (67.5217 us per iteration)
Verified allreduce for size 4096 (79.6616 us per iteration)
[1724695938.071148] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f810000/8000
[1724695938.071145] [cousteau:3183448:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7faefc618000/8000
[1724695938.071299] [cousteau:3183448:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7faefc620000/8000
[1724695938.071304] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f820000/8000
[1724695938.071585] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f818000/8000
[1724695938.071597] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f810000/8000

edgargabriel commented 2 months ago

Hm. Ok, I will see whether I can reproduce the issue locally. Are there instructions on how to compile the test code in the GitHub repo?

jinz2014 commented 2 months ago

export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR

export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install

export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make install

export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH

The example is in https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-hip

make run

will build and run the program.

The HIP example was migrated from the CUDA example. I didn't observe errors when running the CUDA code, so I am not clear where the issue in the HIP example is. https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-cuda

Thanks

edgargabriel commented 2 months ago

Ok, so just to clarify: compiling the example is simply make run? (I am compiling UCX and Open MPI on a daily basis, that is not the challenge :-) )

jinz2014 commented 2 months ago

make run
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 -c main.cu -o main.o
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 -c collectives.cu -o collectives.o
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 -c timer.cu -o timer.o
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 main.o collectives.o timer.o -o main -L$HOME/ompi_for_gpu/ompi/lib -lmpi -DOMPI_SKIP_MPICXX=
$HOME/ompi_for_gpu/ompi/bin/mpirun -x UCX_RNDV_SCHEME=get_zcopy -n 2 ./main

The original CUDA code is https://github.com/baidu-research/baidu-allreduce

edgargabriel commented 2 months ago

I can confirm that I can reproduce the issue. In my case it is an MI250X system with ROCm 6.2 and UCX 1.16 (that is my default development platform at the moment), but the same error occurs. I will put it on my list of items to work on, but it might be towards the end of the week before I get to it.

jinz2014 commented 2 months ago

Okay.

edgargabriel commented 2 months ago

I think I know what the issue is, but I do not know yet whether it's something that we are doing wrong in the ROCm components of UCX or whether it's a bug in the ROCm runtime layer.

I do, however, have a quick workaround for your code (since a proper fix might take a while):

If you allocate the output buffer outside of the RingAllreduce test and pass it in as an argument to RingAllreduce (e.g., allocate it just before the for(size_t iter = 0; iter < iters; iter++) loop, and perform a hipMemset(output, 0, size * sizeof(float)) in the loop body before calling RingAllreduce), you avoid the hipMalloc() + hipFree() of the buffer on every iteration and instead do it just once per message size. With this modification, the test passes for me; a sketch follows below.
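
For concreteness, here is a minimal sketch of that workaround. The names (RingAllreduce, data, size, iters) follow the HeCBench allreduce example, but the RingAllreduce signature shown here is an assumption made for illustration and should be adapted to the actual code:

```cpp
// Sketch of the workaround: allocate the output buffer once per message size
// and clear it each iteration, instead of hipMalloc()/hipFree() inside
// RingAllreduce on every iteration.
// NOTE: the RingAllreduce declaration below is hypothetical; the real
// function in collectives.cu may have a different signature.
#include <hip/hip_runtime.h>
#include <cstddef>

void RingAllreduce(float* data, size_t length, float* output);  // hypothetical declaration

void benchmark_one_size(float* data, size_t size, size_t iters) {
  float* output = nullptr;
  hipMalloc(&output, size * sizeof(float));        // allocate once per message size

  for (size_t iter = 0; iter < iters; iter++) {
    hipMemset(output, 0, size * sizeof(float));    // reset instead of re-allocating
    RingAllreduce(data, size, output);             // buffer passed in as an argument
  }

  hipFree(output);                                 // freed once, after the loop
}
```

The point of the change is simply that the output buffer is allocated and freed once per message size rather than once per iteration, which, per the discussion above, appears to avoid the failing ROCm IPC handle creation.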

Let me emphasize, however, that your code is correct and should work.

jinz2014 commented 2 months ago

I added another example: https://github.com/zjin-lcf/HeCBench/blob/master/src/pingpong-hip/main.cu Does running that example cause similar errors?
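
For context, a ping-pong of GPU buffers between two MPI ranks exercises the same GPU send/receive path as the allreduce. A minimal sketch of that pattern is shown below; this is an illustration only, and the actual main.cu in the linked example may differ in message sizes, iteration counts, and timing:

```cpp
// Minimal sketch of a device-buffer MPI ping-pong between two ranks;
// an illustration of the pattern, not the contents of the linked main.cu.
#include <mpi.h>
#include <hip/hip_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n = 1 << 20;                      // number of floats per message
  float* buf = nullptr;
  hipMalloc(&buf, n * sizeof(float));         // GPU buffer used directly in MPI calls

  for (int iter = 0; iter < 100; iter++) {
    if (rank == 0) {
      MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
      MPI_Recv(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
      MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Send(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }
  }

  hipFree(buf);
  MPI_Finalize();
  return 0;
}
```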

Thank you for the workaround.