openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

[mtt] Fail in one-sided/osu_put_bibw on UCX + UCC + OMPI5x + CUDA #7538

Open avildema opened 3 years ago

avildema commented 3 years ago

Configuration

Nodes: vulcan x2 (ppn=28(x2), nodelist=jazz[01,16])
OFED: MLNX_OFED_LINUX-5.4-1.0.3.0
MPI: v5.0.0rc1

MTT log: http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/html/test_stdout_Q2QozI.txt

Build:

```shell
module load hpcx-env/cuda
module load hpcx-env/gdrcopy

# UCX
./autogen.sh
./contrib/configure-devel --disable-mt --with-cuda=$CUDA_HOME --prefix=@ucx_dir@
make -j 9
make -j 9 install

# UCC
./autogen.sh
./configure --with-cuda=$CUDA_HOME --prefix=@ucc_dir@ --with-ucx=@ucx_dir@
make -j 9
make -j 9 install

# OMPI
./autogen.pl
./configure --disable-man-pages --prefix=@mpi_home@ --with-cuda=$CUDA_HOME --with-ucx=@ucx_dir@ --with-ucc=@ucc_dir@
make -j 9
make -j 9 install
```

Cmd:

```shell
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/ompi_src/install/bin/mpirun -np 2 --display map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 --map-by node --bind-to core /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/installs/okwU/tests/osu_micro_benchmark/osu-micro-benchmarks-5.6.2/mpi/one-sided/osu_put_bibw
```

Output:

```
=============================================================
# OSU MPI_Put Bi-directional Bandwidth Test v5.6.2
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_post/start/complete/wait
# Size      Bandwidth (MB/s)
[vulcan02:19301:0:19301] ib_mlx5_log.c:170  Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[vulcan02:19301:0:19301] ib_mlx5_log.c:170  RC QP 0x2f57 wqe[0]: FETCH_ADD s-- [rva 0x562d778 rkey 0x2bf8] [add 1] [va 0x7fee5d251fb8 len 8 lkey 0x1236dd] [rqpn 0x2a65 dlid=6 sl=0 port=1 src_path_bits=0]
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/ucx_src/src/uct/ib/mlx5/ib_mlx5_log.c: [ uct_ib_mlx5_completion_with_err() ]
      ...      159                  uct_ib_mlx5_cqe_err_opcode(ecqe));
      160     }
      161 
==>   162     ucs_log(log_level,
      163             "%s on " UCT_IB_IFACE_FMT
      164             "/%s (synd 0x%x vend 0x%x hw_synd %d/%d)\n"
      165             "%s QP 0x%x wqe[%d]: %s %s",
==== backtrace (tid:  19301) ====
 0 0x0000000000027a64 uct_ib_mlx5_completion_with_err()  /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/ucx_src/src/uct/ib/mlx5/ib_mlx5_log.c:162
 1 0x0000000000051c6b uct_ib_mlx5_poll_cq()  /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/ucx_src/src/uct/ib/mlx5/ib_mlx5.inl:91
 2 0x0000000000051c6b uct_ib_mlx5_poll_cq()  /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/ucx_src/src/uct/ib/mlx5/ib_mlx5.inl:92
 3 0x0000000000051c6b uct_rc_mlx5_iface_poll_tx()  /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/ucx_src/src/uct/ib/rc/accel/rc_mlx5_iface.c:135
 4 0x0000000000051c6b uct_rc_mlx5_iface_progress()  /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/ucx_src/src/uct/ib/rc/accel/rc_mlx5_iface.c:173
 5 0x0000000000051c6b uct_rc_mlx5_iface_progress_cyclic()  /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/ucx_src/src/uct/ib/rc/accel/rc_mlx5_iface.c:178
 6 0x0000000000051fc2 ucs_callbackq_dispatch()  /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/ucx_src/src/ucs/datastruct/callbackq.h:211
 7 0x0000000000051fc2 uct_worker_progress()  /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/ucx_src/src/uct/api/uct.h:2591
 8 0x0000000000051fc2 ucp_worker_progress()  /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/ucx_src/src/ucp/core/ucp_worker.c:2568
 9 0x0000000000209ae6 ompi_osc_ucx_post()  ???:0
10 0x00000000000fa6db MPI_Win_post()  ???:0
11 0x0000000000402919 run_put_with_pscw()  /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/installs/okwU/tests/osu_micro_benchmark/osu-micro-benchmarks-5.6.2/mpi/one-sided/osu_put_bibw.c:239
12 0x00000000004021f7 main()  /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20211013_193107_15542_75841_vulcan02.swx.labs.mlnx/installs/okwU/tests/osu_micro_benchmark/osu-micro-benchmarks-5.6.2/mpi/one-sided/osu_put_bibw.c:116
13 0x00000000000223d5 __libc_start_main()  ???:0
14 0x00000000004022c7 _start()  ???:0
=================================
[vulcan02:19301:0:19301] Process frozen...
```
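For context on where the failure occurs: osu_put_bibw synchronizes its MPI_Put traffic with MPI's post-start-complete-wait (PSCW) epochs, and the backtrace above fails inside MPI_Win_post (via ompi_osc_ucx_post). Below is a minimal, self-contained sketch of that synchronization pattern, not the benchmark's actual code; the two-rank assumption and buffer size are hypothetical:

```c
/* Minimal PSCW (post-start-complete-wait) sketch, assuming exactly two ranks.
 * Build/run (requires an MPI installation): mpicc pscw.c -o pscw && mpirun -np 2 ./pscw
 * This mirrors the synchronization path in the backtrace above
 * (MPI_Win_post -> ompi_osc_ucx_post), not the benchmark itself. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, peer;
    double *buf;
    MPI_Win win;
    MPI_Group world_group, peer_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;  /* assumes np == 2 */

    /* Window creation matches the benchmark header: MPI_Win_allocate */
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, 1, &peer, &peer_group);

    /* Expose our window to the peer (exposure epoch), then open an
     * access epoch on the peer's window. The failing WQE dump shows a
     * remote FETCH_ADD issued during this post step. */
    MPI_Win_post(peer_group, 0, win);
    MPI_Win_start(peer_group, 0, win);

    MPI_Put(buf, 1, MPI_DOUBLE, peer, 0, 1, MPI_DOUBLE, win);

    MPI_Win_complete(win);  /* close our access epoch */
    MPI_Win_wait(win);      /* wait until the peer's accesses finish */

    MPI_Group_free(&peer_group);
    MPI_Group_free(&world_group);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```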
yosefe commented 3 years ago

@janjust WDYT, can it be an issue in OMPI (osc/ucx)?

janjust commented 3 years ago

Probably not CUDA-related but OSC-related; I'll take it.

janjust commented 2 years ago

@yosefe @avildema, I looked into this. It should be closed here and moved to OMPI: this is an ompi/osc/ucx issue, not UCX itself. CUDA is not relevant.
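One way to double-check that attribution (a suggestion, not part of the original report; flags follow Open MPI's standard MCA component-exclusion syntax, and the cluster setup is assumed to match the report) is to rerun the benchmark with the ucx OSC component excluded, so Open MPI falls back to another one-sided component:

```shell
# Rerun osu_put_bibw with osc/ucx excluded; the '^' prefix in an MCA
# component list means "all components except the listed ones".
mpirun -np 2 --map-by node --bind-to core \
    --mca osc ^ucx \
    ./mpi/one-sided/osu_put_bibw

# If this run passes while the original fails, the fault is isolated
# to osc/ucx rather than the UCX transport layer itself.
```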