openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

[mtt][ucx-hwtm][imb] Test hung #7596

Closed: avildema closed this issue 3 years ago

avildema commented 3 years ago

Configuration

OMPI: 4.1.2rc2
MOFED: MLNX_OFED_LINUX-5.4-1.0.3.0
Module: hpcx-gcc (2021-10-27)
Test module: none
Nodes: jazz x4 (ppn=28(x4), nodelist=jazz[13,25,28,32])

MTT log: http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20211028_000104_76589_78596_jazz13.swx.labs.mlnx/html/test_stdout_W2aefH.txt
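
For orientation: the benchmark is hung in IMB_init_buffers_iter's one-element MPI_Allreduce over MPI_DOUBLE with MPI_MAX (frames #12-#16 of the trace below). A minimal, hypothetical C equivalent of the stuck call, for context only (not the actual IMB source):

    /* Hypothetical sketch of the call the benchmark blocks in,
     * per frames #12-#16 of the trace below; not the IMB source. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double in = 0.0, out = 0.0;

        MPI_Init(&argc, &argv);
        /* One-element max-reduction across all 112 ranks; the hung
         * rank never sees this call return. */
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }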

Cmd: /hpc/local/benchmarks/daily/next/2021-10-27/hpcx-gcc-redhat7.6/ompi/bin/mpirun -np 112 --display-map -mca btl self --tag-output --timestamp-output -mca pml ucx -mca coll '^hcoll' -x UCX_ERROR_SIGNALS=SIGILL,SIGSEGV,SIGBUS,SIGFPE,SIGABRT -x UCX_HANDLE_ERRORS=bt -x UCX_IB_PCI_RELAXED_ORDERING=off --bind-to core -x UCX_NET_DEVICES=mlx5_2:1 -mca osc ucx -x UCX_IB_REG_METHODS=rcache,direct -x UCX_TLS=dc_x,self,sm -x UCX_DC_MLX5_TM_ENABLE=y -x UCX_DC_MLX5_TX_POLICY=dcs_quota -mca pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 -mca pmix_base_collect_data 1 --map-by node /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20211028_000104_76589_78596_jazz13.swx.labs.mlnx/installs/tnwM/tests/imb/mtt-tests.git/imb/src/IMB-MPI1 -npmin 112 -iter 1000 -mem 0.9

Output:

node=jazz13, pid=159498:
Thread 4 (Thread 0x7f424eec9700 (LWP 159511)):
#0  0x00007f425263d483 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f4251ff4cc3 in epoll_dispatch (base=0x1445bb0, tv=<optimized out>) at epoll.c:407
#2  0x00007f4251ff8710 in opal_libevent2022_event_base_loop (base=0x1445bb0, flags=flags@entry=1) at event.c:1630
#3  0x00007f4251fb25de in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007f4252913dd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f425263cead in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f424b8b8700 (LWP 159520)):
#0  0x00007f425263d483 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f4251ff4cc3 in epoll_dispatch (base=0x146b840, tv=<optimized out>) at epoll.c:407
#2  0x00007f4251ff8710 in opal_libevent2022_event_base_loop (base=0x146b840, flags=flags@entry=1) at event.c:1630
#3  0x00007f424e20434e in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:232
#4  0x00007f4252913dd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f425263cead in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f4242aae700 (LWP 159600)):
#0  0x00007f425263d483 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f4248460aec in ucs_event_set_wait (event_set=<optimized out>, num_events=num_events@entry=0x7f4242aadec0, timeout_ms=<optimized out>, event_set_handler=event_set_handler@entry=0x7f4248444a90 <ucs_async_thread_ev_handler>, arg=arg@entry=0x7f4242aaded0) at sys/event_set.c:198
#2  0x00007f4248444dec in ucs_async_thread_func (arg=0x14ea950) at async/thread.c:130
#3  0x00007f4252913dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f425263cead in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f4253022740 (LWP 159498)):
#0  0x00007f4243bb5b14 in uct_ib_mlx5_get_cqe (cqe_index=122009, cq=0x156d328) at /build-result/src/hpcx-gcc-redhat7.6/ucx-master/src/uct/ib/mlx5/ib_mlx5.inl:13
#1  uct_rc_mlx5_iface_poll_rx_cq (poll_flags=1, iface=0x1564c30) at /build-result/src/hpcx-gcc-redhat7.6/ucx-master/src/uct/ib/rc/accel/rc_mlx5.inl:277
#2  uct_rc_mlx5_iface_common_poll_rx (poll_flags=1, iface=0x1564c30) at /build-result/src/hpcx-gcc-redhat7.6/ucx-master/src/uct/ib/rc/accel/rc_mlx5.inl:1435
#3  uct_dc_mlx5_iface_progress (flags=1, arg=0x1564c30) at dc/dc_mlx5.c:270
#4  uct_dc_mlx5_iface_progress_tm (arg=0x1564c30) at dc/dc_mlx5.c:290
#5  0x00007f42489ebe52 in ucs_callbackq_dispatch (cbq=<optimized out>) at /build-result/src/hpcx-gcc-redhat7.6/ucx-master/src/ucs/datastruct/callbackq.h:211
#6  uct_worker_progress (worker=<optimized out>) at /build-result/src/hpcx-gcc-redhat7.6/ucx-master/src/uct/api/uct.h:2592
#7  ucp_worker_progress (worker=0x157b000) at core/ucp_worker.c:2615
#8  0x00007f4251fac74c in opal_progress () at runtime/opal_progress.c:231
#9  0x00007f4252b756c5 in ompi_request_wait_completion (req=0x16cf458) at ../ompi/request/request.h:440
#10 ompi_request_default_wait (req_ptr=0x7ffea774ed60, status=0x7ffea774ed70) at request/req_wait.c:42
#11 0x00007f4252bb1051 in ompi_coll_base_sendrecv_actual (sendbuf=sendbuf@entry=0x16d2990, scount=scount@entry=1, sdatatype=sdatatype@entry=0x6153c0 <ompi_mpi_double>, dest=dest@entry=100, stag=stag@entry=-12, recvbuf=recvbuf@entry=0x7ffea774f070, rcount=rcount@entry=1, rdatatype=rdatatype@entry=0x6153c0 <ompi_mpi_double>, source=source@entry=100, rtag=rtag@entry=-12, comm=comm@entry=0x16d06b0, status=status@entry=0x0) at base/coll_base_util.c:62
#12 0x00007f4252bb2073 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=<optimized out>, rbuf=<optimized out>, count=1, dtype=0x6153c0 <ompi_mpi_double>, op=0x6147c0 <ompi_mpi_op_max>, comm=0x16d06b0, module=0x16e8a70) at base/coll_base_allreduce.c:219
#13 0x00007f4240c36429 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=<optimized out>, rbuf=<optimized out>, count=<optimized out>, dtype=<optimized out>, op=<optimized out>, comm=<optimized out>, module=0x16e8a70) at coll_tuned_decision_fixed.c:216
#14 0x00007f424147c942 in mca_coll_cuda_allreduce (sbuf=0x7ffea774f078, rbuf=0x7ffea774f070, count=1, dtype=0x6153c0 <ompi_mpi_double>, op=0x6147c0 <ompi_mpi_op_max>, comm=0x16d06b0, module=0x16e80f0) at coll_cuda_allreduce.c:63
#15 0x00007f4252b87cdc in PMPI_Allreduce (sendbuf=<optimized out>, recvbuf=<optimized out>, count=<optimized out>, datatype=<optimized out>, op=0x6147c0 <ompi_mpi_op_max>, comm=0x16d06b0) at pallreduce.c:113
#16 0x0000000000406e75 in IMB_init_buffers_iter ()
#17 0x00000000004022ce in main ()
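
Reading the trace: threads 2-4 are idle service threads parked in epoll_wait, while thread 1 (the main thread) is busy-polling the DC mlx5 receive CQ from inside ompi_request_default_wait, i.e. the point-to-point exchange underneath the allreduce never completes. A simplified, hypothetical shape of that wait path (frames #7-#9; not the actual Open MPI source):

    /* Hypothetical sketch of frames #7-#9: spin on progress until the
     * request completes. */
    while (!request_is_complete(req)) {   /* hypothetical completion check */
        opal_progress();                  /* frame #8 */
        /* opal_progress() reaches ucp_worker_progress() (frame #7), which
         * polls the mlx5 CQ via uct_dc_mlx5_iface_progress_tm (frames
         * #0-#4).  With UCX_DC_MLX5_TM_ENABLE=y the receive side uses
         * hardware tag matching; if the expected CQE never arrives, this
         * loop spins forever, matching the observed hang. */
    }
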
yosefe commented 3 years ago

fixed by #7637