Closed: avildema closed this issue 3 years ago.
Configuration
OMPI: 4.1.2rc2
MOFED: MLNX_OFED_LINUX-5.4-1.0.3.0
Module: hpcx-gcc (2021-10-27)
Test module: none
Nodes: jazz x4 (ppn=28(x4), nodelist=jazz[13,25,28,32])
MTT log: http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20211028_000104_76589_78596_jazz13.swx.labs.mlnx/html/test_stdout_W2aefH.txt
Cmd:
/hpc/local/benchmarks/daily/next/2021-10-27/hpcx-gcc-redhat7.6/ompi/bin/mpirun -np 112 --display-map -mca btl self --tag-output --timestamp-output -mca pml ucx -mca coll '^hcoll' -x UCX_ERROR_SIGNALS=SIGILL,SIGSEGV,SIGBUS,SIGFPE,SIGABRT -x UCX_HANDLE_ERRORS=bt -x UCX_IB_PCI_RELAXED_ORDERING=off --bind-to core -x UCX_NET_DEVICES=mlx5_2:1 -mca osc ucx -x UCX_IB_REG_METHODS=rcache,direct -x UCX_TLS=dc_x,self,sm -x UCX_DC_MLX5_TM_ENABLE=y -x UCX_DC_MLX5_TX_POLICY=dcs_quota -mca pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 -mca pmix_base_collect_data 1 --map-by node /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20211028_000104_76589_78596_jazz13.swx.labs.mlnx/installs/tnwM/tests/imb/mtt-tests.git/imb/src/IMB-MPI1 -npmin 112 -iter 1000 -mem 0.9
Output:
/hpc/local/benchmarks/daily/next/2021-10-27/hpcx-gcc-redhat7.6/ompi/bin/mpirun -np 112 --display-map -mca btl self --tag-output --timestamp-output -mca pml ucx -mca coll '^hcoll' -x UCX_ERROR_SIGNALS=SIGILL,SIGSEGV,SIGBUS,SIGFPE,SIGABRT -x UCX_HANDLE_ERRORS=bt -x UCX_IB_PCI_RELAXED_ORDERING=off --bind-to core -x UCX_NET_DEVICES=mlx5_2:1 -mca osc ucx -x UCX_IB_REG_METHODS=rcache,direct -x UCX_TLS=dc_x,self,sm -x UCX_DC_MLX5_TM_ENABLE=y -x UCX_DC_MLX5_TX_POLICY=dcs_quota -mca pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 -mca pmix_base_collect_data 1 --map-by node /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20211028_000104_76589_78596_jazz13.swx.labs.mlnx/installs/tnwM/tests/imb/mtt-tests.git/imb/src/IMB-MPI1 -npmin 112 -iter 1000 -mem 0.9
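For triage it may help to rerun with only the settings that appear relevant to the failing path. This is a hedged sketch, not a command from the report: it keeps the UCX variables from the full command above that select the dc_x transport and hardware tag matching (which the Thread 1 backtrace exercises), and assumes `mpirun` and `IMB-MPI1` are on PATH.

```shell
# Hypothetical reduced reproducer (assumption: the crash is tied to the
# dc_x transport with HW tag matching enabled, per the Thread 1 frames).
# Every variable below is a subset of the original command line.
export UCX_NET_DEVICES=mlx5_2:1
export UCX_TLS=dc_x,self,sm
export UCX_DC_MLX5_TM_ENABLE=y
export UCX_DC_MLX5_TX_POLICY=dcs_quota
export UCX_HANDLE_ERRORS=bt          # keep the backtrace on error

mpirun -np 112 --map-by node --bind-to core \
    -mca pml ucx -mca coll '^hcoll' \
    IMB-MPI1 -npmin 112 -iter 1000 -mem 0.9
```

If the hang still reproduces with this subset, the remaining flags from the full command can likely be ruled out.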
node=jazz13, pid=159498:

Thread 4 (Thread 0x7f424eec9700 (LWP 159511)):
#0 0x00007f425263d483 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f4251ff4cc3 in epoll_dispatch (base=0x1445bb0, tv=<optimized out>) at epoll.c:407
#2 0x00007f4251ff8710 in opal_libevent2022_event_base_loop (base=0x1445bb0, flags=flags@entry=1) at event.c:1630
#3 0x00007f4251fb25de in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4 0x00007f4252913dd5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f425263cead in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f424b8b8700 (LWP 159520)):
#0 0x00007f425263d483 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f4251ff4cc3 in epoll_dispatch (base=0x146b840, tv=<optimized out>) at epoll.c:407
#2 0x00007f4251ff8710 in opal_libevent2022_event_base_loop (base=0x146b840, flags=flags@entry=1) at event.c:1630
#3 0x00007f424e20434e in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:232
#4 0x00007f4252913dd5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f425263cead in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f4242aae700 (LWP 159600)):
#0 0x00007f425263d483 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f4248460aec in ucs_event_set_wait (event_set=<optimized out>, num_events=num_events@entry=0x7f4242aadec0, timeout_ms=<optimized out>, event_set_handler=event_set_handler@entry=0x7f4248444a90 <ucs_async_thread_ev_handler>, arg=arg@entry=0x7f4242aaded0) at sys/event_set.c:198
#2 0x00007f4248444dec in ucs_async_thread_func (arg=0x14ea950) at async/thread.c:130
#3 0x00007f4252913dd5 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f425263cead in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f4253022740 (LWP 159498)):
#0 0x00007f4243bb5b14 in uct_ib_mlx5_get_cqe (cqe_index=122009, cq=0x156d328) at /build-result/src/hpcx-gcc-redhat7.6/ucx-master/src/uct/ib/mlx5/ib_mlx5.inl:13
#1 uct_rc_mlx5_iface_poll_rx_cq (poll_flags=1, iface=0x1564c30) at /build-result/src/hpcx-gcc-redhat7.6/ucx-master/src/uct/ib/rc/accel/rc_mlx5.inl:277
#2 uct_rc_mlx5_iface_common_poll_rx (poll_flags=1, iface=0x1564c30) at /build-result/src/hpcx-gcc-redhat7.6/ucx-master/src/uct/ib/rc/accel/rc_mlx5.inl:1435
#3 uct_dc_mlx5_iface_progress (flags=1, arg=0x1564c30) at dc/dc_mlx5.c:270
#4 uct_dc_mlx5_iface_progress_tm (arg=0x1564c30) at dc/dc_mlx5.c:290
#5 0x00007f42489ebe52 in ucs_callbackq_dispatch (cbq=<optimized out>) at /build-result/src/hpcx-gcc-redhat7.6/ucx-master/src/ucs/datastruct/callbackq.h:211
#6 uct_worker_progress (worker=<optimized out>) at /build-result/src/hpcx-gcc-redhat7.6/ucx-master/src/uct/api/uct.h:2592
#7 ucp_worker_progress (worker=0x157b000) at core/ucp_worker.c:2615
#8 0x00007f4251fac74c in opal_progress () at runtime/opal_progress.c:231
#9 0x00007f4252b756c5 in ompi_request_wait_completion (req=0x16cf458) at ../ompi/request/request.h:440
#10 ompi_request_default_wait (req_ptr=0x7ffea774ed60, status=0x7ffea774ed70) at request/req_wait.c:42
#11 0x00007f4252bb1051 in ompi_coll_base_sendrecv_actual (sendbuf=sendbuf@entry=0x16d2990, scount=scount@entry=1, sdatatype=sdatatype@entry=0x6153c0 <ompi_mpi_double>, dest=dest@entry=100, stag=stag@entry=-12, recvbuf=recvbuf@entry=0x7ffea774f070, rcount=rcount@entry=1, rdatatype=rdatatype@entry=0x6153c0 <ompi_mpi_double>, source=source@entry=100, rtag=rtag@entry=-12, comm=comm@entry=0x16d06b0, status=status@entry=0x0) at base/coll_base_util.c:62
#12 0x00007f4252bb2073 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=<optimized out>, rbuf=<optimized out>, count=1, dtype=0x6153c0 <ompi_mpi_double>, op=0x6147c0 <ompi_mpi_op_max>, comm=0x16d06b0, module=0x16e8a70) at base/coll_base_allreduce.c:219
#13 0x00007f4240c36429 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=<optimized out>, rbuf=<optimized out>, count=<optimized out>, dtype=<optimized out>, op=<optimized out>, comm=<optimized out>, module=0x16e8a70) at coll_tuned_decision_fixed.c:216
#14 0x00007f424147c942 in mca_coll_cuda_allreduce (sbuf=0x7ffea774f078, rbuf=0x7ffea774f070, count=1, dtype=0x6153c0 <ompi_mpi_double>, op=0x6147c0 <ompi_mpi_op_max>, comm=0x16d06b0, module=0x16e80f0) at coll_cuda_allreduce.c:63
#15 0x00007f4252b87cdc in PMPI_Allreduce (sendbuf=<optimized out>, recvbuf=<optimized out>, count=<optimized out>, datatype=<optimized out>, op=0x6147c0 <ompi_mpi_op_max>, comm=0x16d06b0) at pallreduce.c:113
#16 0x0000000000406e75 in IMB_init_buffers_iter ()
#17 0x00000000004022ce in main ()
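Backtraces like the one above often arrive flattened onto a single line in MTT/CI logs, which makes them hard to read. A small helper (not part of this report; a sketch with names of my own choosing) can re-flow such gdb output by splitting at thread headers and `#N` frame markers:

```python
import re

def split_frames(bt: str) -> list[str]:
    """Re-flow a flattened gdb backtrace: one thread header or frame per line."""
    # Insert a newline before each frame marker like "#0 ", "#12 ".
    text = re.sub(r'\s+(?=#\d+\s)', '\n', bt)
    # And before each "Thread N (Thread 0x...)" header; the inner
    # "(Thread 0x..." does not match because 0x is not a bare integer.
    text = re.sub(r'\s+(?=Thread \d+ \(Thread )', '\n', text)
    return [line for line in text.splitlines() if line]
```

Feeding the raw log text through this produces one frame per line, which makes it easy to spot that Thread 1 is spinning in the UCX DC/tag-matching progress path while the other threads idle in `epoll_wait`.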
Fixed by #7637.