openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

ucp_worker.c:785 Assertion `progress_count == 0' failed #6525

Open dfaraj opened 3 years ago

dfaraj commented 3 years ago

Describe the bug

I ran many instances of this app across various 64-node partitions of the cluster. Most runs completed fine, but about 22 runs failed with the assertion below.

Running on 64 nodes, ppn=96. MPI command:
mpiexec --hostfile ./hosts -n 6144 -ppn 96 --cpu-bind list:0-15:16-31:32-47:48-63:64-79:80-95:96-111:112-127 -v --no-transfer ../mpi-bench -v -v -v -v 6144x6144x10944

PE 0: MPICH processor detected:
PE 0: AMD Rome (23:49:0) (family:model:stepping)
MPI VERSION : CRAY MPICH version 8.1.2.3 (ANL base 3.4a2)
MPI BUILD INFO : Thu Jan 21 2:43 2021 (git hash a6054a9) (CH4)
PE 0: MPICH environment settings =====================================
PE 0: MPICH_ENV_DISPLAY = 1
PE 0: MPICH_VERSION_DISPLAY = 1
PE 0: MPICH_ABORT_ON_ERROR = 0
PE 0: MPICH_CPUMASK_DISPLAY = 0
PE 0: MPICH_STATS_DISPLAY = 0
PE 0: MPICH_RANK_REORDER_METHOD = 1
PE 0: MPICH_RANK_REORDER_DISPLAY = 0
PE 0: MPICH_MEMCPY_MEM_CHECK = 0
PE 0: MPICH_USE_SYSTEM_MEMCPY = 0
PE 0: MPICH_OPTIMIZED_MEMCPY = 1
PE 0: MPICH_ALLOC_MEM_PG_SZ = 4096
PE 0: MPICH_ALLOC_MEM_POLICY = PREFERRED
PE 0: MPICH_ALLOC_MEM_AFFINITY = SYS_DEFAULT
PE 0: MPICH_MALLOC_FALLBACK = 0
PE 0: MPICH_MEM_DEBUG_FNAME =
PE 0: MPICH_INTERNAL_MEM_AFFINITY = SYS_DEFAULT
PE 0: MPICH_NO_BUFFER_ALIAS_CHECK = 0
PE 0: MPICH_COLL_SYNC = 0
PE 0: MPICH_USE_GPU_STREAM_TRIGGERED = 0
PE 0: MPICH/RMA environment settings =================================
PE 0: MPICH_RMA_MAX_PENDING = 64
PE 0: MPICH_RMA_SHM_ACCUMULATE = 0
PE 0: MPICH/UCX environment settings =================================
PE 0: MPICH_UCX_VERBOSE = 1
PE 0: MPICH_UCX_RC_MAX_RANKS = 7
PE 0: UCX_TLS = ud,self,sm
PE 0: UCX_NET_DEVICES = all
PE 0: UCX_IB_REG_METHODS = rcache
PE 0: UCX_UD_TIMEOUT = 10m
PE 0: UCX_LOG_LEVEL = warn
PE 0: MPICH CH4 UCX netmod using transports [ud,self,sm], devices [all]
PE 0: MPICH/COLLECTIVE environment settings ==========================
PE 0: MPICH_COLL_OPT_OFF = 0
PE 0: MPICH_BCAST_ONLY_TREE = 1
PE 0: MPICH_BCAST_INTERNODE_RADIX = 4
PE 0: MPICH_BCAST_INTRANODE_RADIX = 4
PE 0: MPICH_ALLTOALL_SHORT_MSG = 64-512
PE 0: MPICH_ALLTOALL_SYNC_FREQ = 1-24
PE 0: MPICH_ALLTOALLV_THROTTLE = 8
PE 0: MPICH_ALLGATHER_VSHORT_MSG = 1024-4096
PE 0: MPICH_ALLGATHERV_VSHORT_MSG = 1024-4096
PE 0: MPICH_GATHERV_SHORT_MSG = 131072
PE 0: MPICH_GATHERV_MIN_COMM_SIZE = 64
PE 0: MPICH_GATHERV_MAX_TMP_SIZE = 536870912
PE 0: MPICH_GATHERV_SYNC_FREQ = 16
PE 0: MPICH_IGATHERV_RAND_COMMSIZE = 2048
PE 0: MPICH_IGATHERV_RAND_RECVLIST = 0
PE 0: MPICH_SCATTERV_SHORT_MSG = 2048-8192
PE 0: MPICH_SCATTERV_MIN_COMM_SIZE = 64
PE 0: MPICH_SCATTERV_MAX_TMP_SIZE = 536870912
PE 0: MPICH_SCATTERV_SYNC_FREQ = 16
PE 0: MPICH_SCATTERV_SYNCHRONOUS = 0
PE 0: MPICH_ALLREDUCE_MAX_SMP_SIZE = 262144
PE 0: MPICH_ALLREDUCE_BLK_SIZE = 716800
PE 0: MPICH_ALLREDUCE_NO_SMP = 0
PE 0: MPICH_REDUCE_NO_SMP = 0
PE 0: MPICH_REDUCE_SCATTER_COMMUTATIVE_LONG_MSG_SIZE = 524288
PE 0: MPICH_REDUCE_SCATTER_MAX_COMMSIZE = 1000
PE 0: MPICH_SHARED_MEM_COLL_OPT = 1
PE 0: MPICH_SHARED_MEM_COLL_NCELLS = 8
PE 0: MPICH_SHARED_MEM_COLL_CELLSZ = 256
PE 0: MPICH MPIIO environment settings ===============================
PE 0: MPICH_MPIIO_HINTS_DISPLAY = 0
PE 0: MPICH_MPIIO_HINTS = NULL
PE 0: MPICH_MPIIO_ABORT_ON_RW_ERROR = disable
PE 0: MPICH_MPIIO_CB_ALIGN = 2
PE 0: MPICH_MPIIO_DVS_MAXNODES = -1
PE 0: MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY = 0
PE 0: MPICH_MPIIO_AGGREGATOR_PLACEMENT_STRIDE = -1
PE 0: MPICH_MPIIO_MAX_NUM_IRECV = 50
PE 0: MPICH_MPIIO_MAX_NUM_ISEND = 50
PE 0: MPICH_MPIIO_MAX_SIZE_ISEND = 10485760
PE 0: MPICH_MPIIO_OFI_STARTUP_CONNECT = disable
PE 0: MPICH_MPIIO_OFI_STARTUP_NODES_AGGREGATOR = 2
PE 0: MPICH MPIIO statistics environment settings ====================
PE 0: MPICH_MPIIO_STATS = 0
PE 0: MPICH_MPIIO_TIMERS = 0
PE 0: MPICH_MPIIO_WRITE_EXIT_BARRIER = 1
PE 0: MPICH Thread Safety settings ===================================
PE 0: MPICH_OPT_THREAD_SYNC = 1
PE 0: rank 0 required = single, was provided = single
[x1003c7s0b0n1:112734:0:112734] ucp_worker.c:785 Assertion `progress_count == 0' failed
[x1003c4s1b1n1:112117:0:112117] ucp_worker.c:785 Assertion `progress_count == 0' failed
[x1003c7s6b0n0:112259:0:112259] ucp_worker.c:785 Assertion `progress_count == 0' failed
[x1003c4s1b1n1:112071:0:112071] ucp_worker.c:785 Assertion `progress_count == 0' failed
[x1003c7s7b0n1:112288:0:112288] ucp_worker.c:785 Assertion `progress_count == 0' failed
==== backtrace (tid: 112734) ====
0 0x000000000003798a ucp_worker_iface_check_events() /tmp/ucx-master/src/ucp/core/ucp_worker.c:785
1 0x000000000003798a ucp_worker_iface_deactivate() /tmp/ucx-master/src/ucp/core/ucp_worker.c:823
2 0x000000000003798a ucp_worker_iface_init() /tmp/ucx-master/src/ucp/core/ucp_worker.c:1365
3 0x0000000000038408 ucp_worker_add_resource_ifaces() /tmp/ucx-master/src/ucp/core/ucp_worker.c:1117
4 0x000000000003a232 ucp_worker_create() /tmp/ucx-master/src/ucp/core/ucp_worker.c:2108
5 0x00000000005a0d34 MPIDI_UCX_mpi_init_hook() :0
6 0x00000000005503f1 MPID_Init() :0
7 0x00000000000c18fd MPIR_Init_thread() :0
8 0x00000000000c16d4 MPI_Init() ???:0
9 0x0000000000225bf6 main_init() ???:0
10 0x00000000002fb539 bench_main() ???:0
11 0x000000000002434a __libc_start_main() ???:0
12 0x0000000000221cfa _start() /home/abuild/rpmbuild/BUILD/glibc-2.26/csu/../sysdeps/x86_64/start.S:120
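Since the failure shows up in only some runs and on several nodes at once, it may help to list which hosts reported the assertion across the job logs. The following is a rough sketch, not part of UCX: `failing_hosts` is a hypothetical helper name, and it assumes the `[host:pid:n:tid] file:line` log prefix seen in the output above.

```shell
# Hypothetical helper (not part of UCX): print each host that logged the
# ucp_worker.c assertion, one host per line, duplicates removed.
# Assumes the "[host:pid:n:tid] file:line Assertion" prefix shown above.
# Reads the log files given as arguments, or stdin when none are given.
failing_hosts() {
    grep -hoE '\[[^]:]+:[0-9]+:[0-9]+:[0-9]+\] ucp_worker\.c:[0-9]+ Assertion' "$@" \
        | sed 's/^\[\([^:]*\):.*$/\1/' \
        | sort -u
}
```

For example, `failing_hosts job-*.out` (the log file names are placeholders) would print the affected hosts, which could then be compared against the host files of the runs that passed.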

Steps to Reproduce

mpiexec --hostfile ./hosts -n 6144 -ppn 96 --cpu-bind list:0-15:16-31:32-47:48-63:64-79:80-95:96-111:112-127 -v --no-transfer ../mpi-bench -v -v -v -v 6144x6144x10944

UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v):
UCX from git master branch (UCT version=1.11.0 revision, configured with: --prefix=/tmp/ucx-master --with-xpmem=no)
UCX envs:
PE 0: UCX_TLS=ud,self,sm
PE 0: UCX_NET_DEVICES=all
PE 0: UCX_IB_REG_METHODS=rcache
PE 0: UCX_UD_TIMEOUT=10m
PE 0: UCX_LOG_LEVEL=warn

Setup and versions

OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...):
Linux x1003c6s3b0n0 5.3.18-22-default #1 SMP Wed Jun 3 12:16:43 UTC 2020 (720aeba) x86_64 x86_64 x86_64 GNU/Linux
SUSE Linux Enterprise Server 15 SP2
Nodes are AMD EPYC 7H12 64-Core Processor, 128 cores total

For RDMA/IB/RoCE related issues:
Driver version:
rdma-core-51mlnx1-1.51237.x86_64
MLNX_OFED_LINUX-5.1-2.3.7.1
CA 'mlx5_0'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.26.4012
    Hardware version: 0
    Node GUID: 0x0040a683bc460000
    System image GUID: 0x0040a683bc460000
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x000000fffe005b21
        Link layer: Ethernet
CA 'mlx5_1'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.26.4012
    Hardware version: 0
    Node GUID: 0x0040a683bc540000
    System image GUID: 0x0040a683bc540000
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x000000fffe005a21
        Link layer: Ethernet

Additional information (depending on the issue)

MPI VERSION : CRAY MPICH version 8.1.2.3 (ANL base 3.4a2)
MPI BUILD INFO : Thu Jan 21 2:43 2021 (git hash a6054a9) (CH4)

Output of ucx_info -d to show transports and devices recognized by UCX:
ucx_info -d | grep Tran
Transport: posix
Transport: sysv
Transport: self
Transport: tcp
Transport: tcp
Transport: tcp
Transport: tcp
Transport: tcp
Transport: rc_verbs
Transport: rc_mlx5
Transport: dc_mlx5
Transport: ud_verbs
Transport: ud_mlx5
Transport: rc_verbs
Transport: rc_mlx5
Transport: dc_mlx5
Transport: ud_verbs
Transport: ud_mlx5
Transport: cma

yosefe commented 3 years ago

@dfaraj can you pls try reverting https://github.com/openucx/ucx/pull/6384/commits/9a615a6b941a8d8113704aaec70d7272d67947bc and see if it resolves the issue? thanks!

dfaraj commented 3 years ago

Unfortunately it did not.
[x1004c6s4b1n0:114645:0:114645] ucp_worker.c:783 Assertion `progress_count == 0' failed
==== backtrace (tid: 114645) ====
0 /p/app/ucx/master/lib/libucs.so.0(ucs_handle_error+0xe4) [0x1500cecbb934]
1 /p/app/ucx/master/lib/libucs.so.0(ucs_fatal_error_message+0x60) [0x1500cecb8880]
2 /p/app/ucx/master/lib/libucs.so.0(ucs_fatal_error_format+0xd6) [0x1500cecb89b6]
3 /p/app/ucx/master/lib/libucp.so.0(ucp_worker_iface_init+0x580) [0x1500ce5b2870]
4 /p/app/ucx/master/lib/libucp.so.0(+0x37c80) [0x1500ce5b1c80]
5 /p/app/ucx/master/lib/libucp.so.0(ucp_worker_create+0x5f3) [0x1500ce5b44b3]
6 /opt/cray/pe/mpich/8.1.2/ucx/cray/9.1/lib/libmpi_cray.so.12(+0x5a0d34) [0x1500d0e50d34]
7 /opt/cray/pe/mpich/8.1.2/ucx/cray/9.1/lib/libmpi_cray.so.12(+0x5503f1) [0x1500d0e003f1]
8 /opt/cray/pe/mpich/8.1.2/ucx/cray/9.1/lib/libmpi_cray.so.12(+0xc18fd) [0x1500d09718fd]
9 /opt/cray/pe/mpich/8.1.2/ucx/cray/9.1/lib/libmpi_cray.so.12(PMPI_Init+0x94) [0x1500d09716d4]
10 /p/home/dfaraj/benchmarks/FFTW/mpi-bench() [0x225bf6]
11 /p/home/dfaraj/benchmarks/FFTW/mpi-bench() [0x2fb539]
12 /lib64/libc.so.6(__libc_start_main+0xea) [0x1500cef0e34a]
13 /p/home/dfaraj/benchmarks/FFTW/mpi-bench() [0x221cfa]

yosefe commented 3 years ago

@dfaraj can you pls try removing UCX_TLS, or changing it to UCX_TLS=dc,self,sm?
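A minimal sketch of the two variants suggested here, assuming the transport list is taken from the job environment. The startup log above prints UCX_TLS under the MPICH/UCX settings, so Cray MPICH may be setting it itself, in which case the launcher's own configuration would also need checking.

```shell
# Variant 1: remove the restriction so UCX selects transports on its own.
unset UCX_TLS

# Variant 2: keep an explicit list, but use DC instead of UD.
export UCX_TLS=dc,self,sm

# Optional check: print the transport list UCX will actually use
# (only runs if ucx_info is on PATH).
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -c | grep UCX_TLS
fi

# Then rerun the same mpiexec command from "Steps to Reproduce".
```

Use only one of the two variants per run; as written, the block ends with variant 2 in effect.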

dfaraj commented 3 years ago

Yossi

just to be clear, I earlier did a git reset <commit before 9a615a6>

If I try to do a git revert 9a615a6, I get:
dfaraj@dclogin2:~/ucx> git revert 9a615a6
Auto-merging src/uct/ib/ud/verbs/ud_verbs.c
CONFLICT (content): Merge conflict in src/uct/ib/ud/verbs/ud_verbs.c
Auto-merging src/uct/ib/ud/base/ud_iface.h
CONFLICT (content): Merge conflict in src/uct/ib/ud/base/ud_iface.h
Auto-merging src/uct/ib/ud/base/ud_iface.c
CONFLICT (content): Merge conflict in src/uct/ib/ud/base/ud_iface.c
Auto-merging src/uct/ib/ud/accel/ud_mlx5.c
CONFLICT (content): Merge conflict in src/uct/ib/ud/accel/ud_mlx5.c
error: could not revert 9a615a6b9... UCT/UD: Count pending async events in progress
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'
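For context, the two approaches are not equivalent: `git reset` to the earlier commit moves the branch back and drops every later change as well, while `git revert` tries to undo only that one commit on top of current master, which conflicts when later commits touched the same UD files. A small sketch of how the conflicted revert could be backed out and the parent commit tested instead; the function names are hypothetical helpers, not anything from the UCX tree.

```shell
# Hypothetical helpers, to be run inside the UCX git checkout.

# Abandon the conflicted revert and restore the tree to its pre-revert state.
abort_revert() {
    git revert --abort
}

# Check out the parent of a suspect commit on a detached HEAD, leaving the
# branch itself untouched. Usage: checkout_parent 9a615a6
checkout_parent() {
    git checkout --detach "$1^"
}
```

Building from `checkout_parent 9a615a6` tests the same tree as the earlier `git reset`, so it confirms that result rather than isolating the single commit.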

dfaraj commented 3 years ago

Yossi Any comments regarding my last post?