pmodels / mpich

Official MPICH Repository
http://www.mpich.org

pt2pt/probe_unexp times out under multi-vci #5626

Closed sagarth closed 2 years ago

sagarth commented 2 years ago

Test: pt2pt/probe_unexp 4

Fails under SEP + multi-vci but passes under single vci.

The test prints No Errors and then hangs during finalize. Of the 4 ranks, 2 were stuck: one in MPIDI_NM_mpi_finalize_hook and the other in MPIDU_Init_shm_finalize.
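Where each rank is stuck can be read from a backtrace of the hung processes. A minimal sketch, assuming gdb and pgrep are installed and that attaching to the processes is permitted on the node; the `dump_backtraces` helper is hypothetical and not part of MPICH:

```shell
# Hypothetical helper: print a backtrace of every thread in each process
# whose name exactly matches $1 (for example, probe_unexp).
dump_backtraces() {
    local name="$1" pid
    for pid in $(pgrep -x "$name" 2>/dev/null || true); do
        echo "=== pid $pid ==="
        # -batch makes gdb exit after running the -ex command.
        gdb -p "$pid" -batch -ex "thread apply all bt" 2>/dev/null || true
    done
}

# Example, run on the node while the test is hung:
# dump_backtraces probe_unexp
```

Function names only appear in the frames if the library was built with debug symbols.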

Replicated locally with psm2 and netmod on a single node. Jenkins showed the problem with auto as well.

hzhou commented 2 years ago

@sagarth Please add in the complete configure and runtime information. Also please attach the link or copy the output of the Jenkins test if you have it. Thanks!

sagarth commented 2 years ago

Additional relevant details for build config: --disable-ofi-domain --with-ch4-shmmods=none

Complete config:

$src/configure -C --prefix=${inst} --disable-perftest --with-custom-version-string=drop42 --disable-ofi-domain --disable-ft-tests -with-fwrapname=mpigf -with-file-system=ufs+nfs -enable-timer-type=linux86_cycle -with-assert-level=0 -enable-shared -enable-static -enable-error-messages=yes -enable-large-tests -enable-strict -enable-collalgo-tests -enable-izem-queue -with-zm-prefix=yes --with-default-ofi-provider=psm2 -enable-fast=none -enable-g=all -enable-timing=runtime -enable-error-checking=all -enable-debuginfo -with-device=ch4:ofi -enable-handle-allocation=default -enable-threads=multiple -enable-ch4-netmod-inline=no -enable-ch4-shm-inline=no -enable-mpit-pvars=all --with-ch4-shmmods=none --enable-ch4-mt=runtime -enable-thread-cs=per-vci --with-1libfabric=embedded --disable-spawn --with-ch4-max-vcis=4 --without-ze 'MPICHLIB_CFLAGS=-ggdb  -ggdb -Wall -mtune=generic -std=gnu99' 'MPICHLIB_CXXFLAGS=-ggdb  -ggdb -Wall -mtune=generic' 'MPICHLIB_FCFLAGS=-ggdb  -ggdb -mtune=generic -ffree-line-length-256' 'MPICHLIB_F77FLAGS=-ggdb  -ggdb -mtune=generic -ffree-line-length-256' 'MPICHLIB_LDFLAGS=-O0 -L/opt/intel/csr/lib64 -L/usr/lib64 -L/usr/lib64 -L/lib64 -mtune=generic'

Runtime details:

MPIR_CVAR_DEFAULT_THREAD_LEVEL = MPI_THREAD_MULTIPLE
MPIR_CVAR_ODD_EVEN_CLIQUES = 1
MPIR_CVAR_CH4_OFI_MAX_VNIS = 4
MPIR_CVAR_CH4_OFI_MAX_RMA_SEP_CTX = 4
MPIR_CVAR_CH4_MT_MODEL = direct
MPIR_CVAR_CH4_NUM_VCIS = 4
MPIR_CVAR_CH4_OFI_EAGER_MAX_MSG_SIZE = 16384

mpirun -n 4 pt2pt/probe_unexp

Output: No Errors

Then the test hangs.
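A hang like this can be turned into a scriptable failure by running the test under a time limit. The sketch below assumes coreutils `timeout` is available and relies on MPICH reading `MPIR_CVAR_*` settings from the environment; the `run_with_hang_check` helper and the 120-second limit are hypothetical choices, not part of the MPICH test suite:

```shell
# Hypothetical helper: run a command under a time limit and report a hang.
# coreutils `timeout` exits with status 124 when the limit expires.
run_with_hang_check() {
    local limit="$1" status=0
    shift
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "HANG: '$*' did not finish within ${limit}s" >&2
    fi
    return "$status"
}

# Runtime settings copied from the report above.
export MPIR_CVAR_DEFAULT_THREAD_LEVEL=MPI_THREAD_MULTIPLE
export MPIR_CVAR_ODD_EVEN_CLIQUES=1
export MPIR_CVAR_CH4_OFI_MAX_VNIS=4
export MPIR_CVAR_CH4_OFI_MAX_RMA_SEP_CTX=4
export MPIR_CVAR_CH4_MT_MODEL=direct
export MPIR_CVAR_CH4_NUM_VCIS=4
export MPIR_CVAR_CH4_OFI_EAGER_MAX_MSG_SIZE=16384

# Example invocation (commented out so the snippet can be sourced safely):
# run_with_hang_check 120 mpirun -n 4 pt2pt/probe_unexp
```

An exit status of 124 then distinguishes a hang from an ordinary test failure.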

hzhou commented 2 years ago

@sagarth I am not able to reproduce this. There are too many irrelevant configure options and variables. Could you narrow it down until we find the minimum condition? That should help me pin it down.

sagarth commented 2 years ago

This failure was under the same config as: https://github.com/pmodels/mpich/issues/5627. I think https://github.com/pmodels/mpich/pull/5634 should have fixed this issue as well. I did some local testing with those patches and this test passed.

hzhou commented 2 years ago

> This failure was under the same config as: #5627. I think #5634 should have fixed this issue as well. I did some local testing with those patches and this test passed.

Sounds good. Feel free to close the issue once you confirm with testing.

sagarth commented 2 years ago

It is passing. However, I noticed one difference: with 4 vcis (compared to 1 vci), the test takes longer to complete after printing No Errors.
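The slowdown could be quantified by timing the same command under each VCI count. This is a rough sketch with whole-second resolution; the `elapsed_seconds` and `compare_vcis` helpers are hypothetical, and it assumes the build allows up to 4 VCIs (as with --with-ch4-max-vcis=4 above):

```shell
# Hypothetical helper: print the wall-clock seconds a command takes.
# The command's own output and exit status are discarded.
elapsed_seconds() {
    local start end
    start=$(date +%s)
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical helper: time the same command with 1 and with 4 VCIs.
# The subshell keeps the exported variable from leaking into the caller.
compare_vcis() {
    local n secs
    for n in 1 4; do
        secs=$(
            export MPIR_CVAR_CH4_NUM_VCIS="$n"
            elapsed_seconds "$@"
        )
        echo "vcis=$n elapsed=${secs}s"
    done
}

# Example invocation (commented out so the snippet can be sourced safely):
# compare_vcis mpirun -n 4 pt2pt/probe_unexp
```

Because the whole run is timed, this does not by itself show whether the extra time is spent in finalize; it only confirms the difference between the two settings.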

hzhou commented 2 years ago

It seems the issue was fixed, per the last conversation. Closing.