pkcoff closed this issue 7 years ago.
I should also add that if I run on just 1 node in c4 mode then all the ranks hang in the barrier, but if I run on 4 nodes in c1 mode then ranks 1-3 fall through the barrier while rank 0 is still stuck in it, so the intranode code path also affects the nature of the hang.
So the problem has something to do with the progress thread and completion queue processing when using subcomms and then going back to using MPI_COMM_WORLD.
What progress thread are you referring to, since the bgq provider doesn't use one with FI_PROGRESS_MANUAL?
I meant to say the progress engine from MPICH, not a separate thread.
Here are the source file and stack traces, attached: acmecommhang.zip
@pkcoff this test completes on my laptop with 4 ranks with the sockets provider. It completes in a default config, and also with the local debug options MPIR_CVAR_NOLOCAL=1 and MPIR_CVAR_ODD_EVEN_CLIQUES=1.
OK, this is probably a bgq issue; I am continuing to investigate. Thanks.
Ugh, it is a heisenbug: with full tracing turned on it works fine.
It has something to do with MPI_ANY_SOURCE and the irecv; if I give the actual source ranks instead, it works.
MPI_ANY_SOURCE handling at the CH4 level is not quite right. Currently we create two recv requests, one for the netmod and one for shm. When one completes, the other is not freed/canceled as it should be. It might be hanging waiting for one or more of those orphaned requests to complete?
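For what it's worth, here is a self-contained toy sketch of the pattern being described; the names are illustrative stand-ins, not the actual CH4 request machinery:

```c
/* Toy model only -- illustrative names, not real MPICH/CH4 code.
 * For an MPI_ANY_SOURCE receive, one request is posted to the netmod path
 * and one to the shm path; when either completes, the other must be
 * cancelled and released, otherwise it is left as an orphan that the
 * progress loop can keep waiting on. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool posted, completed, cancelled;
} toy_req;

/* Cancel a still-pending request so it can no longer be matched. */
static void toy_cancel(toy_req *r)
{
    if (r->posted && !r->completed)
        r->cancelled = true;
}

/* Called when one side of the any-source pair completes. */
static void anysrc_complete(toy_req *netmod_req, toy_req *shm_req,
                            bool completed_on_shm)
{
    if (completed_on_shm) {
        shm_req->completed = true;
        toy_cancel(netmod_req);   /* skipping this is the orphan bug */
    } else {
        netmod_req->completed = true;
        toy_cancel(shm_req);      /* skipping this is the orphan bug */
    }
}

int main(void)
{
    toy_req net = { .posted = true }, shm = { .posted = true };
    anysrc_complete(&net, &shm, true);
    printf("netmod request cancelled: %s\n", net.cancelled ? "yes" : "no");
    return 0;
}
```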
I think the standard bgq config has the shm netmod disabled, correct? If so, then maybe this disabled-shm code path is not well tested?
There is definitely a timing element to this as well: if I keep the MPI_ANY_SOURCE but just do a sleep(5) in front of the barrier, then it works too. The barrier implementation in ch4 does a bcast via MPIR_Bcast_binomial, which in turn does a bunch of MPIC_Recv/MPIC_Send calls. I think the MPI_ANY_SOURCE irecv in rank 0 from my previous irecv-send-waitall loop is matching one of those barrier sends. That brings into question the is_match function we're using for the recv contexts: basically, I think our current logic disregards the tag in the case of MPI_ANY_SOURCE (so if MPI_ANY_SOURCE is specified the context is considered a match even if the tags differ). I think that is wrong; we still need to look at the tag.
The barrier send uses MPIR_BCAST_TAG to prevent this from happening, I think:
mpi_errno = MPIC_Send(tmp_buf, nbytes, MPI_BYTE, dst, MPIR_BCAST_TAG, comm_ptr, errflag);
I changed our matching function to check the tag independently of the source in the case of any_source, and that seems to fix this problem. I need to confirm some things with Mike, but I can't believe we could be this fundamentally broken and still pass the test suite and run a bunch of apps successfully...
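To make the before/after concrete, here is a self-contained toy version of the matching logic in question (simplified, illustrative names only, not the actual provider code): the buggy variant treats MPI_ANY_SOURCE as a match regardless of tag, which is how the any-source irecv can steal a bcast send carrying MPIR_BCAST_TAG, while the fixed variant checks the tag independently of the source wildcard.

```c
/* Toy sketch of the matching bug and fix -- not the real match code. */
#include <stdbool.h>
#include <stdio.h>

#define ANY_SOURCE (-1)   /* stand-in for MPI_ANY_SOURCE */

/* Buggy: with ANY_SOURCE, the tag is never inspected. */
static bool is_match_buggy(int rsrc, int rtag, int ssrc, int stag)
{
    if (rsrc == ANY_SOURCE)
        return true;                  /* wrong: ignores the tag */
    return rsrc == ssrc && rtag == stag;
}

/* Fixed: the tag is checked independently of the source wildcard. */
static bool is_match_fixed(int rsrc, int rtag, int ssrc, int stag)
{
    bool src_ok = (rsrc == ANY_SOURCE) || (rsrc == ssrc);
    bool tag_ok = (rtag == stag);     /* a tag wildcard would be handled separately */
    return src_ok && tag_ok;
}

int main(void)
{
    /* An any-source irecv posted with tag 0 must NOT match a bcast send
     * carrying a different (collective) tag. */
    printf("buggy match: %d, fixed match: %d\n",
           is_match_buggy(ANY_SOURCE, 0, 2, 1234),
           is_match_fixed(ANY_SOURCE, 0, 2, 1234));
    return 0;
}
```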
Great! It would be good to get this patch into master soon.
Mike concurred; I ran the testsuite, and NEK, which was broken, now works. I have an open PR now waiting on the ofiwg people to approve it: https://github.com/ofiwg/libfabric/pull/2773
Fixed via https://github.com/ofiwg/libfabric/pull/2773.
The ClimateEnergy_2 INCITE project on bgq has a code named ACME that hangs with pami, so they are trying ofi. However, there appears to be a bug they hit before they can even get to the pami-hang part, resulting in a different hang that I have been able to reproduce with a simple testcase. I have attached the test and stack trace.

Essentially, this test creates a subcomm for every rank in MPI_COMM_WORLD except rank 0 and then has that subcomm send its size to rank 0 (an irecv-send-waitall loop), with a barrier after the waitall. The hang occurs in the barrier, but it isn't specifically an issue with the barrier code: in the actual ACME code there isn't a barrier, but rather other collectives and send-recv operations on MPI_COMM_WORLD that hang. On BGQ I can reproduce this with just 4 ranks. So the problem has something to do with the progress thread and completion queue processing when using subcomms and then going back to using MPI_COMM_WORLD. I think the problem is in CH4 but I'm not 100% sure (this is with FI_PROGRESS_MANUAL set in ofi).

Could someone run this test on sockets / linux and see if it also hangs there? Thanks. Due to technical issues with github I can't attach any files at the moment; for now here is the source code, and I'll attach the files and log once things are working again:
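(The original listing was not captured in this thread; acmecommhang.zip above is the authoritative test. What follows is a minimal sketch reconstructed from the description, so the tag value and exact communicator construction are assumptions.)

```c
/* Sketch of the reproducer described above: a subcomm holding every rank
 * except world rank 0; each subcomm rank sends its subcomm size to rank 0,
 * which posts MPI_ANY_SOURCE irecvs and waits on them; then a barrier on
 * MPI_COMM_WORLD, which is where the hang is observed. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int wrank, wsize;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);

    /* Subcomm containing every rank except world rank 0. */
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, wrank == 0 ? MPI_UNDEFINED : 1,
                   wrank, &subcomm);

    if (wrank == 0) {
        /* Rank 0: post an MPI_ANY_SOURCE irecv per subcomm member, then wait. */
        int nrecv = wsize - 1;
        int *sizes = malloc(nrecv * sizeof(int));
        MPI_Request *reqs = malloc(nrecv * sizeof(MPI_Request));
        for (int i = 0; i < nrecv; i++)
            MPI_Irecv(&sizes[i], 1, MPI_INT, MPI_ANY_SOURCE, 0,
                      MPI_COMM_WORLD, &reqs[i]);
        MPI_Waitall(nrecv, reqs, MPI_STATUSES_IGNORE);
        free(sizes);
        free(reqs);
    } else {
        /* Every other rank sends its subcomm size to world rank 0. */
        int ssize;
        MPI_Comm_size(subcomm, &ssize);
        MPI_Send(&ssize, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    /* The hang reported above occurs here. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (subcomm != MPI_COMM_NULL)
        MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}
```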