open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

mpi4py test failures with ompi@main #12929

dalcinl opened this issue 5 days ago

dalcinl commented 5 days ago

Nightly mpi4py tests with ompi@main have been failing from time to time. The tests pass after a re-run, so the issue is not easily reproducible. The latest failure produced the following output.

testCreateFromGroup (test_comm.TestCommSelfDup.testCreateFromGroup) ... [fv-az654-539:143367] PMIX ERROR: PMIX_ERR_UNPACK_READ_PAST_END_OF_BUFFER in file client/pmix_client_group.c at line 1376

Full logs here

rhc54 commented 5 days ago

I've noticed some semi-random errors on PRs, but I haven't seen that particular error message before. I suspect that specific error may be indicative of the growing disconnect between the PMIx master branch and the OMPI fork of PRRTE. I've tried to start some discussion over here about it, but due to Supercomputing and the holidays it will take some time to address the problem.

The line number indicates that the PMIx submodule isn't current - indeed, a quick glance shows it is far behind the head of the master branch. I can post a PR to update it, just to see if it impacts anything.
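The "submodule isn't current" check above can be done with plain git: compare the commit the superproject pins against the tip of the submodule's upstream branch. Below is a minimal, self-contained sketch of that technique. It builds a throwaway superproject and submodule under `mktemp` so it runs anywhere; the path `3rd-party/openpmix` and the branch name `master` are illustrative assumptions, not a claim about OMPI's exact layout.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the upstream PMIx repo: two commits on master,
# so the tip is one commit ahead of the first commit.
git init -q -b master upstream
git -C upstream -c user.email=a@b -c user.name=t commit -q --allow-empty -m "old"
old=$(git -C upstream rev-parse HEAD)
git -C upstream -c user.email=a@b -c user.name=t commit -q --allow-empty -m "new"

# Stand-in for the OMPI superproject, pinning the submodule at the old commit
# (the "3rd-party/openpmix" path is an assumption for illustration).
git init -q -b master super
cd super
git -c protocol.file.allow=always submodule add -q "$tmp/upstream" 3rd-party/openpmix
git -C 3rd-party/openpmix checkout -q "$old"
git add 3rd-party/openpmix
git -c user.email=a@b -c user.name=t commit -q -m "pin submodule at old commit"

# The actual check: how far behind upstream is the pinned commit?
behind=$(git -C 3rd-party/openpmix rev-list --count HEAD..origin/master)
echo "submodule is behind by $behind commit(s)"
```

In a real checkout, the same last two commands (after a `git -C <submodule> fetch`) show whether the pinned PMIx commit is behind the head of its upstream branch.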

However, the overall problem could have nothing to do with PMIx or PRRTE. 🤷‍♂️ Difficult to say.

rhc54 commented 5 days ago

Sigh - can't update PMIx as the OMPI PRRTE fork is simply too out-of-sync. 🤷‍♂️ Not much I can help with, I'm afraid.

rhc54 commented 5 days ago

FWIW: looking at the OMPI nightly regression tests (their own test suite), it appears that the one-sided tests are uniformly failing in both the main and v5.0 branches. I'm seeing the same failure signatures that Debian has been reporting elsewhere.

Interestingly enough, I'm not seeing the failure you are reporting here - but given it is intermittent, that may be simply luck.