Open inducer opened 1 year ago
Maybe this could be a workaround - we disable mpi4py's mprobe in mirgecom due to a similar crash (observed in Spectrum MPI, https://github.com/illinois-ceesd/mirgecom/issues/132):
Thanks for the tip! Though it seems that setting recv_mprobe = False does not avoid this particular issue.
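For anyone who wants to try the same setting, here is a minimal sketch of how it is usually applied. It assumes the option in question is mpi4py's runtime configuration attribute `mpi4py.rc.recv_mprobe`, and that it has to be set before `mpi4py.MPI` is first imported. The helper name `disable_matched_probe` is made up for illustration; it is not part of mpi4py or mirgecom.

```python
def disable_matched_probe():
    """Ask mpi4py not to use matched probes for pickle-based recv().

    Returns True if the setting was applied, False if mpi4py is not installed.
    (Hypothetical helper, for illustration only.)
    """
    try:
        import mpi4py  # importing the package alone does not initialize MPI
    except ImportError:
        return False
    # Assumption: this must run before the first `from mpi4py import MPI`.
    mpi4py.rc.recv_mprobe = False
    return True


applied = disable_matched_probe()
print("recv_mprobe disabled:", applied)
```

As noted above, this did not avoid the crash reported here, so treat it as something to rule out rather than a fix.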
Exciting news: while I don't know what exactly the issue is, OpenMPI 4.1.5-1 seems to include a fix that makes it work properly with the previously-offending version of libfabric1.
Sample CI failure: https://gitlab.tiker.net/inducer/meshmode/-/jobs/533461
Similar failure in grudge: https://gitlab.tiker.net/inducer/grudge/-/jobs/533485
Sample traceback:
Downgrading to libfabric (see here) appears to resolve this.
This is the code in mpi4py that ultimately fails; it's a matched receive (mrecv).

@majosm Got any ideas? (Pinging you since the two of us last touched this code.)
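For readers unfamiliar with matched receives, the pattern can be sketched as below. This is not the mpi4py-internal code referred to above, only a single-process illustration using mpi4py's public `Comm.mprobe` and `Message.recv`. The helper name `matched_receive_roundtrip`, the tag value, and the payload are made up for the example.

```python
def matched_receive_roundtrip(payload, tag=7):
    """Send `payload` to ourselves and receive it via a matched probe.

    Returns the received object, or None if mpi4py (or an MPI library)
    is not available. (Hypothetical helper, for illustration only.)
    """
    try:
        from mpi4py import MPI
    except ImportError:
        return None
    comm = MPI.COMM_SELF  # single-rank communicator, so no mpiexec is needed
    req = comm.isend(payload, dest=0, tag=tag)  # nonblocking send to self
    msg = comm.mprobe(source=0, tag=tag)  # matched probe: claims this message
    received = msg.recv()  # matched receive on the claimed message
    req.wait()  # complete the send request
    return received


result = matched_receive_roundtrip({"values": [1, 2, 3]})
print(result)
```

On a system affected by this issue, a round trip like this might be a small way to check whether the matched-receive path itself is what fails, though that is untested here.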