Open jeffhammond opened 1 month ago
This bug is in Cray MPI (/opt/cray/pe/mpich/8.1.29), too, so it must be from MPICH.
Fatal error in PMPI_Rget_accumulate: Invalid datatype, error stack:
PMPI_Rget_accumulate(235): MPI_Rget_accumulate(origin_addr=(nil), origin_count=0, MPI_BYTE, result_addr=0x37899d0, result_count=1, dtype=USER<contig>, target_rank=0, target_disp=8, target_count=1, dtype=USER<contig>, MPI_NO_OP, win=0xa0000002, 0x7fff7159aafc) failed
PMPI_Rget_accumulate(170): Datatype has not been committed
srun: error: nid002439: task 3: Exited with exit code 255
srun: Terminating StepId=8094140.0
slurmstepd: error: *** STEP 8094140.0 ON nid002438 CANCELLED AT 2024-10-03T17:53:15 ***
srun: error: nid002438: tasks 0-2: Exited with exit code 255
This is not our bug and we will not fix it, but the details are documented here for posterity.
There is a bug in Intel MPI 2021.10 and Cray MPI 8.1.29 when using request-based RMA (https://github.com/pmodels/armci-mpi/pull/53). It could be an MPICH bug in the argument-checking macros, but I tested MPICH 4.2 extensively today and it does not appear there.
In MPI_Rget_accumulate(NULL, 0, MPI_BYTE, .. , MPI_NO_OP, ..), the implementation incorrectly says that MPI_BYTE has not been committed.
Reproduce by running this in e.g. /tmp:
It fails here:
MPI_BYTE is a predefined datatype, and predefined datatypes do not need to be committed.
This is a patch that works around the Intel MPI bug, and therefore reveals the problem:
The setting ARMCI_RMA_ATOMICITY=0 disables this code path in favor of the following MPI_Get, which works just fine with the same arguments except for the (NULL,0,MPI_BYTE) tuple, which of course is unused.