pmodels / armci-mpi

An implementation of ARMCI using MPI one-sided communication (RMA)
https://wiki.mpich.org/armci-mpi/index.php/Main_Page

Intel MPI 2021.10 Rget_accumulate false positive in error checking #55

Open jeffhammond opened 1 month ago

jeffhammond commented 1 month ago

This is not our bug and we will not fix it, but the details are documented here for posterity.

There is a bug in Intel MPI 2021.10 and Cray MPI 8.1.29 when using request-based RMA (https://github.com/pmodels/armci-mpi/pull/53). It could be an MPICH bug in the argument-checking macros, but I tested MPICH 4.2 extensively today and the problem does not appear there.

When MPI_Rget_accumulate(NULL, 0, MPI_BYTE, ..., MPI_NO_OP, ...) is called, the implementation incorrectly reports that MPI_BYTE has not been committed.
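
For illustration, the failing call pattern can be reduced to a few lines of MPI. This is only a minimal sketch (it is not taken from the armci-mpi sources; the window size, datatype, and target rank are arbitrary), but it exercises the same zero-count MPI_BYTE origin with MPI_NO_OP:

/* Minimal standalone sketch of the failing pattern (not from the armci-mpi
 * sources; window size, datatype, and target rank chosen arbitrarily).
 * On the affected implementations this should abort with
 * "Datatype has not been committed". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int me, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* a window holding a few doubles per rank */
    double *base;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)(4 * sizeof(double)), (int)sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    /* a committed user-defined contiguous type, like the USER<contig>
     * type in the error stack below */
    MPI_Datatype contig;
    MPI_Type_contiguous(4, MPI_DOUBLE, &contig);
    MPI_Type_commit(&contig);

    double result[4];
    MPI_Request req;

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    /* the failing pattern: zero-count MPI_BYTE origin with MPI_NO_OP;
     * the origin triple is ignored for MPI_NO_OP, and MPI_BYTE is
     * predefined and needs no commit */
    MPI_Rget_accumulate(NULL, 0, MPI_BYTE,
                        result, 1, contig,
                        (me + 1) % np, 0, 1, contig,
                        MPI_NO_OP, win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Win_unlock_all(win);

    MPI_Type_free(&contig);
    MPI_Win_free(&win);

    if (me == 0) printf("no error raised\n");

    MPI_Finalize();
    return 0;
}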

To reproduce, run this in e.g. /tmp:

. /opt/intel/oneapi/setvars.sh  --force
git clone --depth 1 https://github.com/jeffhammond/armci-mpi -b request-based-rma
cd armci-mpi
./autogen.sh
mkdir build
cd build
../configure CC=/opt/intel/oneapi/mpi/2021.10.0/bin/mpicc --enable-g
make -j checkprogs
export ARMCI_VERBOSE=1
mpirun -n 4 ./tests/contrib/armci-test # this fails
export ARMCI_RMA_ATOMICITY=0 # this disables MPI_Rget_accumulate(MPI_NO_OP)
mpirun -n 4 ./tests/contrib/armci-test # this works

It fails here:

Testing non-blocking gets and puts
local[0:2] -> remote[0:2] -> local[1:3]
local[1:3,0:0] -> remote[1:3,0:0] -> local[1:3,1:1]
local[2:3,0:1,2:3] -> remote[2:3,0:1,2:3] -> local[1:2,0:1,2:3]
local[2:2,1:1,3:5,1:5] -> remote[4:4,0:0,1:3,1:5] -> local[3:3,1:1,1:3,2:6]
local[1:4,1:1,0:0,2:6,0:2] -> remote[1:4,2:2,1:1,2:6,1:3] -> local[0:3,1:1,5:5,2:6,2:4]
local[1:4,0:2,1:7,5:6,0:6,1:2] -> remote[0:3,0:2,1:7,7:8,0:6,0:1] -> local[0:3,0:2,0:6,3:4,0:6,0:1]
local[3:4,0:1,0:0,5:7,5:6,0:1,0:1] -> remote[1:2,0:1,0:0,5:7,2:3,0:1,0:1] -> local[0:1,0:1,4:4,2:4,3:4,0:1,0:1]
Abort(336723971) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Rget_accumulate: Invalid datatype, error stack:
PMPI_Rget_accumulate(218): MPI_Rget_accumulate(origin_addr=(nil), origin_count=0, MPI_BYTE, result_addr=0x60d4e105cfd0, result_count=1, dtype=USER<contig>, target_rank=3, target_disp=8, target_count=1, dtype=USER<contig>, MPI_NO_OP, win=0xa0000001, 0x7ffc044c9558) failed
PMPI_Rget_accumulate(159): Datatype has not been committed

MPI_BYTE is a predefined datatype and does not need to be committed.

The following patch works around the Intel MPI bug and thereby confirms where the problem lies:

diff --git a/src/gmr.c b/src/gmr.c
index 129b97c..acf8539 100644
--- a/src/gmr.c
+++ b/src/gmr.c
@@ -603,7 +603,9 @@ int gmr_get_typed(gmr_t *mreg, void *src, int src_count, MPI_Datatype src_type,
     MPI_Request req = MPI_REQUEST_NULL;

     if (ARMCII_GLOBAL_STATE.rma_atomicity) {
-        MPI_Rget_accumulate(NULL, 0, MPI_BYTE,
+        // using the source type instead of MPI_BYTE works around an Intel MPI 2021.10 bug...
+        MPI_Rget_accumulate(NULL, 0, src_type /* MPI_BYTE */,
                             dst, dst_count, dst_type, grp_proc,
                             (MPI_Aint) disp, src_count, src_type,
                             MPI_NO_OP, mreg->window, &req);

Setting ARMCI_RMA_ATOMICITY=0 disables this code path in favor of the MPI_Get that follows it in gmr.c, which works fine with the same arguments minus the (NULL, 0, MPI_BYTE) tuple that MPI_NO_OP ignores anyway; see the sketch below.
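
For clarity, here is a simplified sketch (not the literal gmr.c code) of the two paths selected by that setting; the non-atomic branch is shown with MPI_Rget for symmetry with the request-based path, although the actual fallback may be a plain MPI_Get as described above:

/* Simplified sketch of gmr_get_typed's path selection (not the literal code). */
MPI_Request req = MPI_REQUEST_NULL;

if (ARMCII_GLOBAL_STATE.rma_atomicity) {
    /* Atomic path: get-accumulate with MPI_NO_OP reads the target atomically.
     * The origin triple (NULL, 0, MPI_BYTE) is ignored for MPI_NO_OP; this is
     * where the affected implementations wrongly demand a committed datatype. */
    MPI_Rget_accumulate(NULL, 0, MPI_BYTE,
                        dst, dst_count, dst_type, grp_proc,
                        (MPI_Aint) disp, src_count, src_type,
                        MPI_NO_OP, mreg->window, &req);
} else {
    /* Non-atomic path (ARMCI_RMA_ATOMICITY=0): the same result/target
     * arguments, with no origin triple at all. */
    MPI_Rget(dst, dst_count, dst_type, grp_proc,
             (MPI_Aint) disp, src_count, src_type,
             mreg->window, &req);
}
MPI_Wait(&req, MPI_STATUS_IGNORE);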

jeffhammond commented 1 month ago

This bug is in Cray MPI (/opt/cray/pe/mpich/8.1.29) too, so it presumably comes from MPICH code that both implementations inherit (even though I could not reproduce it with MPICH 4.2).

Fatal error in PMPI_Rget_accumulate: Invalid datatype, error stack:
PMPI_Rget_accumulate(235): MPI_Rget_accumulate(origin_addr=(nil), origin_count=0, MPI_BYTE, result_addr=0x37899d0, result_count=1, dtype=USER<contig>, target_rank=0, target_disp=8, target_count=1, dtype=USER<contig>, MPI_NO_OP, win=0xa0000002, 0x7fff7159aafc) failed
PMPI_Rget_accumulate(170): Datatype has not been committed
srun: error: nid002439: task 3: Exited with exit code 255
srun: Terminating StepId=8094140.0
slurmstepd: error: *** STEP 8094140.0 ON nid002438 CANCELLED AT 2024-10-03T17:53:15 ***
srun: error: nid002438: tasks 0-2: Exited with exit code 255