pmodels / mpich

Official MPICH Repository
http://www.mpich.org

bug: ch4 memory corruption in MPI_Datatype management exposed by GA #2516

Open pkcoff opened 7 years ago

pkcoff commented 7 years ago

The NWChem application uses Global Arrays (GA) for its PGAS communication. NWChem has historically performed very badly on BGQ when GA runs over mpich-pamid, so I would like to get a GA build on BGQ that uses CH4-OFI to see whether performance improves. There is one INCITE project planning to use NWChem this year on BGQ, as well as an ADSP and an Aurora-ESP project on KNL, so going forward GA will need to run effectively with CH4-OFI. GA is built on top of ARMCI-MPI for the actual MPI communication, and Jeff Hammond from Intel has an mpi3rma version of it, which I used to get the best current performance: https://wiki.mpich.org/armci-mpi/index.php/NWChem

However, during NWChem testing I uncovered a memory corruption issue and tracked it down to the ARMCI-MPI layer and what looks like a problem with MPI_Datatype management in CH4. I have isolated a particular ARMCI-MPI regression test that exposes the issue. Running the test_mpi_dim ARMCI-MPI test program on just 2 nodes / ranks:

qsub -A aurora_app -t 29 --nodecount 2 --mode c1 --env RUNJOB_LABEL=short:ARMCI_VERBOSE=1 ./test_mpi_dim

I got a SIGABRT (signal 6) for an invalid pointer at src/mpid/ch4/netmod/ofi/ofi_impl.h line 291, which is the MPL_free call in this function:

MPL_STATIC_INLINE_PREFIX void MPIDI_OFI_win_datatype_unmap(MPIDI_OFI_win_datatype_t * dt)
{
    if (dt && dt->map && (dt->map != &dt->__map))
        MPL_free(dt->map);
}
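
For context, the guard in that function suggests a small-buffer pattern: `map` points either at storage embedded in the struct (`__map`) or at a separate heap allocation, and only the heap case is freed. The following is a minimal standalone sketch of that pattern, not the MPICH code. The type `dt_map_t`, the field `inline_map`, the constant `INLINE_CAP`, and both functions are hypothetical names chosen for illustration, and the struct layout is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real struct; the layout is an assumption. */
#define INLINE_CAP 4

typedef struct {
    size_t *map;                    /* points at inline_map or at a heap block */
    size_t inline_map[INLINE_CAP];  /* embedded storage for small maps */
    size_t count;
} dt_map_t;

/* Use the embedded storage for small maps, the heap otherwise.
 * Returns 0 on success, -1 if the heap allocation fails. */
static int dt_map_init(dt_map_t *dt, size_t count)
{
    assert(dt != NULL);
    dt->count = count;
    if (count <= INLINE_CAP) {
        dt->map = dt->inline_map;
        return 0;
    }
    dt->map = malloc(count * sizeof *dt->map);
    return dt->map ? 0 : -1;
}

/* Free only a heap-backed map, mirroring the guard in the function above.
 * Returns 1 if a heap block was freed, 0 otherwise. */
static int dt_map_unmap(dt_map_t *dt)
{
    if (dt && dt->map && dt->map != dt->inline_map) {
        free(dt->map);
        dt->map = NULL;  /* a second unmap becomes a no-op */
        return 1;
    }
    return 0;
}
```

In a scheme like this, the guard only distinguishes "embedded" from "anything else". If `map` were overwritten, left uninitialized, or already freed, it would still compare unequal to the embedded storage and be handed to the allocator, which is consistent with the abort from `_int_free` in the trace.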

The full stack trace is here:

abort
/admin_home/ascovel/toolchain/bgsys/drivers/V1R2M4/ppc64/toolchain-4.7.2/gnu/glibc-2.17/stdlib/abort.c:75
__libc_message
/admin_home/ascovel/toolchain/bgsys/drivers/V1R2M4/ppc64/toolchain-4.7.2/gnu/glibc-2.17/libio/../sysdeps/unix/sysv/linux/libc_fatal.c:196
malloc_printerr
/admin_home/ascovel/toolchain/bgsys/drivers/V1R2M4/ppc64/toolchain-4.7.2/gnu/glibc-2.17/malloc/malloc.c:4912
_int_free
/admin_home/ascovel/toolchain/bgsys/drivers/V1R2M4/ppc64/toolchain-4.7.2/gnu/glibc-2.17/malloc/malloc.c:3766
MPIDI_OFI_win_datatype_unmap
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpid/ch4/netmod/include/../ofi/ofi_impl.h:291
MPIDI_OFI_win_request_complete
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpid/ch4/netmod/include/../ofi/ofi_impl.h:301
MPIDI_OFI_rma_done_event
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpid/ch4/netmod/include/../ofi/ofi_events.h:418
MPIDI_OFI_win_progress_fence
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpid/ch4/netmod/include/../ofi/ofi_win.h:294
MPIDI_NM_mpi_win_flush_local
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpid/ch4/netmod/include/../ofi/ofi_win.h:1351
MPID_Win_flush_local
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpid/ch4/src/ch4_win.h:315
PMPI_Win_flush_local
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpi/rma/win_flush_local.c:108
test_dim
/projects/aurora_app/nwchem/armci-mpi3rma/build/../armci-mpi/tests/mpi/test_mpi_dim.c:496
main
/projects/aurora_app/nwchem/armci-mpi3rma/build/../armci-mpi/tests/mpi/test_mpi_dim.c:540

You should be able to reproduce this with another OFI provider: since this datatype is created and managed in the CH4 layer, I doubt this is a BGQ-specific issue. Here is how I built the test on BGQ; modify accordingly:

mkdir armci-mpi3rma
cd armci-mpi3rma
git clone http://git.mpich.org/armci-mpi.git
cd armci-mpi
git checkout mpi3rma
./autogen.sh
cd ../
mkdir build
cd build
export MPICH_CC=/soft/compilers/ibmcmp-may2015/vac/bg/12.1/bin/bgxlc_r
../armci-mpi/configure CC=/projects/aurora_app/mpichinstall/gnu472-opt/bin/mpicc CFLAGS="-g -O3" --prefix=/path/to/install
make -j16 install
make checkprogs

The resulting tests/mpi/test_mpi_dim is the binary I ran above.

hajimefu commented 7 years ago

@pkcoff Thank you for reporting. @raffenet can you assign this to me? @mblockso you might want to jump in too because I think you saw another crash here before.

raffenet commented 7 years ago

Is this still an issue? If so, can we get a small reproducer to help debug the problem?

hajimefu commented 6 years ago

@pkcoff I'm not sure whether this is still an issue, but our recent patches in #2876 completely rewrote the RMA datatype handling (for example, MPIDI_OFI_win_datatype_unmap no longer exists), so it may be worth trying again. We have also recently fixed several failures in the ARMCI-MPI test cases.

CC: @shawnccx