Open pkcoff opened 7 years ago
@pkcoff Thank you for reporting. @raffenet can you assign this to me? @mblockso you might want to jump in too because I think you saw another crash here before.
Is this still an issue? If so, can we get a small reproducer to help debug the problem?
@pkcoff I'm not sure whether this is still an issue, but our recent patches in #2876 completely rewrote the RMA datatype handling (MPIDI_OFI_win_datatype_unmap no longer exists, for example), so it may be worth trying again. We have also recently fixed several failures in the ARMCI-MPI test cases.
CC: @shawnccx
The NWChem application uses Global Arrays (GA) communication for PGAS. NWChem has historically performed very poorly on BGQ when GA runs over mpich-pamid, so I would like a GA build on BGQ that uses CH4-OFI to see whether we can get better performance. One INCITE project is planning to use NWChem on BGQ this year, as are an ADSP and an Aurora-ESP project on KNL, so going forward GA will need to run effectively with CH4-OFI. GA is built on top of ARMCI-MPI for the actual MPI communication, and Jeff Hammond from Intel has an mpi3rma version of it, which I used to get the latest and best performance: https://wiki.mpich.org/armci-mpi/index.php/NWChem
However, during NWChem testing I uncovered a memory corruption issue, which I tracked down to the ARMCI-MPI layer and to what looks like a problem with MPI_Datatype management in CH4. I have isolated a particular ARMCI-MPI regression test that exposes the issue. Running the test_mpi_dim ARMCI-MPI test program on just 2 nodes / ranks:
I got a sig-6 (SIGABRT) for an invalid pointer in src/mpid/ch4/netmod/ofi/ofi_impl.h at line 291, in the MPL_free call in this function:
The full stack trace is here:
You should be able to reproduce this with another OFI provider: since this datatype is created and managed in the CH4 layer, I doubt this is a BGQ-specific issue. Here is how I built this test on BGQ; modify accordingly:
and then tests/mpi/test_mpi_dim is the binary I ran above.