sandialabs / portals4

Portals is a low-level network API for high-performance networking on high-performance computing systems developed by Sandia National Laboratories, Intel Corporation, and the University of New Mexico. The Portals 4 Reference Implementation is a complete implementation of Portals 4, with transport over InfiniBand verbs and UDP. Shared memory transport is available as an optimization, including Linux KNEM support. The Portals 4 reference implementation is supported on both modern 64-bit Linux and 64-bit Mac OS X. The reference implementation has been developed by Sandia National Laboratories, Intel Corporation, and System Fabric Works. For more information on the Portals 4 standard, please see the Portals 4 page.
https://www.sandia.gov/portals/

Open MPI MPI_Comm_split crashes with Portals4 #82

Open tkordenbrock opened 5 years ago

tkordenbrock commented 5 years ago

From open-mpi/ompi#7217:

Details of the problem

Calling MPI_Comm_split causes an immediate crash with the assertion:

../../../src/ib/ptl_ct.c:567: ct_check: Assertion `buf->type == BUF_TRIGGERED' failed.
reduce: ../../../src/ib/ptl_ct.c:567: ct_check: Assertion `buf->type == BUF_TRIGGERED' failed.
[bold-node013:14014] *** Process received signal ***
[bold-node013:14014] Signal: Aborted (6)
[bold-node013:14014] Signal code:  (-6)
[bold-node013:14014] [ 1] /home/pt2/openmpi-4.0.1/_build/../_install/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7fbc4ba705ac]
[bold-node013:14014] [ 2] /home/pt2/openmpi-4.0.1/_install/lib/libmpi.so.40(+0x3105d)[0x7fbc4c80205d]
[bold-node013:14014] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2e266)[0x7fb7c5f82266]
[bold-node013:14014] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2e312)[0x7fb7c5f82312]
[bold-node013:14014] [ 5] /home/pt2/portals4/_build/../_install/lib/libportals.so.4(+0x84ab)[0x7fb7b9f854ab]
[bold-node013:14014] [ 6] /home/pt2/portals4/_build/../_install/lib/libportals.so.4(PtlCTFree+0xd7)[0x7fb7b9f84cbb]
[bold-node013:14014] [ 7] /home/pt2/openmpi-4.0.1/_build/../_install/lib/openmpi/mca_coll_portals4.so(ompi_coll_portals4_iallreduce_intra_fini+0x15b)[0x7fb7b239c9db]
[bold-node013:14014] [ 8] /home/pt2/openmpi-4.0.1/_build/../_install/lib/openmpi/mca_coll_portals4.so(+0x40c5)[0x7fb7b239d0c5]
[bold-node013:14014] [ 9] /home/pt2/openmpi-4.0.1/_build/../_install/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7fb7c57bb5ac]
[bold-node013:14014] [10] /home/pt2/openmpi-4.0.1/_install/lib/libmpi.so.40(+0x3105d)[0x7fb7c654d05d]
[bold-node013:14014] [11] /home/pt2/openmpi-4.0.1/_install/lib/libmpi.so.40(ompi_comm_nextcid+0x29)[0x7fb7c654ebc9]
[bold-node013:14014] [12] /home/pt2/openmpi-4.0.1/_install/lib/libmpi.so.40(ompi_comm_split+0x3ea)[0x7fb7c654aaca]
[bold-node013:14014] [13] /home/pt2/openmpi-4.0.1/_install/lib/libmpi.so.40(MPI_Comm_split+0xa8)[0x7fb7c65854d8]
[bold-node013:14014] [14] ./reduce[0x40089c]
[bold-node013:14014] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fb7c5f75b45]
[bold-node013:14014] [16] ./reduce[0x400769]
[bold-node013:14014] *** End of error message ***

Since the problem appears to be with MPI_Iallreduce, I tried running this sample program:

#include <mpi.h>
int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
#if IALLREDUCE
        MPI_Request rq;
        int send = 1, recv = 0;
        MPI_Iallreduce(&send, &recv, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &rq);
        MPI_Wait(&rq, MPI_STATUS_IGNORE);
#elif ALLREDUCE
        int send = 1, recv = 0;
        MPI_Allreduce(&send, &recv, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
#else
        int rank;
        MPI_Comm out;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_split(MPI_COMM_WORLD, rank / 2, rank, &out);
#endif
        MPI_Finalize();
        return 0;
}

MPI_Allreduce works properly, but MPI_Iallreduce and MPI_Comm_split both fail; MPI_Iallreduce crashes with a stack trace similar to the one above.