CH4/OFI: Hang in NM Progress thread exposed by send-recv with subcomms and MPI_COMM_WORLD #2565

Closed pkcoff closed 7 years ago

pkcoff commented 7 years ago

The ClimateEnergy_2 INCITE project on BGQ has a code named ACME that hangs with pami, so they are trying ofi. However, there appears to be a bug they hit before they can even get to the pami hang, resulting in a different hang that I have been able to reproduce with a simple testcase. I have attached the test and stack trace. Essentially, the test creates a subcomm for every rank in MPI_COMM_WORLD except rank 0, has each subcomm send its size to rank 0 (an irecv-send-waitall loop), and then does a barrier after the waitall. The hang occurs in the barrier, but it isn't specifically an issue with the barrier code; in the actual ACME code there isn't a barrier but rather other collectives and send-recv operations on MPI_COMM_WORLD that hang. On BGQ I can reproduce this with just 4 ranks. So the problem has something to do with the progress thread and completion queue processing when using subcomms and then going back to using MPI_COMM_WORLD. I think the problem is in CH4, but I'm not 100% sure (this is with FI_PROGRESS_MANUAL set in ofi). Could someone run this test on sockets/linux and see if it also hangs there? Thanks. Due to technical issues with github I can't attach any files at the moment, so for now here is the source code; I'll attach the files and log once things are working again:

#include "mpi.h"
#include <stdio.h>
#include <malloc.h>

int main(int argc, char *argv[])
{
    int errs = 0;
    int rank, size, color, srank;
    MPI_Comm world_comm, *subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &world_comm);

    MPI_Comm_rank(world_comm, &rank);
    MPI_Comm_size(world_comm, &size);
    MPI_Group world_group, *subgroup;
    int range_triplet[3];

    MPI_Comm_group(world_comm, &world_group);
    subcomm = (MPI_Comm *) malloc(sizeof(MPI_Comm) * size);
    subgroup = (MPI_Group *) malloc(sizeof(MPI_Group) * size);

    int i;
    for (i=0;i<(size-1);i++) {
      range_triplet[0] = i+1;
      range_triplet[1] = i+1;
      range_triplet[2] = 1;
      MPI_Group_range_incl(world_group, 1, &range_triplet,&(subgroup[i]));
      MPI_Comm_create(world_comm,subgroup[i],&(subcomm[i]));
    }
    if (rank ==0)
      printf("created %d subcomms\n",size-1);

    int *root_nprocs;
    root_nprocs = (int *) malloc(sizeof(int) * size);

    MPI_Request *subrequest;
    subrequest = (MPI_Request *) malloc(sizeof(MPI_Request) * size);
    MPI_Status *substatus;
    substatus = (MPI_Status *) malloc(sizeof(MPI_Status) * size);

    if (rank == 0) {
      for (i=0;i<size-1;i++) {
        MPI_Irecv(&(root_nprocs[i]),1,MPI_INT,MPI_ANY_SOURCE,i,world_comm,&(subrequest[i]));
      }
    }
    for (i=0;i<size-1;i++) {
      int subrank,subsize;
      if (subcomm[i] != MPI_COMM_NULL) {
      MPI_Comm_rank(subcomm[i], &subrank);
      MPI_Comm_size(subcomm[i], &subsize);
      if (subrank == 0) {
        printf("world rank %d which is rank 0 on subcomm %d sending subcomm size %d to world root 0\n",rank,i,subsize);
        MPI_Send(&subsize,1,MPI_INT,0,i,world_comm);
      }
      }
    }

    if (rank == 0)
      MPI_Waitall((size-1),subrequest,substatus);

    MPI_Barrier(world_comm);

    if (rank == 0) 
      printf("test complete\n");
}
pkcoff commented 7 years ago

I should also add that if I run on just 1 node in c4 mode, all the ranks hang in the barrier, but if I run on 4 nodes in c1 mode, ranks 1-3 fall through the barrier while rank 0 remains stuck in it, so the intranode code path also makes a difference in the nature of the hang.

mblockso commented 7 years ago

> So the problem has something to do with the progress thread and completion queue processing when using subcomms and then going back to using MPI_COMM_WORLD.

What progress thread are you referring to since the bgq provider doesn't use one with FI_PROGRESS_MANUAL?

pkcoff commented 7 years ago

I meant to say the progress engine from mpich, not a progress thread.

pkcoff commented 7 years ago

Here are the source file and stack traces, attached: acmecommhang.zip

raffenet commented 7 years ago

@pkcoff this test completes on my laptop with 4 ranks with the sockets provider. It completes in a default config, and also with the local debug options MPIR_CVAR_NOLOCAL=1 and MPIR_CVAR_ODD_EVEN_CLIQUES=1.

pkcoff commented 7 years ago

OK, this is probably a bgq issue; I am continuing to investigate. Thanks.

pkcoff commented 7 years ago

Ugh, it is a heisenbug: with full tracing turned on it works fine.

pkcoff commented 7 years ago

It has something to do with MPI_ANY_SOURCE and the irecv; if I give explicit source ranks, it works.
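
For reference, in the reproducer above the only member of subcomm[i] is world rank i+1, so the explicit-source workaround amounts to the following variant of the receive loop (a sketch that reuses the reproducer's variables, not a separate program):

/* Workaround sketch: the only member of subcomm[i] is world rank i+1,
 * so the explicit source for tag i is i+1 instead of MPI_ANY_SOURCE. */
if (rank == 0) {
    for (i = 0; i < size - 1; i++) {
        MPI_Irecv(&(root_nprocs[i]), 1, MPI_INT, i + 1, i, world_comm, &(subrequest[i]));
    }
}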

raffenet commented 7 years ago

MPI_ANY_SOURCE handling at the CH4 level is not quite right. Currently we create 2 recv requests - one for the netmod, one for shm. When one completes, the other is not freed/canceled as it should be. It might be hanging waiting for one or more of those orphaned requests to complete?
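
As an MPI-level analogy of the cleanup being described (not the actual CH4 internals, whose request types and functions differ), the idea is that once one of the duplicate receives completes, the other has to be canceled and completed rather than left orphaned:

#include <mpi.h>

/* Analogy only: two receives posted for one logical message stand in for the
 * netmod and shm requests; whichever does not match is canceled and completed
 * so no orphaned request is left behind. Run with at least 2 ranks. */
int main(int argc, char *argv[])
{
    int rank, size, a = -1, b = -1, idx;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            MPI_Irecv(&a, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Irecv(&b, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &reqs[1]);

            /* Only one send arrives, so only one request will complete. */
            MPI_Waitany(2, reqs, &idx, MPI_STATUS_IGNORE);

            /* Cancel and complete the receive that did not match. */
            MPI_Cancel(&reqs[1 - idx]);
            MPI_Wait(&reqs[1 - idx], MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            int val = 42;
            MPI_Send(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}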

mblockso commented 7 years ago

I think the standard bgq config has the shm netmod disabled, correct? If so, then maybe this disabled-shm code path is not well tested?

pkcoff commented 7 years ago

There is definitely a timing element to this as well: if I keep the MPI_ANY_SOURCE but just add a sleep(5) in front of the barrier, it works too. The barrier implementation in CH4 does a bcast via MPIR_Bcast_binomial, which in turn does a bunch of MPIC_Recv/MPIC_Send calls. I think the MPI_ANY_SOURCE irecv posted by rank 0 in my earlier irecv-send-waitall loop is matching one of the barrier's sends. That brings into question the is_match function we're using for the recv contexts: I believe our current logic disregards the tag in the case of MPI_ANY_SOURCE (so if MPI_ANY_SOURCE is specified, the context is considered a match even if the tags differ). I think that is wrong; we still need to look at the tag.
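
A minimal sketch of the matching rule being argued for here, with the source and tag checked independently so MPI_ANY_SOURCE wildcards only the source field (the struct and function names are illustrative, not the actual CH4/provider code):

#include <stdbool.h>
#include <mpi.h>

/* Illustrative envelope; the real implementation packs these fields differently. */
struct match_envelope {
    int source;
    int tag;
};

/* A posted receive matches an incoming message only if BOTH fields match:
 * MPI_ANY_SOURCE wildcards the source and MPI_ANY_TAG wildcards the tag,
 * but MPI_ANY_SOURCE must not cause the tag comparison to be skipped. */
static bool envelope_matches(const struct match_envelope *posted,
                             const struct match_envelope *incoming)
{
    bool source_ok = (posted->source == MPI_ANY_SOURCE) || (posted->source == incoming->source);
    bool tag_ok = (posted->tag == MPI_ANY_TAG) || (posted->tag == incoming->tag);
    return source_ok && tag_ok;
}

Under this rule, the any-source irecv tagged i in the reproducer could not be matched by the barrier's MPIR_BCAST_TAG send.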

pkcoff commented 7 years ago

I think the barrier's send uses MPIR_BCAST_TAG precisely to prevent this from happening:

mpi_errno = MPIC_Send(tmp_buf, nbytes, MPI_BYTE, dst, MPIR_BCAST_TAG, comm_ptr, errflag);

pkcoff commented 7 years ago

I changed our matching function to check the tag independently of the source in the case of MPI_ANY_SOURCE, and that seems to fix this problem. I need to confirm some things with Mike, but I can't believe we could be this fundamentally broken and still pass the test suite and run a bunch of apps successfully...

raffenet commented 7 years ago

Great! It would be good to get this patch into master soon.

pkcoff commented 7 years ago

Mike concurred; I ran the test suite, and NEK, which was previously broken, now works. I have an open PR waiting on the ofiwg people to approve it: https://github.com/ofiwg/libfabric/pull/2773

raffenet commented 7 years ago

Fixed via https://github.com/ofiwg/libfabric/pull/2773.