pmodels / mpich

Official MPICH Repository
http://www.mpich.org

BGQ: CH4 OFI MPI_Iprobe does not succeed when probing self #2439

Closed pkcoff closed 7 years ago

pkcoff commented 7 years ago
The mpich testsuite test pt2pt/isendselfprobe.c hangs, stuck in this loop:

while (!flag) {
    MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &status);
}

Basically the rank is probing itself; the flag is never set to true, indicating that reception of the message is never resolved.
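
For reference, here is a minimal standalone sketch of the pattern the test exercises, assuming it mirrors pt2pt/isendselfprobe.c: rank 0 posts a nonblocking send to itself and then spins on MPI_Iprobe until the message becomes visible. The reported hang is that flag never becomes true.

/* Minimal self-probe reproducer sketch (not the testsuite source verbatim). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, flag = 0, sendMsg = 123, recvMsg = 0;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* nonblocking send to self */
        MPI_Isend(&sendMsg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        while (!flag) {
            /* probing self: should eventually see the message above */
            MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &status);
        }
        MPI_Recv(&recvMsg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Wait(&request, MPI_STATUS_IGNORE);
        printf("received %d\n", recvMsg);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}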

raffenet commented 7 years ago

@pkcoff is this still an issue? Commit 7313db77bb157a663550bccdf5950aec7997f131 may have fixed this by adding a progress poke to MPID_Iprobe.
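
For context, a rough sketch of what a "progress poke" in an iprobe path amounts to; the wrapper below is illustrative, with argument lists simplified relative to the real MPICH code, and is not the commit's actual diff:

/* Illustrative only: kick the progress engine once before asking the netmod
 * to peek, so that a message which has already arrived can be matched.
 * Signatures are simplified relative to the real MPICH source. */
static int iprobe_with_progress_poke(int source, int tag, MPIR_Comm *comm,
                                     int context_offset, int *flag,
                                     MPI_Status *status)
{
    int mpi_errno = MPID_Progress_test();   /* the "poke" */
    if (mpi_errno != MPI_SUCCESS)
        return mpi_errno;
    return MPIDI_NM_mpi_iprobe(source, tag, comm, context_offset, flag, status);
}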

pkcoff commented 7 years ago

Hey Ken - I should have been more diligent about following up on this. We now have a user who has run into this issue, which still happens with the latest master; it is stuck in progress:

/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpid/ch4/netmod/ofi/libfabric/prov/bgq/include/rdma/fi_direct_eq.h:508
fi_bgq_cq_read_FI_CQ_FORMAT_TAGGED_0
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpid/ch4/netmod/ofi/libfabric/prov/bgq/include/rdma/fi_direct_eq.h:534
MPIDI_NM_progress
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpid/ch4/netmod/include/../ofi/ofi_progress.h:35
MPID_Progress_test
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpid/ch4/src/ch4_progress.h:42
MPIDI_OFI_do_iprobe
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpid/ch4/netmod/include/../ofi/ofi_probe.h:65
MPIDI_NM_mpi_iprobe
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpid/ch4/netmod/include/../ofi/ofi_probe.h:149
MPID_Iprobe
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpid/ch4/src/ch4_probe.h:206
PMPI_Iprobe
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/src/mpi/pt2pt/iprobe.c:106
main
/home/pkcoff/development/github/OFI-BGQ-BuildEnv/mpi/mpich/test/mpi/pt2pt/isendselfprobe.c:28
generic_start_main
/admin_home/ascovel/toolchain/bgsys/drivers/V1R2M4/ppc64/toolchain-4.7.2/gnu/glibc-2.17/csu/../csu/libc-start.c:258
__libc_start_main
/admin_home/ascovel/toolchain/bgsys/drivers/V1R2M4/ppc64/toolchain-4.7.2/gnu/glibc-2.17/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
??
??:0
mblockso commented 7 years ago

Does MPICH/CH4 intercept the "send to self" path? That is not necessary for the bgq ofi provider, since it already supports send-to-self and loopback (this is why we disable shared memory in the bgq builds).

Is there a way to disable the send-to-self branch and code path?

raffenet commented 7 years ago

CH4 does not currently have a "self" send protocol. Only local and non-local.

raffenet commented 7 years ago

If SHM is disabled, OFI should be handling everything, AFAICT.

mblockso commented 7 years ago

If that is the case then we should start looking for the bug in the bgq ofi provider. I'm not sure how this "self" code path is any different than the "loopback" code path. Maybe it is because of some funky interaction between fi_recv/fi_send/cq_read all being in the same process?

@pkcoff could you instrument the code to print the descriptor as it is injected and then the packets as they are received? Or does that change in timing hide the bug? I'm wondering if the packets are being sent/delivered correctly or if it is a reporting problem with fi_cq_read, etc.

pkcoff commented 7 years ago

OK, the user is actually probing other ranks, but this test probably illustrates the same problem and is simple. I'll debug it from the ofi send side and let you know. Thanks.

pkcoff commented 7 years ago

Here is an excerpt from the simple test I wrote for this, modified from isendselfprobe.c but easier for me to debug since it does a blocking MPI_Probe instead, which blocks until the match: rank 0 does an MPI_Isend to rank 1, and rank 1 probes for the message from rank 0 before receiving it (this works fine with the pami build), e.g.:

if (rank == 0) {
    MPI_Isend(&sendMsg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
    MPI_Wait(&request, &status);
} else if (rank == 1) {
    MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_INT, &count);
    if (count != 1) {
        errs++;
    }
    MPI_Recv(&recvMsg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    if (recvMsg != 123) {
        errs++;
    }
}
MPI_Barrier(MPI_COMM_WORLD);

So what happens at the mpich layer is that it winds up calling MPIDI_OFI_do_iprobe in ofi_probe.h, which then calls fi_trecvmsg and waits for the request object to be set:

MPIDI_OFI_CALL(fi_trecvmsg
               (MPIDI_OFI_EP_RX_TAG(0), &msg,
                peek_flags | FI_PEEK | FI_COMPLETION | (MPIDI_OFI_ENABLE_DATA ? FI_REMOTE_CQ_DATA : 0) ), trecvmsg);
MPIDI_OFI_PROGRESS_WHILE(MPIDI_OFI_REQUEST(rreq, util_id) == MPIDI_OFI_PEEK_START);

But that never happens, and it gets stuck in MPIDI_OFI_PROGRESS_WHILE when running in manual progress mode. Down in ofi on the target (rank 1), the fi_trecvmsg is implemented as fi_bgq_trecvmsg_generic, which basically calls process_mfifo_context with just FI_PEEK for the msg flag and a context that will match the incoming tag from the isend on rank 0. That message will end up on the unexpected queue, and the peek needs to wait until the context matches. The problem is that the incoming isend message has to get processed by the rfifo and moved to the unexpected queue first, and that isn't what happens, because of how mpich makes progress on the target in manual mode: it processes the mpi_probe first and only then does the progress engine poll the rfifos.

Currently process_mfifo_context just posts an error completion event when it can't match, but I'm not sure how that affects the request object; it should somehow just get set to MPIDI_OFI_PEEK_NOT_FOUND. Otherwise, maybe down in ofi one solution could be to put it on the match queue but with some flag that says don't consume it. If I run in auto-progress mode and put a sleep(10) in front of the mpi_probe, then it works, since the isend messages have time to get onto the unexpected queue before the fi_trecvmsg runs, so the ofi code is good in that sense.
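
For reference, here is a hedged sketch of what that peek boils down to at the libfabric level; the field names follow struct fi_msg_tagged from the libfabric headers, while the endpoint, tag, and context arguments are placeholders rather than MPICH's exact plumbing:

/* Sketch of issuing a FI_PEEK probe via fi_trecvmsg. */
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_tagged.h>

static ssize_t post_peek(struct fid_ep *rx_ep, fi_addr_t src_addr,
                         uint64_t match_tag, uint64_t ignore_mask,
                         void *context)
{
    struct fi_msg_tagged msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov   = NULL;          /* a peek transfers no data */
    msg.iov_count = 0;
    msg.addr      = src_addr;      /* source to match */
    msg.tag       = match_tag;     /* encoded rank/tag/context id */
    msg.ignore    = ignore_mask;   /* wildcard bits */
    msg.context   = context;       /* completion is reported against this */

    /* FI_PEEK: only look for a matching message, do not consume it.
     * FI_COMPLETION: always report the result, either as a CQ entry
     * (match found) or a CQ error entry with err == FI_ENOMSG (no match). */
    return fi_trecvmsg(rx_ep, &msg, FI_PEEK | FI_COMPLETION);
}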

mblockso commented 7 years ago

I don't understand how the bgq provider behavior is incorrect. This is what the fi_tagged man page (https://ofiwg.github.io/libfabric/master/man/fi_tagged.3.html) states for the FI_PEEK flag:

FI_PEEK The peek flag may be used to see if a specified message has arrived. A peek request is often useful on endpoints that have provider allocated buffering enabled (see fi_rx_attr total_buffered_recv). Unlike standard receive operations, a receive operation with the FI_PEEK flag set does not remain queued with the provider after the peek completes successfully.

I think the "successfully" bit here is wrong .. the peek operation just checks if the matching message has arrived or not, then it posts a completion/error event. The peek operation does not remained queued. This makes more sense if you read on...

The peek operation operates asynchronously, and the results of the peek operation are available in the completion queue associated with the endpoint. If no message is found matching the tags specified in the peek request, then a completion queue error entry with err field set to FI_ENOMSG will be available.

This is what is happening, right? As @pkcoff said earlier, "... process_mfifo_context just posts an error completion event when it can’t match ..."

If a peek request locates a matching message, the operation will complete successfully. The returned completion data will indicate the meta-data associated with the message, such as the message length, completion flags, available CQ data, tag, and source address. The data available is subject to the completion entry format (e.g. struct fi_cq_tagged_entry).

I think the real problem here is that the ch4 ofi netmod is not checking for the error condition - it is just spinning in progress until the peek is successful. But the peek operation is not supposed to remain in the queue. The peek is a "one shot asynchronous" kind of operation.
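
To make that concrete, here is a hedged sketch of how the FI_ENOMSG outcome of a peek would surface through the generic libfabric CQ calls; this is not the netmod's actual completion handler, and the mapping to MPIDI_OFI_PEEK_NOT_FOUND is my reading of its intent:

/* Sketch: reading the result of a peek from the completion queue. */
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static void poll_peek_result(struct fid_cq *cq)
{
    struct fi_cq_tagged_entry entry;
    ssize_t ret = fi_cq_read(cq, &entry, 1);

    if (ret == 1) {
        /* matching message found: entry carries tag, length, flags,
         * and any remote CQ data for the peeked message */
    } else if (ret == -FI_EAVAIL) {
        /* the peek completed with an error entry */
        struct fi_cq_err_entry err;
        fi_cq_readerr(cq, &err, 0);
        if (err.err == FI_ENOMSG) {
            /* no matching unexpected message yet: the request attached to
             * err.op_context should be marked "peek not found" */
        }
    } else if (ret == -FI_EAGAIN) {
        /* nothing to report yet; keep driving progress and retry */
    }
}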

Because MPI_Probe behaves like MPI_Iprobe except that it is a blocking call that returns only after a matching message has been found, the implementation should loop and post another peek operation when the request util_id is not MPIDI_OFI_PEEK_FOUND.


I think the ch4 ofi netmod code should be changed to something like this:

peek_flags |= FI_PEEK;
peek_flags |= FI_COMPLETION;
if (MPIDI_OFI_ENABLE_DATA)
    peek_flags |= FI_REMOTE_CQ_DATA;

do {
    MPIDI_OFI_REQUEST(rreq, util_id) = MPIDI_OFI_PEEK_START;
    MPIDI_OFI_CALL(fi_trecvmsg(MPIDI_OFI_EP_RX_TAG(0), &msg, peek_flags), trecvmsg);

    MPIDI_OFI_PROGRESS_WHILE(MPIDI_OFI_REQUEST(rreq, util_id) == MPIDI_OFI_PEEK_START);
} while (MPIDI_OFI_REQUEST(rreq, util_id) != MPIDI_OFI_PEEK_FOUND);

raffenet commented 7 years ago

All netmod level probes are nonblocking. The loop for blocking probe happens at the CH4 level (https://github.com/pmodels/mpich/blob/master/src/mpid/ch4/src/ch4_probe.h#L34), which should do as you describe - post another trecvmsg each time it calls down into MPIDI_NM_mpi_iprobe.
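
Roughly, that CH4-level loop looks like the following (a simplified sketch, with argument lists abbreviated relative to ch4_probe.h):

/* Simplified sketch of the CH4-level blocking probe: each iteration calls the
 * nonblocking netmod iprobe, which posts a fresh FI_PEEK, and drives progress
 * between attempts. Not the exact MPICH source. */
static int blocking_probe_sketch(int source, int tag, MPIR_Comm *comm,
                                 int context_offset, MPI_Status *status)
{
    int mpi_errno = MPI_SUCCESS;
    int flag = 0;

    while (!flag) {
        mpi_errno = MPIDI_NM_mpi_iprobe(source, tag, comm, context_offset,
                                        &flag, status);
        if (mpi_errno != MPI_SUCCESS)
            break;
        if (!flag)
            MPID_Progress_test();   /* let arrived messages reach the queues */
    }
    return mpi_errno;
}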

pkcoff commented 7 years ago

Per the discussion Mike, Ken, and I had, there was a discrepancy on the ofi side between auto and manual mode in handling the error queue: in manual mode we were handling it the auto mode way. We should be using the fi_bgq_cq_enqueue_err function, which handles both modes properly. There was a locking bug in there that I fixed, and my mpid_probe testcase now works. I put trace in the ch4/ofi layer and it is functioning as we discussed, with MPIDI_OFI_peek_empty_event getting called and MPIDI_OFI_PEEK_NOT_FOUND set in the rreq. I still need to apply fi_bgq_cq_enqueue_err more generally to the rest of the code, do some more probe testing, and then run full regression tests; I will update tomorrow.

raffenet commented 7 years ago

Cool, sounds like you've got a handle on it. :+1:

pkcoff commented 7 years ago

Fixed in ofi, regression tests passed, waiting on user confirmation before I close this issue: https://github.com/ofiwg/libfabric/pull/2946

pkcoff commented 7 years ago

Finally got confirmation from the user that the iprobe is fixed, closing.