mpickpt / mana

MANA for MPI
Check if a probed msg has a matched pending Irecv #300

Closed xuyao0127 closed 1 year ago

xuyao0127 commented 1 year ago

This PR fixes the bit-for-bit bug. While draining point-to-point messages at checkpoint time, MPI_Iprobe can detect a message that is destined for a pending MPI_Irecv. MANA then receives that message with an additional MPI_Irecv and puts it into an internal buffer, which is used after restart/resume. In other words, the MPI_Iprobe path overtakes the pending MPI_Irecv and changes the order in which messages are drained. This behavior violates the MPI standard's rule that "Nonblocking communication operations are ordered according to the execution order of the calls that initiate the communication."

This bug can happen when the network is busy, or when messages are so large that the progress engine has not yet shared their metadata with the receivers. In either case, the pending MPI_Irecv's are not yet aware of the incoming messages. MPICH's MPI_Iprobe implementation kicks the progress engine if there are no incoming messages and then rechecks the message queue. Because of this extra progress and the recheck, the MPI_Iprobe inserted by MANA can detect a message's metadata before the pending MPI_Irecv does.

In this PR, each message detected by MPI_Iprobe is compared against the envelopes of the pending MPI_Irecv's. If there is a matching MPI_Irecv, MANA calls MPI_Wait on the corresponding request to force the pending MPI_Irecv to claim the message and complete the communication. This change enforces the message order defined in the MPI standard.
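
The envelope comparison described above can be illustrated with a minimal sketch in plain C. This is not MANA's actual code: the types `pending_irecv_t` and `probed_envelope_t`, the function `find_matching_irecv`, and the constants `SKETCH_ANY_SOURCE` / `SKETCH_ANY_TAG` are hypothetical names, and an integer `comm_id` stands in for an MPI communicator so that the sketch compiles without an MPI installation. It assumes pending receives are stored in the order they were posted, and that a match requires the same communicator plus a source and tag that are equal or wildcarded on the receive side.

```c
#include <stddef.h>

/* Hypothetical stand-ins for MPI_ANY_SOURCE and MPI_ANY_TAG. */
enum { SKETCH_ANY_SOURCE = -1, SKETCH_ANY_TAG = -2 };

/* Hypothetical record of an MPI_Irecv that has been posted. */
typedef struct {
    int source;     /* sender rank, or SKETCH_ANY_SOURCE */
    int tag;        /* message tag, or SKETCH_ANY_TAG */
    int comm_id;    /* stand-in for the communicator handle */
    int completed;  /* nonzero once the receive has completed */
} pending_irecv_t;

/* Hypothetical envelope of a message reported by a probe. */
typedef struct {
    int source;
    int tag;
    int comm_id;
} probed_envelope_t;

/* Return 1 if the receive's envelope accepts the probed message. */
static int envelope_matches(const pending_irecv_t *recv,
                            const probed_envelope_t *msg)
{
    if (recv->comm_id != msg->comm_id)
        return 0;
    if (recv->source != SKETCH_ANY_SOURCE && recv->source != msg->source)
        return 0;
    if (recv->tag != SKETCH_ANY_TAG && recv->tag != msg->tag)
        return 0;
    return 1;
}

/*
 * Scan the pending receives in posting order and return the index of the
 * first incomplete one whose envelope matches the probed message, or -1
 * if none matches.
 */
int find_matching_irecv(const pending_irecv_t *pending, size_t count,
                        const probed_envelope_t *msg)
{
    for (size_t i = 0; i < count; i++) {
        if (!pending[i].completed && envelope_matches(&pending[i], msg))
            return (int)i;
    }
    return -1;
}
```

In a real drain loop, the probed envelope would presumably come from the `MPI_SOURCE` and `MPI_TAG` fields of the `MPI_Status` filled in by MPI_Iprobe. A non-negative result would lead to MPI_Wait on that pending request, and -1 would fall back to receiving the message into the internal buffer.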