huttered40 / critter

Critical path analysis of MPI parallel programs
BSD 2-Clause "Simplified" License

Harmful communication patterns - nonblocking+blocking #47

Closed huttered40 closed 4 years ago

huttered40 commented 4 years ago

While debugging SLATE+critter, I noticed a communication pattern that Critter cannot currently handle:

1) MPI_Isend + MPI_Recv

huttered40 commented 4 years ago

If one process posts an MPI_Isend and another process posts the matching MPI_Recv, the send is nonblocking and the receive is blocking. The receiver will then post an internal MPI_Recv to receive the internal MPI_Isend from the sending process (the one initiated when the user's MPI_Isend was first intercepted).

The one concern I have is the order in which the sending process has called its MPI_Wait variant.

I actually don't think there is anything wrong here.

huttered40 commented 4 years ago

I ran some tests with print statements, and the MPI_Recvs don't seem to be completing with the corresponding MPI_Isends.

huttered40 commented 4 years ago

I had introduced a nasty bug, which I just fixed, involving use of the internal tag for the actual intercepted user communication.
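To illustrate why a tag mix-up like this would produce the symptom above (receives that never pair up with their sends), here is a toy model of tag-based message matching. It is a sketch under assumptions, not MPI and not Critter's code: the `Mailbox` class, the `run` helper, the value of `INTERNAL_TAG`, and the user tag `7` are all hypothetical, and the exact form of the real bug is not described in this thread.

```python
INTERNAL_TAG = 32766  # hypothetical reserved tag; Critter's real value is not stated here


class Mailbox:
    """Toy model of messages queued at one receiver (not real MPI)."""

    def __init__(self):
        self._queue = []

    def send(self, source, tag, payload):
        # Models a send that has been posted toward this receiver.
        self._queue.append((source, tag, payload))

    def try_recv(self, source, tag):
        # Return the earliest queued message matching (source, tag).
        # None means nothing matches; a blocking receive would keep waiting.
        for i, (src, t, payload) in enumerate(self._queue):
            if src == source and t == tag:
                self._queue.pop(i)
                return payload
        return None


def run(user_send_tag):
    """Sender forwards the user's message plus one profiler message."""
    box = Mailbox()
    user_tag = 7  # tag the user's receive is waiting on (hypothetical)
    box.send(source=0, tag=user_send_tag, payload="user data")
    box.send(source=0, tag=INTERNAL_TAG, payload="profiler data")
    user = box.try_recv(source=0, tag=user_tag)
    internal = box.try_recv(source=0, tag=INTERNAL_TAG)
    return user, internal


correct = run(user_send_tag=7)           # user message keeps its own tag
buggy = run(user_send_tag=INTERNAL_TAG)  # user message sent on the internal tag
print(correct)  # ('user data', 'profiler data')
print(buggy)    # (None, 'user data')
```

In the buggy run the user's receive finds nothing to match, and the internal receive picks up the user's payload instead of the profiler's. That is consistent with receives appearing not to complete, though the model only shows the general failure mode.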

Note that if the communication pattern is (MPI_Isend + MPI_Wait) + MPI_Recv, then MPI_Recv probably should not post an internal call to gather synchronization-cost information: it would have to receive from a nonblocking internal Isend, which over-complicates things and messes up the tracking.

Note that MPI_Recv will not return until the message has been received and copied into the user's buffer, so adding an internal MPI_Recv with a corresponding internal MPI_Isend + MPI_Wait should technically work, but I don't think tracking synchronization time makes sense here.

Note that for synchronous communication we track synchronization/latency cost, but for blocking or nonblocking communication we do not track synchronization cost.

Therefore, because SLATE uses MPI_Isend + MPI_Recv, we must make sure MPI_Recv uses the blocking protocol rather than the synchronous protocol. The protocol can be set before the build.
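The rule from the last few comments can be written down as a small check. This is only a sketch of the policy as stated in this thread: the names `Protocol`, `tracks_synchronization_cost`, and `recv_protocol_ok` are hypothetical, and Critter's actual build option is not shown here.

```python
from enum import Enum


class Protocol(Enum):
    SYNCHRONOUS = "synchronous"
    BLOCKING = "blocking"
    NONBLOCKING = "nonblocking"


def tracks_synchronization_cost(protocol):
    # Per the thread: only the synchronous protocol tracks
    # synchronization/latency cost.
    return protocol is Protocol.SYNCHRONOUS


def recv_protocol_ok(sender_call, recv_protocol):
    # Per the thread: an MPI_Recv paired with an MPI_Isend should use the
    # blocking protocol, not the synchronous one.
    if sender_call == "MPI_Isend":
        return recv_protocol is Protocol.BLOCKING
    return True  # other pairings are not discussed in the thread


print(recv_protocol_ok("MPI_Isend", Protocol.BLOCKING))     # True
print(recv_protocol_ok("MPI_Isend", Protocol.SYNCHRONOUS))  # False
```

A check like this could run once at startup against whatever protocol was chosen at build time, so that a SLATE-style MPI_Isend + MPI_Recv pairing fails early instead of silently corrupting the tracking.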