huttered40 / critter

Critical path analysis of MPI parallel programs
BSD 2-Clause "Simplified" License

Fix bug in caqr_tree - matching a MPI_Send with MPI_Irecv+MPI_Wait #29

Closed huttered40 closed 4 years ago

huttered40 commented 4 years ago

After extensive debugging, I think the reason candmc::caqr_tree hangs in apply_QT.cxx is that critter does not know how to handle the MPI_Send+MPI_Irecv+MPI_Wait pattern. I changed the MPI calls to PMPI calls in that file, and it ran with no hangs.

When MPI_Send is intercepted, critter expects a matching MPI_Recv on the other side. That never happens in the scenario above, so the sender just waits there, because the MPI_Irecv interception does not go into the stop() method the way MPI_Send's does.
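
To make the hang concrete, here is a minimal sketch of the mismatch. This is not critter's actual code: the handshake-via-PMPI_Sendrecv shape follows the description later in this thread, and the internal tag and one-double payload are made-up placeholders.

```cpp
#include <mpi.h>

static const int CP_TAG = 32767;  // hypothetical internal tag for critical-path data

// Intercepted MPI_Send: performs a blocking handshake with the receiver to
// exchange critical-path data, assuming the receiver's MPI_Recv wrapper posts
// the matching PMPI_Sendrecv.
int MPI_Send(const void* buf, int count, MPI_Datatype dt, int dest, int tag, MPI_Comm comm){
  double my_path = 0.0, peer_path = 0.0;  // stand-ins for the critical-path payload
  PMPI_Sendrecv(&my_path, 1, MPI_DOUBLE, dest, CP_TAG,
                &peer_path, 1, MPI_DOUBLE, dest, CP_TAG, comm, MPI_STATUS_IGNORE);
  return PMPI_Send(buf, count, dt, dest, tag, comm);
}

// Intercepted MPI_Irecv: nonblocking, so it cannot perform the handshake; the
// sender's PMPI_Sendrecv above is never matched and the program hangs.
int MPI_Irecv(void* buf, int count, MPI_Datatype dt, int src, int tag,
              MPI_Comm comm, MPI_Request* req){
  return PMPI_Irecv(buf, count, dt, src, tag, comm, req);
}
```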

How can we fix this?

huttered40 commented 4 years ago

Note that this goes along with the understanding that blocking communication, as in MPI_Send, does not guarantee that the message was received. It only guarantees that the send buffer can now be modified without corrupting the message in flight to the receiver.

MPI_Recv's communication protocol is implicitly stricter than MPI_Send's blocking protocol, as it returns only when the message is received, yet it is still under the umbrella of blocking. This is obviously much stricter than MPI_Irecv, which doesn't require the message to be received at all before it returns, and is on the same level as MPI_Wait.

Possible scenarios (not all, but the relevant ones):

  1. MPI_Send + MPI_Recv

  2. (MPI_Isend+MPI_Wait) + (MPI_Irecv+MPI_Wait)

  3. MPI_Send + (MPI_Irecv+MPI_Wait)

Scenarios 1. and 2. are handled correctly. 3. is not.

Check what the problem is with having MPI_Send wait in critter::stop. It hangs there, but why is that bad? How does critter::istop2, reached via interception of MPI_Wait, differ?


Update: none of the above scenarios are handled correctly, as critter currently uses a more restrictive communication protocol to propagate critical path data than the user's routine requires.


huttered40 commented 4 years ago

Here is what I posted on Mattermost that further summarizes the problem:

Regarding the new bug in critter I found by using CANDMC, the issue is when an MPI_Send is posted. Critter tries to intercept it and does a handshake (via MPI_Sendrecv) with the receiver that posted MPI_Recv in order to propagate the critical path information. This hasn't been an issue before because the p2p communication protocol has either been 1) MPI_Send+MPI_Recv or 2) (MPI_Isend+MPI_Wait)+(MPI_Irecv+MPI_Wait), where the critical path propagation occurs in MPI_Wait for the latter. However, CANDMC's bit-tree tsqr trailing matrix update routine uses (MPI_Send)+(MPI_Irecv+MPI_Wait), and thus it deadlocks when MPI_Send is intercepted, as it's waiting for the corresponding MPI_Recv to be posted. So the issue arises essentially any time an MPI_Send is posted, because I now think that even protocol 1) above is semantically wrong: critter shouldn't force the MPI_Send to wait for a handshake, as that uses a more limiting communication protocol than the user's communication routine.

The goal I had a few weeks back when debugging and fixing critter's nonblocking communication support was to propagate the critical path only when the communication protocol required an explicit handshake. MPI_Send is one outlier I didn't think of: it is blocking but doesn't require a handshake, so ideally I wouldn't propagate the critical path to the receiver here. I still need to think of a way to get that data to the receiver, though, because I don't want critter to give incorrect critical path data.

huttered40 commented 4 years ago

Edgar gave a potential fix: have the sender post a second send via PMPI_Send, which the receiver will somehow expect, containing the sender's critical path information.

Mull on this.


Update: not strictly a PMPI_Send, but a send variant that maintains the same communication protocol as the routine it's intercepting.


huttered40 commented 4 years ago

To continue expounding upon Edgar's idea above, I think it makes more sense to post an MPI_Irecv immediately in the MPI_Send interception (without posting an MPI_Sendrecv), because then we can save the corresponding MPI_Request in a map and wait on it later (via MPI_Wait), either when we next enter a handshake communication protocol or at the end of the program.

This raises two new questions: 1) If a process hits a communication routine requiring a handshake protocol and has k unprocessed requests, how can we faithfully compare against our critical paths when the receiver's critical path information has grown stale? We'd almost have to save our process's critical path information in separate buckets according to when each request was posted and then compare in sequence. Or would that break the protocol of not waiting on a handshake?

2) How does this interact, if at all, with critter's support for nonblocking communication?


Update: none of this is relevant anymore. The sender in an MPI_Send interception will never update its critical path based on that of the receiver, because the sender in a blocking protocol is technically not dependent on the receiver in this parallel schedule. In a synchronous send, it will call a PMPI_Sendrecv as a form of barrier and will then propagate critical paths with its receiver before the user communication actually takes place. Since even with synchronous p2p communication either process can leave the call before the other, we need not synchronize after the call. In an MPI_Isend interception, the sender will call another MPI_Isend with its critical path info (after the user call returns, so it includes its communication time for the Isend) and add the request to internal_requests, to be checked in the corresponding MPI_Wait. The _critter class can hold a vector of pointers to outstanding message buffers containing the process's critical path info for nonblocking sends, so the internal_requests global variable need only save the corresponding index and a pointer to the class itself.
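
A rough sketch of the MPI_Isend part of this scheme, under assumptions: the one-double payload, the tag value, and the container types are illustrative stand-ins for critter's internal_requests/_critter bookkeeping, not the actual implementation.

```cpp
#include <mpi.h>
#include <map>
#include <vector>

static const int CP_TAG = 32767;               // hypothetical internal tag
static std::vector<double*> cp_payloads;       // keeps payloads alive after the wrapper returns
static std::map<MPI_Request, MPI_Request> internal_requests;  // user request -> internal request

// Intercepted MPI_Isend: forward the user message, then send the sender's
// critical-path info with another nonblocking send so the protocol is not
// strengthened. The internal request is completed later, in the MPI_Wait wrapper.
int MPI_Isend(const void* buf, int count, MPI_Datatype dt, int dest, int tag,
              MPI_Comm comm, MPI_Request* req){
  int ret = PMPI_Isend(buf, count, dt, dest, tag, comm, req);
  double* cp = new double(0.0);                // this process's critical-path cost would go here
  cp_payloads.push_back(cp);
  MPI_Request cp_req;
  PMPI_Isend(cp, 1, MPI_DOUBLE, dest, CP_TAG, comm, &cp_req);
  internal_requests[*req] = cp_req;            // looked up when MPI_Wait is intercepted
  return ret;
}
```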


huttered40 commented 4 years ago

Newsflash to my idiot self: the sender does not need critical path data from the receiver. Logically, the sender just sends and continues on with her work. The receiver is the only party that must get the critical path data, from the sender, because it is clearly dependent on the sender. Once a sender finishes sending (whatever protocol is used) he wipes his hands of any responsibility and continues on.

This does mean that the sender will need to post an MPI_Send (not an MPI_Isend, or it would need to check back on it, probably via an unnecessary MPI_Wait) to the receiver with his critical path info. How then will MPI_Recv be handled? Or MPI_Irecv+MPI_Wait?


Update: the first paragraph is all wrong. Critter is concerned with the parallel schedule, which can only be more constrained than the underlying data dependencies of the algorithm. Therefore, the sender is not just a node in a DAG: as part of the parallel schedule the user implemented and wants to investigate via critter, it can be dependent on the receiver, depending on the communication protocol being used.

The second paragraph still poses an interesting question: in the situation where the sender sends its critical path info via an MPI_Send matching the user's MPI_Send that we intercepted, and where the receiver uses MPI_Irecv+MPI_Wait, how will the receiver pick up the message from the sender? It can't occur in MPI_Irecv, as that would increase the limiting protocol from nonblocking to blocking. The receiver would have to be smart enough to realize that, although it doesn't know whether the sender sent its critical path info via MPI_Send or MPI_Isend, it will pick it up in MPI_Wait, while still posting an internal MPI_Irecv for this message. The receiver will wait on this after waiting on the user message, and will update its critical path after adding its local costs for this nonblocking routine. I think it makes more sense to do this after, rather than before as with synchronous communication.
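
Here is a receiver-side sketch of that idea, again with illustrative names and a single-double payload rather than critter's actual data structures. The MPI_Irecv wrapper posts the internal nonblocking receive without blocking; the MPI_Wait wrapper completes the user message first, then the internal payload, then adds local costs and takes the max.

```cpp
#include <mpi.h>
#include <map>

static const int CP_TAG = 32767;               // hypothetical internal tag
static double local_cp = 0.0;                  // this process's running critical-path cost

struct cp_recv { double payload; MPI_Request req; };
static std::map<MPI_Request, cp_recv*> internal_recvs;  // user request -> internal receive

// Intercepted MPI_Irecv: post the internal nonblocking receive for the sender's
// critical-path info, but do not block, so the user's protocol is unchanged.
int MPI_Irecv(void* buf, int count, MPI_Datatype dt, int src, int tag,
              MPI_Comm comm, MPI_Request* req){
  int ret = PMPI_Irecv(buf, count, dt, src, tag, comm, req);
  cp_recv* r = new cp_recv();
  PMPI_Irecv(&r->payload, 1, MPI_DOUBLE, src, CP_TAG, comm, &r->req);
  internal_recvs[*req] = r;
  return ret;
}

// Intercepted MPI_Wait: complete the user message, then the internal payload,
// then add local costs and take the max against the sender's critical path.
int MPI_Wait(MPI_Request* req, MPI_Status* status){
  auto it = internal_recvs.find(*req);         // look up before PMPI_Wait resets *req
  int ret = PMPI_Wait(req, status);
  if (it != internal_recvs.end()){
    PMPI_Wait(&it->second->req, MPI_STATUS_IGNORE);
    local_cp += /* local cost of this nonblocking routine */ 0.0;
    if (it->second->payload > local_cp) local_cp = it->second->payload;
    delete it->second;
    internal_recvs.erase(it);
  }
  return ret;
}
```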

Remember: the key word in critical path is dependence. Blocking communication does not force the sender to depend on the receiver.


huttered40 commented 4 years ago

A few comments come to mind:

  1. Critter will need to specify its own channel, via a tag integer, through which any extra data is sent. This is so that critter information in flight from sender to receiver does not get mixed up with the program's actual communication (e.g. if the sender posts two MPI_Sends with the same tag, and thus through the same channel on the same communicator). Not sure what this value should be, but it should ideally be something the user would never use. Include an assert that fails if the user uses a tag matching critter's internal tag.
  2. The sender should not use MPI_Isend to communicate its critical path information to the receiver, because then it would either need to wait on its completion by posting an MPI_Wait in the critter function (thus increasing the synchronization protocol requirements), or allocate new memory so that the buffer does not go out of scope (and thus corrupt the message) when the critter function that posted the MPI_Isend returns. If the MPI_Send implementation does not require a handshake, then this is just an extra message that will get buffered and return quickly since it's a small message. If the MPI_Send implementation performs a handshake, then we are not losing much by forcing an extra handshake before that process's computation resumes.
  3. At the start of each handshake communication protocol (not just MPI_Send), do we now need to check if we have a message in flight carrying critical path info? How would it know if it did but the corresponding MPI_Irecv or MPI_Recv was not called yet? Well, I guess it's not a dependency until the receiver needs it, at which point it is a dependency, so the answer is no. Therefore, if the receiver called some other MPI routine between when the sender sent its critical path data and when it posts a blocking receive, it shouldn't pose a problem. Now, what about stale critical path info? For example, what if the sender sends, and the receiver does more work involving more communication before posting the blocking receive? Then when we compare our critical paths to that of the sender, is it a fair comparison, when the receiver is presumably farther along in its execution relative to the progress of the sender when the data was sent?
  4. If the receiver posts an MPI_Irecv+MPI_Wait instead of a simple MPI_Recv, then critter will need to post a corresponding PMPI_Irecv for the critical path info from the sender, save that request in a map keyed by the request of the actual user receive, and then, when the corresponding MPI_Wait is posted, critter will know whether it was waiting on any critical path data because it will look the request up in the map to check if an entry exists.
  5. I should also change how MPI_Isend is intercepted, and essentially remove any assumptions about which p2p communication routines are paired. Critter cannot increase the communication protocol requirements, so if an MPI_Isend is posted, critter cannot post an MPI_Send to propagate its critical path info. It must post an MPI_Isend instead, but then, as mentioned above, it must save that data so that the message is not corrupted when critter's interception routine returns. The send must also be completed, and this is verified by saving that MPI_Request to an internal map, also described above, that is checked when the corresponding MPI_Wait is called.
  6. It looks like the istart1, istart2, istop1, and istop2 methods will need to be generalized further to support the ideas above. Now I think it makes sense to have separate start/stop functions for each communication protocol, i.e. start_synch(..), start_block, start_nonblock, and start_asynch. Any more?
  7. Check how this interacts with #20.
  8. So then what will critter's support for nonblocking collective communication look like? I assume it must make use of start_nonblock and stop_nonblock, and it's a use case showing that not all communication using those two functions is p2p. As I keep saying, critter must not increase the communication protocol requirements as part of its interception duties. Therefore, the only routine that would work here to propagate the critical paths is PMPI_Iallreduce (see the sketch after this list). Again, as described above, each process involved would save the corresponding request in a map with the key being the MPI_Request of the actual user routine. As all nonblocking communication must be completed by some variant of MPI_Wait, these variants should all check whether an entry exists inside critter's internal_req_map (and if not, assert, because this should never happen) and, before dealing with critical path bookkeeping, PMPI_Wait on that message. The concern about the data being stale, as described in 3. above, still applies here I think.
  9. Think about whether everything discussed above can simplify critter's handling of MPI_Waitall, MPI_Waitany, and MPI_Wait.
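
To make item 8 concrete, here is a sketch of how a nonblocking collective interception could propagate critical paths without increasing the protocol requirements, via a matching PMPI_Iallreduce with MPI_MAX. The payload and map are illustrative assumptions; the corresponding MPI_Wait variant would complete the internal request through internal_req_map, as in the p2p sketches above.

```cpp
#include <mpi.h>
#include <map>

struct cp_coll { double local_cp; double max_cp; MPI_Request req; };
static std::map<MPI_Request, cp_coll*> internal_req_map;  // user request -> internal state

// Intercepted MPI_Iallreduce: forward the user call, then reduce the
// critical-path measure with another nonblocking allreduce (MPI_MAX). The
// internal request is completed in the corresponding MPI_Wait variant.
int MPI_Iallreduce(const void* sendbuf, void* recvbuf, int count, MPI_Datatype dt,
                   MPI_Op op, MPI_Comm comm, MPI_Request* req){
  int ret = PMPI_Iallreduce(sendbuf, recvbuf, count, dt, op, comm, req);
  cp_coll* c = new cp_coll();
  c->local_cp = 0.0;   // this process's running critical-path cost would go here
  PMPI_Iallreduce(&c->local_cp, &c->max_cp, 1, MPI_DOUBLE, MPI_MAX, comm, &c->req);
  internal_req_map[*req] = c;
  return ret;
}
```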

Update: I just want to address 3. above. It will never be the case that a receiver is waiting on an internal critical path message that hasn't been posted yet, as it's posted essentially at the same time as the user routine. Also, for communication requiring a handshake (i.e. synchronous), the critical path propagation takes place before the user routine's communication, which gives processes the opportunity to leave the user communication routine early and not have to wait both before and after. This works only because each process doesn't need to know which process in the communicator took the longest to finish, since it's not their concern. They are clearly not dependent on it if they are allowed to leave early.
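
For completeness, a sketch of that synchronous case, with illustrative names only and assuming the receiver's wrapper posts the matching PMPI_Sendrecv: the critical-path exchange happens before the user communication, and nothing is added afterwards, so each process can leave the user routine independently.

```cpp
#include <mpi.h>
#include <algorithm>

static const int CP_TAG = 32767;   // hypothetical internal tag
static double local_cp = 0.0;      // this process's running critical-path cost

// Intercepted MPI_Ssend: the user routine already requires a handshake, so
// exchanging critical paths up front (as a form of barrier) does not strengthen
// the protocol; no synchronization is added after the user call.
int MPI_Ssend(const void* buf, int count, MPI_Datatype dt, int dest, int tag, MPI_Comm comm){
  double peer_cp = 0.0;
  PMPI_Sendrecv(&local_cp, 1, MPI_DOUBLE, dest, CP_TAG,
                &peer_cp, 1, MPI_DOUBLE, dest, CP_TAG, comm, MPI_STATUS_IGNORE);
  local_cp = std::max(local_cp, peer_cp);   // propagate before the user communication
  return PMPI_Ssend(buf, count, dt, dest, tag, comm);
}
```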


huttered40 commented 4 years ago

As to the concern about the critical path data being stale, it's important to realize that the data received by the receiving process is not truly a dependency until the blocking MPI_Wait is called and the receiver actually needs it. This supports the view that the critical path data does not get stale, and that we can act as if it were p2p communication requiring synchronization, because it's not a dependency until it's needed. I think this will also give a much more accurate view of the critical paths.

huttered40 commented 4 years ago

Fixed.