Note that this goes along with the understanding that blocking communication, as in MPI_Send, does not guarantee that the message was received. It only guarantees that the send buffer can now be modified without corrupting the message in flight to the receiver. MPI_Recv's communication protocol is implicitly stricter than MPI_Send's blocking protocol, as it returns only when the message has been received, yet it is still under the umbrella of blocking. This is obviously much stricter than MPI_Irecv, which doesn't require the message to be received at all before it returns, and it is on the same level as MPI_Wait.
Possible scenarios (not all, but the relevant ones):

1. MPI_Send + MPI_Recv
2. (MPI_Isend + MPI_Wait) + (MPI_Irecv + MPI_Wait)
3. MPI_Send + (MPI_Irecv + MPI_Wait) (sketched below)

Scenarios 1 and 2 are handled correctly; 3 is not.
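For concreteness, here is a minimal two-rank sketch of scenario 3. The buffer names and sizes are made up, but the call pattern is exactly the one in question: a blocking MPI_Send on one side and MPI_Irecv + MPI_Wait on the other.

```cpp
// Minimal two-rank sketch of scenario 3: MPI_Send + (MPI_Irecv + MPI_Wait).
// This is plain user code; under an MPI_Send interception that expects a
// matching MPI_Recv for its internal handshake, rank 0 hangs.
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  double buf = 0.0;
  if (rank == 0) {
    buf = 42.0;
    MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);         // blocking send
  } else if (rank == 1) {
    MPI_Request req;
    MPI_Irecv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);  // nonblocking receive
    MPI_Wait(&req, MPI_STATUS_IGNORE);                           // completion
  }
  MPI_Finalize();
  return 0;
}
```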
Check what the problem is with having MPI_Send wait in critter::stop. It hangs there, but why is that bad? How does critter::istop2, via interception of MPI_Wait, differ?
Update: none of the above scenarios are handled correctly, as critter currently uses a more limiting communication protocol with which to propagate critical path data.
Here is what I posted on Mattermost that further summarizes the problem:
Regarding the new bug in critter I found by using CANDMC, the issue arises when an MPI_Send is posted. Critter intercepts it and does a handshake with the receiver that posted MPI_Recv (via MPI_Sendrecv) to propagate the critical path information. This hasn't been an issue before because the p2p communication protocol has been either 1) MPI_Send + MPI_Recv or 2) (MPI_Isend + MPI_Wait) + (MPI_Irecv + MPI_Wait), where for the latter the critical path propagation occurs in MPI_Wait. However, CANDMC's bit-tree tsqr trailing matrix update routine uses (MPI_Send) + (MPI_Irecv + MPI_Wait), and thus it deadlocks when MPI_Send is intercepted, as it waits for the corresponding MPI_Recv to be posted. So the issue arises essentially anytime an MPI_Send is posted, because I now think that even protocol 1) above is wrong semantically: critter shouldn't force the MPI_Send to wait for a handshake, as that imposes a more limiting communication protocol than the user's communication routine.
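To make the failure mode concrete, here is a rough sketch of an MPI_Send interception that forces a handshake. This is not critter's actual source; critical_path_costs, NUM_METRICS, and INTERNAL_TAG are placeholder names.

```cpp
// Hedged sketch of the problematic interception: the MPI_Send wrapper does a
// PMPI_Sendrecv handshake to exchange critical path data, which silently
// assumes the receiver is sitting in an intercepted MPI_Recv.
#include <mpi.h>

static const int NUM_METRICS = 8;                 // assumed size of the path data
static const int INTERNAL_TAG = 32767;            // assumed critter-internal tag
static double critical_path_costs[NUM_METRICS];   // assumed per-process path data
static double remote_path_costs[NUM_METRICS];

extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype dt,
                        int dest, int tag, MPI_Comm comm) {
  // This handshake raises MPI_Send's protocol to a synchronous one: it only
  // returns once the peer posts the matching internal call, which the
  // blocking-but-not-synchronous MPI_Send never promised.
  PMPI_Sendrecv(critical_path_costs, NUM_METRICS, MPI_DOUBLE, dest, INTERNAL_TAG,
                remote_path_costs, NUM_METRICS, MPI_DOUBLE, dest, INTERNAL_TAG,
                comm, MPI_STATUS_IGNORE);
  return PMPI_Send(buf, count, dt, dest, tag, comm);
}
```

In scenario 3 the receiver never enters an MPI_Recv interception that would post the matching half of this handshake, which is exactly the hang seen in CANDMC.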
The goal I had a few weeks back, when debugging and fixing critter's nonblocking communication support, was to only propagate the critical path when the communication protocol required an explicit handshake. MPI_Send is one outlier I didn't think of: it is blocking, but doesn't require a handshake, so ideally I wouldn't propagate the critical path to the receiver at this point. I need to think of a way to still get that data to the receiver, though, because I don't want critter to give incorrect critical path data.
Edgar gave a potential fix: have the sender post another send via PMPI_Send, which the receiver will somehow expect, containing the sender's critical path information.
Mull on this.
Update: not strictly a PMPI_Send, but a send variant that maintains the same communication protocol as the routine it's intercepting.
To continue expounding upon Edgar's idea above, I think it makes more sense to post an MPI_Irecv immediately in the MPI_Send interception (without posting an MPI_Sendrecv), because then we can save the corresponding MPI_Request into a map and wait on it later (via MPI_Wait), either when we next enter a handshake communication protocol or at the end of the program.
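A minimal sketch of that deferred-wait bookkeeping, under assumptions: the names outstanding_path_recvs, NUM_METRICS, and INTERNAL_TAG are made up, a flat container stands in for the map mentioned above, and the idea is later revised, so treat this only as an illustration.

```cpp
#include <mpi.h>
#include <vector>

static const int NUM_METRICS = 8;        // assumed size of the path data
static const int INTERNAL_TAG = 32767;   // assumed critter-internal tag
static double critical_path_costs[NUM_METRICS];

// Outstanding internal receives of the peer's critical path data, to be
// completed later (at the next handshake routine or at the end of the run).
struct pending_path_recv { MPI_Request req; double data[NUM_METRICS]; };
static std::vector<pending_path_recv*> outstanding_path_recvs;

extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype dt,
                        int dest, int tag, MPI_Comm comm) {
  // Post the internal receive without blocking, instead of a PMPI_Sendrecv
  // handshake, and remember the request so it can be waited on later. The
  // matching send would come from the receiver's own interception.
  pending_path_recv* p = new pending_path_recv();
  PMPI_Irecv(p->data, NUM_METRICS, MPI_DOUBLE, dest, INTERNAL_TAG, comm, &p->req);
  outstanding_path_recvs.push_back(p);
  return PMPI_Send(buf, count, dt, dest, tag, comm);
}
```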
This raises two new questions: 1) if a process hits a communication routine requiring a handshake protocol and has k unprocessed requests, how can we faithfully check against our critical paths when the receiver's critical path information has grown stale? We'd almost have to save our process's critical path information in separate buckets, according to the state before each request took place, and then compare in sequence. Or would that break the protocol of not waiting on a handshake?
2) Does this interact, if at all, with critter's support for nonblocking communication?
Update: none of this is relevant anymore. The sender in a MPI_Send interception will never update its critical path based on that of the receiver, because the sender in blocking protocol is technically not dependent on the receiver in this parallel schedule. In a synchronous send, it will call a PMPI_Sendrecv as a form of barrier and will then propagate critical paths with its receiver before the user communication actually takes place. As even with synchronous p2p communication the two processes can leave the call before the other, we need not synchronize after the call. In a MPI_ISend interception, the sender will call another MPI_Isend with its critical path info (after the call returns so it includes its communication time for the Isend) and add the request to internal_requests
, to be checked on in the corresponding MPI_Wait. The _critter class can take a vector of pointers to hold outstanding messages holding the process's critical path info for nonblocking sends, thus allowing the internal_requests
global variable to only save the corresponding index and the pointer to the class itself.
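A hedged sketch of the two sender-side cases just described. The names (_critter here is heavily simplified, internal_requests, NUM_METRICS, INTERNAL_TAG) are placeholders, the timing and local-cost bookkeeping is omitted, and the propagation rule is reduced to an element-wise max.

```cpp
#include <mpi.h>
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

static const int NUM_METRICS = 8;        // assumed size of the path data
static const int INTERNAL_TAG = 32767;   // assumed critter-internal tag
static double critical_path_costs[NUM_METRICS];

// Simplified stand-in for the _critter class: it owns the buffers for
// in-flight path-info messages belonging to nonblocking sends.
struct _critter {
  std::vector<double*> outstanding_bufs;
  std::vector<MPI_Request> outstanding_reqs;
};
static _critter isend_critter;
// user request -> (index into the owning class's buffers, pointer to the class)
static std::map<MPI_Request, std::pair<int, _critter*>> internal_requests;

// Synchronous send: PMPI_Sendrecv acts as a barrier and propagates critical
// paths with the receiver *before* the user communication; no synchronization
// is needed afterwards.
extern "C" int MPI_Ssend(const void* buf, int count, MPI_Datatype dt,
                         int dest, int tag, MPI_Comm comm) {
  double remote[NUM_METRICS];
  PMPI_Sendrecv(critical_path_costs, NUM_METRICS, MPI_DOUBLE, dest, INTERNAL_TAG,
                remote, NUM_METRICS, MPI_DOUBLE, dest, INTERNAL_TAG,
                comm, MPI_STATUS_IGNORE);
  for (int i = 0; i < NUM_METRICS; ++i)   // placeholder propagation rule
    critical_path_costs[i] = std::max(critical_path_costs[i], remote[i]);
  return PMPI_Ssend(buf, count, dt, dest, tag, comm);
}

// Nonblocking send: ship the path data with the same nonblocking protocol and
// remember the internal request under the user's request handle. Buffers are
// reclaimed when the internal request completes (not shown).
extern "C" int MPI_Isend(const void* buf, int count, MPI_Datatype dt, int dest,
                         int tag, MPI_Comm comm, MPI_Request* request) {
  int ret = PMPI_Isend(buf, count, dt, dest, tag, comm, request);
  double* payload = new double[NUM_METRICS];
  std::copy(critical_path_costs, critical_path_costs + NUM_METRICS, payload);
  MPI_Request internal_req;
  PMPI_Isend(payload, NUM_METRICS, MPI_DOUBLE, dest, INTERNAL_TAG, comm, &internal_req);
  isend_critter.outstanding_bufs.push_back(payload);
  isend_critter.outstanding_reqs.push_back(internal_req);
  internal_requests[*request] =
      std::make_pair((int)isend_critter.outstanding_bufs.size() - 1, &isend_critter);
  return ret;
}
```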
Newsflash to my idiot self: The sender does not need critical path data from the receiver. Logically, the sender just sends and continues on with her work. The receiver is the only party that must get the critical path data from the sender, because it is clearly dependent on the sender and the sent message is a dependency. Once a sender finishes sending (whatever protocol is used) he wipes his hands of any responsibility and continues on.
This then means that the sender will need to post an MPI_Send (not an MPI_Isend, or it will need to check back up on it, probably via an unnecessary MPI_Wait?) to the receiver again with his critical path info. How then will the MPI_Recv be handled? Or the MPI_Irecv+MPI_Wait?
Update: the first paragraph is all wrong. Critter is concerned with the parallel schedule, which can only be more constrained for parallelism than the underlying data dependencies of the algorithm. Therefore, the sender is not just a node in a DAG, but as part of the user's implemented parallel schedule that they want to investigate via critter, it can be dependent on the receiver depending on the communication protocol being used.
The second paragraph still poses an interesting question: in the situation where the sender sends its critical path info via an MPI_Send matching the user's MPI_Send that we intercepted, and where the receiver uses an MPI_Irecv+MPI_Wait, how will the receiver pick up the message from the sender? It can't occur in MPI_Irecv, as that would raise the limiting protocol from nonblocking to blocking. The receiver has to be smart enough to realize that, although it doesn't know whether the sender shipped its critical path info via MPI_Send or MPI_Isend, it should still post an internal MPI_Irecv for this message and pick it up in MPI_Wait. The receiver will wait on this after waiting on the user message, and will update its critical path after adding its local costs for this nonblocking routine. I think it makes more sense to do this after rather than before, as is done with synchronous communication.
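A hedged sketch of that receiver-side handling, with the same placeholder names as above and the critical path update reduced to an element-wise max. MPI_Irecv stays nonblocking: it only posts an extra internal PMPI_Irecv; MPI_Wait completes the user message first, then the internal one, and only then touches the critical path.

```cpp
#include <mpi.h>
#include <algorithm>
#include <map>

static const int NUM_METRICS = 8;        // assumed size of the path data
static const int INTERNAL_TAG = 32767;   // assumed critter-internal tag
static double critical_path_costs[NUM_METRICS];

struct pending_recv { MPI_Request internal_req; double path[NUM_METRICS]; };
// user request -> internal receive of the sender's critical path data
static std::map<MPI_Request, pending_recv> internal_recv_requests;

extern "C" int MPI_Irecv(void* buf, int count, MPI_Datatype dt, int src,
                         int tag, MPI_Comm comm, MPI_Request* request) {
  int ret = PMPI_Irecv(buf, count, dt, src, tag, comm, request);
  // Post the matching internal receive without blocking; this works whether
  // the sender shipped its path data via MPI_Send or MPI_Isend.
  pending_recv& p = internal_recv_requests[*request];
  PMPI_Irecv(p.path, NUM_METRICS, MPI_DOUBLE, src, INTERNAL_TAG, comm, &p.internal_req);
  return ret;
}

extern "C" int MPI_Wait(MPI_Request* request, MPI_Status* status) {
  MPI_Request user_req = *request;        // PMPI_Wait overwrites the handle
  int ret = PMPI_Wait(request, status);   // complete the user message first
  auto it = internal_recv_requests.find(user_req);
  if (it != internal_recv_requests.end()) {
    PMPI_Wait(&it->second.internal_req, MPI_STATUS_IGNORE);
    // Fold the sender's critical path into the local one; the real update
    // would add this routine's local costs first.
    for (int i = 0; i < NUM_METRICS; ++i)
      critical_path_costs[i] = std::max(critical_path_costs[i], it->second.path[i]);
    internal_recv_requests.erase(it);
  }
  return ret;
}
```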
Remember: the keyword in critical path is dependent. Blocking communication does not force dependence for the sender on the receiver.
A few comments come to mind:

- start_synch(..), start_block, start_nonblock, and start_asynch. Any more?
- start_nonblock and stop_nonblock, and it's a use case in which not all communication going through those two functions is p2p. So, as I keep saying over and over, critter must not increase the communication protocol requirements as part of its duties when intercepting calls. Therefore, the only routine that would work here to propagate the critical paths would be PMPI_Iallreduce (a sketch follows this discussion). Again, as described above, each process involved would save the corresponding request in a map keyed by the MPI_Request ID of the actual user request. As all nonblocking communication must get completed by some variant of MPI_Wait, these variants should all check whether there exists an entry inside critter's internal_req_map (and if not, assert, because this should never happen) and, before dealing with critical path stuff, PMPI_Wait on that message. The idea of the data being stale, as described in 3) above, is still a concern here, I think.

Update: I just want to address 3. above. It will never be the case that a receiver is waiting on a critical path internal message that hasn't been posted yet, as it is posted essentially at the same time as the user routine. Also, for communication requiring a handshake (i.e. synchronous), the critical path propagation takes place before the user routine communication, which gives processes the opportunity to leave the user communication routine early and not have to wait both before and after. This works only because each process doesn't need to know which process in the communicator took the longest to finish, since that is not its concern. It is clearly not dependent on that process if it is allowed to leave early.
As to the concern about the critical path data being stale, it's important to realize that the data received by the receiving process is not truly a dependency until the blocking MPI_Wait is called and the receiver actually needs it. This supports the view that the critical path data does not get stale, and we can act as if it were p2p requiring synchronization, because it is not a dependency until it is needed. I think this will also give a much more accurate view of the critical paths.
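For the nonblocking-collective case raised in the PMPI_Iallreduce comment above, a hedged sketch might look as follows. The names internal_req_map and NUM_METRICS are placeholders, and MPI_Ibcast is chosen only as a representative nonblocking collective; the corresponding wait variant would then look up internal_req_map, assert the entry exists, and PMPI_Wait on it before updating the critical path, mirroring the p2p sketch above.

```cpp
#include <mpi.h>
#include <map>

static const int NUM_METRICS = 8;        // assumed size of the path data
static double critical_path_costs[NUM_METRICS];

struct pending_collective { MPI_Request internal_req; double reduced[NUM_METRICS]; };
// user request -> internal critical path reduction that a later MPI_Wait
// variant must complete before updating the critical path
static std::map<MPI_Request, pending_collective> internal_req_map;

extern "C" int MPI_Ibcast(void* buf, int count, MPI_Datatype dt, int root,
                          MPI_Comm comm, MPI_Request* request) {
  int ret = PMPI_Ibcast(buf, count, dt, root, comm, request);
  pending_collective& p = internal_req_map[*request];
  // Same nonblocking protocol as the user routine: no handshake is added,
  // only a second nonblocking collective on the same communicator.
  PMPI_Iallreduce(critical_path_costs, p.reduced, NUM_METRICS, MPI_DOUBLE,
                  MPI_MAX, comm, &p.internal_req);
  return ret;
}
```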
Fixed.
After extensive debugging, I think the reason why candmc::caqr_tree is hanging in apply_QT.cxx is that critter does not know how to handle the MPI_Send + MPI_Irecv + MPI_Wait logic. I changed the MPI calls to PMPI in that file, and it ran with no hangs. When MPI_Send is intercepted, it expects an MPI_Recv to accompany it. This is not possible in the scenario above, and it just waits there, as the MPI_Irecv interception does not go into the stop() method like MPI_Send does. How can we fix this?