Closed huttered40 closed 5 years ago
After further investigation, this bug is more complex than I originally thought.
Even after saving the communicator and src/dest process information for each request, we get a hang. This must be because the ordering in which the MPI_Waitany is called is not known by each process. For example, there could be a situation in which process 1 is waiting on process 2, process 2 is waiting on process 3, and process 3 is waiting on process 1.
It's deadlocking.
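To make the circular wait described above concrete, here is a minimal sketch that checks a "waits on" relation for a cycle. It does not use MPI. The helper `has_wait_cycle` and the rank-to-rank map are hypothetical names introduced only for illustration; they are not part of critter.

```cpp
#include <map>
#include <set>

// Hypothetical helper (not part of critter): waits_on[p] == q means
// rank p cannot proceed until rank q makes progress.
bool has_wait_cycle(const std::map<int, int>& waits_on) {
    for (const auto& entry : waits_on) {
        std::set<int> seen;
        int cur = entry.first;
        // Follow the chain of waits starting from this rank.
        while (waits_on.count(cur) != 0) {
            if (!seen.insert(cur).second) {
                return true;  // revisited a rank: circular wait
            }
            cur = waits_on.at(cur);
        }
    }
    return false;
}
```

With the relation 1 -> 2, 2 -> 3, 3 -> 1 from the example above, the function reports a cycle; removing the 3 -> 1 entry makes it report none.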
The only idea I can think of actually fixes #25, but I'm not sure if it's exactly correct: we perform all the PMPI_Waitany calls, and then iterate over istop2 in order of smallest request ID (since std::map sorts by key by default, because it's implemented as a binary search tree, iterating over the keys would achieve this).
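As a small illustration of the ordering property this idea relies on, the sketch below shows that iterating a std::map visits keys in ascending order regardless of insertion order. The `int` request IDs and `std::string` payload are stand-ins chosen for the example; the actual key and value types in critter are not shown in this thread.

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative only: int keys stand in for request IDs, and the string
// payload stands in for whatever per-request data is stored.
std::vector<int> ids_in_iteration_order(const std::map<int, std::string>& requests) {
    std::vector<int> ids;
    // std::map iterates in ascending key order.
    for (const auto& kv : requests) {
        ids.push_back(kv.first);
    }
    return ids;
}
```

Inserting IDs 7, 3, 5 in that order and calling the function yields 3, 5, 7.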
Re-opening because on Stampede2 on 8 nodes with 64 ppn, we get another hang with ctf/examples/matmul. Note that there is no hang for the exact same problem size on 1 node and 64 nodes with 64 ppn.
Confirmed there is no hang if I comment out the compute_all_crit call from within the istop2 method.
This immediately tells me that the idea I thought of and implemented above is not correct. I cannot simply iterate over the map's request keys and prevent hangs.
Next idea: sort the map separately by the destination rank (4th member in the tuple I think).
This idea seems to have worked. Closing tentatively.
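The sort-by-destination-rank idea could be sketched roughly as follows. This is an assumption-laden illustration, not the actual critter code: the `RequestInfo` tuple layout is hypothetical except that the destination rank sits in the 4th position (as guessed above), and `int` request IDs stand in for the real request handles.

```cpp
#include <algorithm>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical layout: only the 4th member (destination rank) is taken
// from the discussion; the first three members are placeholders.
using RequestInfo = std::tuple<int, int, int, int>;

// Returns request IDs ordered by destination rank (ties broken by ID).
std::vector<int> request_ids_by_dest(const std::map<int, RequestInfo>& requests) {
    std::vector<std::pair<int, int>> by_dest;  // (dest rank, request ID)
    for (const auto& kv : requests) {
        by_dest.emplace_back(std::get<3>(kv.second), kv.first);
    }
    // Pairs compare lexicographically: dest rank first, then request ID.
    std::sort(by_dest.begin(), by_dest.end());
    std::vector<int> ids;
    for (const auto& p : by_dest) {
        ids.push_back(p.second);
    }
    return ids;
}
```

The point of sorting on a value every process can agree on is that the requests get processed in a consistent order, rather than in request-ID order, which the hang above showed was not sufficient.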
After applying a quick fix to #21, I get a hang, and I think this must be because of the way the old critter code tracks the most-recent communicators (last_cm) and processes from sends/recvs (last_nbr_pe).
Basically, when more than one nonblocking request is active, critter only tracks the most recent one and forgets about the earlier ones. This further supports the approach described in #20, in which the request map uses keys to point not just to the critter routine, but also to these "last" process variables and a few other pieces of information.
This will then also require changes to how critter::stop references these variables, but in #20, we detail istart1, istart2, istop1 and istop2 methods.
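A minimal sketch of the per-request bookkeeping described above, assuming a map keyed by request ID that stores the communicator and neighbor process for each outstanding request instead of a single last_cm / last_nbr_pe pair. The names `PendingRequest` and `RequestTracker` are hypothetical, and plain `int` values stand in for the MPI request and communicator handles so the example does not need MPI.

```cpp
#include <map>

// Hypothetical per-request record; int stands in for MPI handles.
struct PendingRequest {
    int comm_id;  // communicator the request was posted on
    int nbr_pe;   // partner process of the send/recv
};

struct RequestTracker {
    std::map<int, PendingRequest> pending;  // keyed by request ID

    // Record the state when a nonblocking call is posted.
    void start(int request_id, int comm_id, int nbr_pe) {
        pending[request_id] = PendingRequest{comm_id, nbr_pe};
    }

    // Retrieve and drop the state when the request completes.
    // Throws std::out_of_range if the request was never recorded.
    PendingRequest stop(int request_id) {
        PendingRequest info = pending.at(request_id);
        pending.erase(request_id);
        return info;
    }
};
```

With two requests posted back to back, completing the first one still returns its own communicator and neighbor, which a single "most recent" variable would have overwritten.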