huttered40 / critter

Critical path analysis of MPI parallel programs
BSD 2-Clause "Simplified" License

Hang bug in MPI_Wait routines calling critter::stop() #22

Closed by huttered40 5 years ago

huttered40 commented 5 years ago

After applying a quick fix to #21, I get a hang, and I think this must be because of the way the old critter code tracks only the most recent communicator (last_cm) and the most recent send/recv partner process (last_nbr_pe).

Basically, when more than one nonblocking request is active, critter only tracks the most recent one and forgets about the earlier ones. This further supports the approach described in #20, in which the request map's keys point not just to the critter routine but also to these "last" process variables and a few other pieces of information.

This will then also require changes to how critter::stop references these variables; the istart1, istart2, istop1, and istop2 methods are detailed in #20.
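Here is a minimal sketch of that per-request map, under hypothetical names (RequestInfo, record_request, pop_request); only last_cm, last_nbr_pe, and the istart/istop method names come from this issue and #20, the rest is illustrative:

```cpp
// Hypothetical sketch, not critter's actual code: keep per-request state
// keyed by the MPI_Request handle, instead of the single globals
// last_cm / last_nbr_pe that get overwritten when more than one
// nonblocking request is in flight.
#include <mpi.h>
#include <map>

struct RequestInfo {
  MPI_Comm comm;     // communicator the nonblocking call used
  int      partner;  // src/dest rank of the send/recv
  bool     is_send;  // direction, in case the stop logic needs it
};

// Keyed by the request handle so the MPI_Wait* wrappers can recover the state.
static std::map<MPI_Request, RequestInfo> request_map;

// Called from an MPI_Isend/MPI_Irecv wrapper (the istart1/istart2 role).
void record_request(MPI_Request req, MPI_Comm comm, int partner, bool is_send) {
  request_map[req] = RequestInfo{comm, partner, is_send};
}

// Called from an MPI_Wait* wrapper (the istop1/istop2 role). The handle must
// be captured before PMPI_Wait* runs, since completion resets it to
// MPI_REQUEST_NULL. Assumes the request was recorded by record_request.
RequestInfo pop_request(MPI_Request req) {
  auto it = request_map.find(req);
  RequestInfo info = it->second;
  request_map.erase(it);
  return info;
}
```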

huttered40 commented 5 years ago

After further investigation, this bug is more complex than I originally thought.

Even after saving the communicator and src/dest process information for each request, we get a hang. This must be because each process does not know the order in which the other processes' MPI_Waitany calls complete. For example, there could be a situation in which process 1 is waiting on process 2, process 2 is waiting on process 3, and process 3 is waiting on process 1.

It's deadlocking.
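A minimal sketch of how such a wait-for cycle can arise, assuming the wait wrapper does a blocking pairwise exchange of path data with the completed request's partner (the RequestInfo/lookup_request names and the MPI_Sendrecv exchange are illustrative assumptions, not critter's actual stop logic):

```cpp
// Illustration of the hang, not a fix. If the wrapper synchronizes with the
// partner of whichever request completed first locally, ranks can pick
// partners in incompatible orders: rank 1 exchanges with 2, rank 2 with 3,
// rank 3 with 1, and nobody's chosen partner is doing the matching exchange.
#include <mpi.h>
#include <vector>

struct RequestInfo { MPI_Comm comm; int partner; };  // assumed per-request state
RequestInfo lookup_request(MPI_Request req);         // saved at Isend/Irecv time

int waitany_wrapper(int count, MPI_Request reqs[], int* index, MPI_Status* status) {
  std::vector<MPI_Request> saved(reqs, reqs + count);  // handles before completion
  int ret = PMPI_Waitany(count, reqs, index, status);
  RequestInfo info = lookup_request(saved[*index]);

  // Blocking exchange with the completed request's partner. That partner may
  // have seen a *different* request complete first and be blocked in an
  // exchange with someone else -- a wait-for cycle, i.e. deadlock.
  double local_cost = 0.0, partner_cost = 0.0;
  MPI_Sendrecv(&local_cost,   1, MPI_DOUBLE, info.partner, /*sendtag=*/999,
               &partner_cost, 1, MPI_DOUBLE, info.partner, /*recvtag=*/999,
               info.comm, MPI_STATUS_IGNORE);
  return ret;
}
```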

The only idea I can think of actually fixes #25, but I'm not sure whether it's exactly correct: perform all of the PMPI_Waitany calls first, and only then call istop2 for each request, in order of smallest request ID. Since std::map sorts by key by default (it is implemented as a binary search tree), iterating over the keys achieves this.
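A sketch of this ordering idea under assumed names (RequestInfo, request_map, and the istop2 signature are placeholders for the bookkeeping described in #20, not necessarily critter's real internals); as the later comments show, it turned out not to be sufficient:

```cpp
// Drain all completions first, then do the per-request bookkeeping by walking
// the completed entries in ascending map-key order. std::map keeps its keys
// sorted (it is a balanced binary search tree), so plain iteration gives that
// order on this rank.
#include <mpi.h>
#include <map>
#include <vector>

struct RequestInfo { MPI_Comm comm; int partner; };
extern std::map<MPI_Request, RequestInfo> request_map;  // filled at Isend/Irecv time
void istop2(const RequestInfo& info);                   // per-request critical-path bookkeeping

void waitall_wrapper(int count, MPI_Request reqs[], MPI_Status stats[]) {
  std::vector<MPI_Request> saved(reqs, reqs + count);   // handles before completion

  // Step 1: perform all the PMPI_Waitany calls, with no bookkeeping yet.
  // Assumes all count requests are active and were recorded, and that the
  // caller passed a real status array (not MPI_STATUSES_IGNORE).
  std::map<MPI_Request, RequestInfo> completed;
  for (int done = 0; done < count; ++done) {
    int idx; MPI_Status st;
    PMPI_Waitany(count, reqs, &idx, &st);
    stats[idx] = st;
    auto it = request_map.find(saved[idx]);
    completed.insert(*it);
    request_map.erase(it);
  }

  // Step 2: bookkeeping in ascending request-key order, independent of the
  // order in which the requests happened to complete on this rank.
  for (const auto& kv : completed) istop2(kv.second);
}
```

One caveat: MPI_Request handles are opaque and rank-local, so the key order seen on one rank need not match the order its partners see, which is consistent with the hang reported below.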

huttered40 commented 5 years ago

Re-opening because on Stampede2 on 8 nodes with 64 ppn, we get another hang with ctf/examples/matmul. Note that there is no hang for the exact same problem size on 1 node and 64 nodes with 64 ppn.

huttered40 commented 5 years ago

Confirmed: there is no hang if I comment out the compute_all_crit call from within the istop2 method.

This immediately tells me that the idea I thought of and implemented above is not correct. I cannot simply iterate over the map's request keys and prevent hangs.

huttered40 commented 5 years ago

Next idea: sort the map separately by the destination rank (the 4th member of the tuple, I think).
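A small sketch of this sorting idea, with an assumed tuple layout (the issue only says the destination rank is roughly the 4th member) and a placeholder istop2 signature:

```cpp
// Order the bookkeeping by the destination rank stored in each entry rather
// than by the request handle, so that communicating ranks are more likely to
// agree on the processing order.
#include <mpi.h>
#include <algorithm>
#include <tuple>
#include <vector>

// (routine id, communicator, source rank, destination rank) -- assumed layout.
using RequestEntry = std::tuple<int, MPI_Comm, int, int>;
void istop2(const RequestEntry& entry);   // per-request bookkeeping, signature assumed

void process_completed(std::vector<RequestEntry>& completed) {
  std::sort(completed.begin(), completed.end(),
            [](const RequestEntry& a, const RequestEntry& b) {
              return std::get<3>(a) < std::get<3>(b);   // 4th tuple member
            });
  for (const RequestEntry& e : completed) istop2(e);
}
```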

huttered40 commented 5 years ago

This idea seems to have worked. Closing tentatively.