Open jsquyres opened 7 years ago
One possibility - we do have the `--kill-on-bad-exit` option set on the `srun` cmd line. It could be that somehow that is getting invoked, which would explain where the signal 9 is coming from. We could remove that option in a test branch and see if it makes a difference, if you want to try?
Over the past week or so, Cisco's MTT runs had 100% hangs (i.e., timeouts) across master, v2.0.x, v2.1.x, and v3.0.x. Jeff+Ralph narrowed the problem down to the use of `--leave-session-attached` in Cisco's MTT setup (which had been a recent addition, in an attempt to track down a different problem). Removing `--leave-session-attached` fixed the timeouts.

We did a bunch of investigation to try to figure out why `--leave-session-attached`
was hanging. Here's some random notes (in no particular order) of what we found:

- `salloc` did reproduce the problem 100% of the time (but still: running the same `/path/to/mpirun --leave-session-attached ...` commands manually inside that same `salloc` did not reproduce the problem. Maddening!).
- `salloc` and manually invoking the MTT client to MPI get, MPI install, Test get, Test build, and then repeatedly invoking the MTT client to Test run (e.g., the trivial tests -- which are especially helpful because they have a short MTT timeout).
- `salloc` or `sbatch` (one key difference being the location of `mpirun`: on the head node, or on the first node of the allocation).
- `mpirun` is still running, and it has forked an `srun` to launch the remote `orted`s. However, no `orted`s are running on remote nodes. ...but the `srun` is still running. Totally weird.
- `srun` itself somehow hung and never launched anything on the remote nodes.
- `slurmd` in foreground, verbose mode. We did get a clue here, but don't yet know what to make of it. Here's the log from the foreground `slurmd` when an MTT `mpirun` was invoked (this was the 10th `mpirun` that had run in this particular MTT run):

What's this RPC request to signal the tasks? We can see that it's sending signal 9 -- but who did that? And why? And then why did `srun` just hang?

This is probably a good place to start with when resuming the investigation.
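
Since the hang only showed up after several consecutive launches (the 10th `mpirun` in the run above), a small repeat-until-hang harness might help when resuming. The following is only a rough sketch, not anything MTT does: `run_until_hang` and its parameters are hypothetical names, and the command to run is whatever launch line is under test. On a timeout it deliberately leaves the child process alive, so the stuck `mpirun` / `srun` pair can be inspected before anything gets killed.

```python
import subprocess
import sys


def run_until_hang(cmd, timeout_s, max_runs):
    """Launch cmd up to max_runs times, one run after another.

    Returns (run_index, proc) for the first run that has not exited
    within timeout_s seconds; proc is left running for inspection.
    Returns (None, None) if every run exits in time.
    """
    for i in range(1, max_runs + 1):
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Do NOT kill the child here: the point is to look at the
            # hung process tree while it is still in the bad state.
            return i, proc
    return None, None


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; substitute the real
    # launch command line being tested (an assumption, not shown here).
    demo_cmd = [sys.executable, "-c", "pass"]
    index, hung = run_until_hang(demo_cmd, timeout_s=10.0, max_runs=3)
    if index is None:
        print("no hang in 3 runs")
    else:
        print(f"run {index} hung, pid {hung.pid} left running")
```

The caller is responsible for cleaning up the returned process (e.g., `hung.kill()` followed by `hung.wait()`) once the inspection is done.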