open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

"mpirun --leave-session-attached" hangs in Cisco MTT runs under SLURM #3726

Open jsquyres opened 7 years ago

jsquyres commented 7 years ago

Over the past week or so, Cisco's MTT runs have hit 100% hangs (i.e., timeouts) across master, v2.0.x, v2.1.x, and v3.0.x. Jeff and Ralph narrowed the problem down to the use of --leave-session-attached in Cisco's MTT setup (a recent addition, made in an attempt to track down a different problem). Removing --leave-session-attached fixed the timeouts.

We did a bunch of investigation to try to figure out why --leave-session-attached was hanging. Here are some notes, in no particular order, on what we found:
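For anyone resuming this later (not from the original report, and hedged accordingly): slurmd debug3 output like the log below is normally controlled by the SLURM configuration rather than by mpirun/srun options. A sketch of how to turn it on, assuming admin access to the compute nodes:

```shell
# Raise slurmd verbosity via slurm.conf on the compute nodes
# (older SLURM versions may want a numeric level instead of "debug3"):
#   SlurmdDebug=debug3
# Then push the config change out to the daemons:
scontrol reconfigure

# Alternatively, for one-off debugging, run a single slurmd in the
# foreground with extra verbosity on the node under test:
slurmd -D -vvv
```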

```
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6001
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug:  task_p_slurmd_launch_request: 1634454.10 0
slurmd: launch task 1634454.10 request from 182726.25@10.3.0.254 (port 33517)
slurmd: debug3: state for jobid 1546119: ctime:1490332582 revoked:0 expires:0
slurmd: debug3: state for jobid 1549242: ctime:1490643927 revoked:0 expires:0
slurmd: debug3: state for jobid 1552625: ctime:1490923698 revoked:0 expires:0
slurmd: debug3: state for jobid 1622251: ctime:1496935165 revoked:0 expires:0
slurmd: debug3: state for jobid 1634452: ctime:1497786308 revoked:0 expires:0
slurmd: debug3: state for jobid 1634453: ctime:1497889737 revoked:1497889860 expires:1497889860
slurmd: debug3: state for jobid 1634454: ctime:1497889905 revoked:0 expires:0
slurmd: debug:  Checking credential with 276 bytes of sig data
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 0 (mpi001), parent rank -1 (NONE), children 1, depth 0, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug:  task_p_slurmd_reserve_resources: 1634454 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  Sending signal 9 to step 1634454.10
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  Sending signal 9 to step 1634454.10
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5016
slurmd: debug3: Entering _rpc_step_complete
slurmd: debug:  Entering stepd_completion, range_first = 1, range_last = 1
```

What's this RPC request to signal the tasks? We can see that it's sending signal 9 -- but who did that? And why? And then why did srun just hang?
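One way to chase the "who sent signal 9?" question, offered here as a hedged sketch rather than something tried in the original investigation: on a node where bpftrace is available (it requires root and a reasonably modern kernel, so it may not apply to the cluster in question), the kernel's signal_generate tracepoint records every signal along with the sending process, including SIGKILLs that the dying process itself can never observe:

```shell
# Log the sender and target of every SIGKILL generated on this node.
# comm/pid are the sending task; args->comm/args->pid are the target.
sudo bpftrace -e '
tracepoint:signal:signal_generate /args->sig == 9/ {
  printf("SIGKILL: %s (pid %d) -> %s (pid %d)\n",
         comm, pid, args->comm, args->pid);
}'
```

Running this on the compute node while reproducing the hang should show whether the SIGKILL originates from slurmstepd (i.e., SLURM acting on an RPC) or from something else entirely.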

This is probably a good place to start when resuming the investigation.

rhc54 commented 7 years ago

One possibility: we do have the --kill-on-bad-exit option set on the srun command line. It could be that this option is somehow getting invoked, which would explain where the signal 9 is coming from. We could remove that option in a test branch and see if it makes a difference, if you want to try it.
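Before building a test branch, the hypothesis can be checked out-of-band with a plain srun, since srun exposes the flag directly. A minimal sketch, assuming a hypothetical two-node job (`./hello_world` and the node count are placeholders, not from the original report):

```shell
# --kill-on-bad-exit=1 tells SLURM to kill the whole step as soon as
# any task exits with a non-zero code -- this is what would deliver
# the SIGKILLs seen in the slurmd log if a task is dying early.
srun --kill-on-bad-exit=1 -N 2 ./hello_world

# Candidate test-branch behavior: disable it and see whether the
# REQUEST_SIGNAL_TASKS / signal-9 RPCs (and the hang) go away.
srun --kill-on-bad-exit=0 -N 2 ./hello_world
```

If the 0 case behaves differently, the next step would be removing the option where Open MPI's SLURM launcher adds it to the srun command line.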