Some facts:
More non-causes:
- All Slurm troubleshooting suggestions were completed without resolving the issue (in particular, note the ping comment above).
- TreeWidth: this controls how messages fan out through the cluster. For autoscaling, the Slurm docs state you should:

  > configure TreeWidth to a number at least as large as the maximum node count

  In the stackhpc-openhpc role master branch this is left unset in slurm.conf and should therefore default to 50 (as per the Slurm docs), which would meet the requirement for the size of clusters tested. However, a test was run with it explicitly set to 50, and this showed no difference in behaviour (see the config sketch after this list).
- Epilog: the stackhpc-openhpc role master branch creates an epilog which kills any user processes after the user's job has finished. The timing of the lost comms suggested this was unlikely to be the cause, but a test was run with the epilog disabled, with no difference.
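For reference, the two tests above amount to roughly the following slurm.conf changes. This is only a sketch: the epilog path shown is assumed, not confirmed from the role:

```
# Sketch of the slurm.conf settings used for the two tests above (illustrative only).
TreeWidth=50                            # set explicitly rather than relying on the documented default of 50
#Epilog=/etc/slurm/slurm.epilog.clean   # commented out to disable the user-process-killing epilog (path assumed)
```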
How to reproduce:
1. Run `sbatch -n 4 -n runhello`. This will work fine. See `/etc/log/slurmctld.log` or `/etc/var/slurmpwr.log`.
2. Run `sbatch -n 4 -n runhello` again. This will work fine in that output will appear, but `sinfo` shows the nodes as completing for ages. `slurmctld.log` for this 2nd job, with `slurmpwr.log` interspersed in the correct places:
```
--- slurmpwr.log ---
2020-01-22T15:48:22.775829: scale.py running in resume mode to change ['ohpc-compute-2', 'ohpc-compute-3']
existing compute: ['ohpc-compute-0', 'ohpc-compute-1']
target compute: ['ohpc-compute-0', 'ohpc-compute-1', 'ohpc-compute-2', 'ohpc-compute-3']
calling: terraform apply -auto-approve -var nodenames="ohpc-compute-0 ohpc-compute-1 ohpc-compute-2 ohpc-compute-3" -target='openstack_compute_instance_v2.compute["ohpc-compute-2"]' -target='openstack_compute_instance_v2.compute["ohpc-compute-3"]' -target='local_file.hosts' in /home/centos/eiffel-ohpc/terraform_ohpc
calling: ansible-playbook main.yml -i terraform_ohpc/ohpc_hosts in /home/centos/eiffel-ohpc
scale.py finished
```
```
--- slurmctld.log ---
[2020-01-22T15:50:51.032] Power save mode: 6 nodes
[2020-01-22T15:53:29.994] Node ohpc-compute-3 now responding
[2020-01-22T15:53:30.045] Node ohpc-compute-2 now responding
[2020-01-22T15:53:40.641] job_time_limit: Configuration for JobId=6 complete
[2020-01-22T15:53:40.641] Resetting JobId=6 start time for node power up
[2020-01-22T15:53:43.527] _job_complete: JobId=6 WEXITSTATUS 0
[2020-01-22T15:53:43.546] _job_complete: JobId=6 done
[2020-01-22T15:54:40.828] Resending TERMINATE_JOB request JobId=6 Nodelist=ohpc-compute-[2-3]
[2020-01-22T15:55:40.025] error: Nodes ohpc-compute-[2-3] not responding
[2020-01-22T16:00:40.051] error: Nodes ohpc-compute-[2-3] not responding
[2020-01-22T16:00:54.240] Power save mode: 6 nodes
[2020-01-22T16:05:40.206] error: Nodes ohpc-compute-[2-3] not responding
[2020-01-22T16:10:40.256] error: Nodes ohpc-compute-[2-3] not responding
[2020-01-22T16:11:04.422] Power save mode: 6 nodes
[2020-01-22T16:15:11.250] cleanup_completing: JobId=6 completion process took 1288 seconds
[2020-01-22T16:15:11.250] error: Nodes ohpc-compute-[2-3] not responding, setting DOWN
```
```
--- slurmpwr.log ---
2020-01-22T16:15:12.562373: scale.py running in suspend mode to change ['ohpc-compute-2', 'ohpc-compute-3']
existing compute: ['ohpc-compute-0', 'ohpc-compute-1', 'ohpc-compute-2', 'ohpc-compute-3']
target compute: ['ohpc-compute-0', 'ohpc-compute-1']
calling: terraform destroy -auto-approve -var nodenames="ohpc-compute-0 ohpc-compute-1" -target='openstack_compute_instance_v2.compute["ohpc-compute-2"]' -target='openstack_compute_instance_v2.compute["ohpc-compute-3"]' -target='local_file.hosts' in /home/centos/eiffel-ohpc/terraform_ohpc
calling: terraform apply -var nodenames="ohpc-compute-0 ohpc-compute-1" -auto-approve -target=local_file.hosts in /home/centos/eiffel-ohpc/terraform_ohpc
scale.py finished
```
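For context, the resume/suspend activity in slurmpwr.log is driven by slurmctld's power-save mechanism, which is configured along these lines in slurm.conf. This is only a sketch under the assumption that scale.py is invoked via SuspendProgram/ResumeProgram wrapper scripts; the paths and timeout values below are illustrative, not taken from the eiffel-ohpc config:

```
# Sketch of the slurm.conf power-save settings assumed to drive scale.py (illustrative only).
SuspendProgram=/etc/slurm/suspend.sh   # assumed wrapper invoking scale.py in suspend mode
ResumeProgram=/etc/slurm/resume.sh     # assumed wrapper invoking scale.py in resume mode
SuspendTime=300                        # seconds a node must be idle before it is powered down
ResumeTimeout=600                      # how long slurmctld waits for a resumed node to start responding
SuspendTimeout=600                     # how long slurmctld waits for a suspended node to finish powering down
```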