Some facts:
More non-causes:
- All Slurm troubleshooting suggestions were completed without resolving the issue (in particular, note the ping comment above).
- TreeWidth: this controls how messages fan out through the cluster. For autoscaling, the Slurm docs state you should:

  > configure TreeWidth to a number at least as large as the maximum node count

  In the stackhpc-openhpc role master branch this is left unset in slurm.conf and should therefore default to 50 (as per the Slurm docs), which would meet the requirement for the size of clusters tested. However, a test was run with it explicitly set to 50, and this showed no difference in behaviour (see the config sketch after this list).
- Epilog: the stackhpc-openhpc role master branch creates an epilog which kills any user processes after the user's job has finished. The timing of the lost comms suggested this was unlikely to be the cause, but a test was run with the epilog disabled, with no difference.
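For reference, the two tests above amount to roughly the following slurm.conf changes. This is only a sketch: the epilog path shown is assumed, not confirmed from the role:

```
# Sketch of the slurm.conf settings used for the two tests above (illustrative only).
TreeWidth=50                            # set explicitly rather than relying on the documented default of 50
#Epilog=/etc/slurm/slurm.epilog.clean   # commented out to disable the user-process-killing epilog (path assumed)
```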
How to reproduce:
1. Run `sbatch -n 4 -n runhello`. This will work fine. See `/etc/log/slurmctld.log` or `/etc/var/slurmpwr.log`.
2. Run `sbatch -n 4 -n runhello` again. This will work fine in that output will appear, but `sinfo` shows the nodes as completing for ages. `slurmctld.log` for this 2nd job, with `slurmpwr.log` interspersed in the correct places:
```
--- slurmpwr.log ---
2020-01-22T15:48:22.775829: scale.py running in resume mode to change ['ohpc-compute-2', 'ohpc-compute-3']
existing compute: ['ohpc-compute-0', 'ohpc-compute-1']
target compute: ['ohpc-compute-0', 'ohpc-compute-1', 'ohpc-compute-2', 'ohpc-compute-3']
calling: terraform apply -auto-approve -var nodenames="ohpc-compute-0 ohpc-compute-1 ohpc-compute-2 ohpc-compute-3" -target='openstack_compute_instance_v2.compute["ohpc-compute-2"]' -target='openstack_compute_instance_v2.compute["ohpc-compute-3"]' -target='local_file.hosts' in /home/centos/eiffel-ohpc/terraform_ohpc
calling: ansible-playbook main.yml -i terraform_ohpc/ohpc_hosts in /home/centos/eiffel-ohpc
scale.py finished
```
```
--- slurmctld.log ---
[2020-01-22T15:50:51.032] Power save mode: 6 nodes
[2020-01-22T15:53:29.994] Node ohpc-compute-3 now responding
[2020-01-22T15:53:30.045] Node ohpc-compute-2 now responding
[2020-01-22T15:53:40.641] job_time_limit: Configuration for JobId=6 complete
[2020-01-22T15:53:40.641] Resetting JobId=6 start time for node power up
[2020-01-22T15:53:43.527] _job_complete: JobId=6 WEXITSTATUS 0
[2020-01-22T15:53:43.546] _job_complete: JobId=6 done
[2020-01-22T15:54:40.828] Resending TERMINATE_JOB request JobId=6 Nodelist=ohpc-compute-[2-3]
[2020-01-22T15:55:40.025] error: Nodes ohpc-compute-[2-3] not responding
[2020-01-22T16:00:40.051] error: Nodes ohpc-compute-[2-3] not responding
[2020-01-22T16:00:54.240] Power save mode: 6 nodes
[2020-01-22T16:05:40.206] error: Nodes ohpc-compute-[2-3] not responding
[2020-01-22T16:10:40.256] error: Nodes ohpc-compute-[2-3] not responding
[2020-01-22T16:11:04.422] Power save mode: 6 nodes
[2020-01-22T16:15:11.250] cleanup_completing: JobId=6 completion process took 1288 seconds
[2020-01-22T16:15:11.250] error: Nodes ohpc-compute-[2-3] not responding, setting DOWN
```
```
--- slurmpwr.log ---
2020-01-22T16:15:12.562373: scale.py running in suspend mode to change ['ohpc-compute-2', 'ohpc-compute-3']
existing compute: ['ohpc-compute-0', 'ohpc-compute-1', 'ohpc-compute-2', 'ohpc-compute-3']
target compute: ['ohpc-compute-0', 'ohpc-compute-1']
calling: terraform destroy -auto-approve -var nodenames="ohpc-compute-0 ohpc-compute-1" -target='openstack_compute_instance_v2.compute["ohpc-compute-2"]' -target='openstack_compute_instance_v2.compute["ohpc-compute-3"]' -target='local_file.hosts' in /home/centos/eiffel-ohpc/terraform_ohpc
calling: terraform apply -var nodenames="ohpc-compute-0 ohpc-compute-1" -auto-approve -target=local_file.hosts in /home/centos/eiffel-ohpc/terraform_ohpc
scale.py finished
```
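For context, the resume/suspend activity in slurmpwr.log is driven by slurmctld's power-save mechanism, which is configured along these lines in slurm.conf. This is only a sketch under the assumption that scale.py is invoked via SuspendProgram/ResumeProgram wrapper scripts; the paths and timeout values below are illustrative, not taken from the eiffel-ohpc config:

```
# Sketch of the slurm.conf power-save settings assumed to drive scale.py (illustrative only).
SuspendProgram=/etc/slurm/suspend.sh   # assumed wrapper invoking scale.py in suspend mode
ResumeProgram=/etc/slurm/resume.sh     # assumed wrapper invoking scale.py in resume mode
SuspendTime=300                        # seconds a node must be idle before it is powered down
ResumeTimeout=600                      # how long slurmctld waits for a resumed node to start responding
SuspendTimeout=600                     # how long slurmctld waits for a suspended node to finish powering down
```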