stackhpc / eiffel-ohpc

Example OpenHPC cluster
Apache License 2.0

cloud nodes fail to respond after 2nd cycle #4

Open sjpb opened 4 years ago

sjpb commented 4 years ago

How to reproduce:

--- slurmpwr.log ---
2020-01-22T15:48:22.775829: scale.py running in resume mode to change ['ohpc-compute-2', 'ohpc-compute-3']
existing compute: ['ohpc-compute-0', 'ohpc-compute-1']
target compute: ['ohpc-compute-0', 'ohpc-compute-1', 'ohpc-compute-2', 'ohpc-compute-3']
calling: terraform apply -auto-approve -var nodenames="ohpc-compute-0 ohpc-compute-1 ohpc-compute-2 ohpc-compute-3" -target='openstack_compute_instance_v2.compute["ohpc-compute-2"]' -target='openstack_compute_instance_v2.compute["ohpc-compute-3"]' -target='local_file.hosts' in /home/centos/eiffel-ohpc/terraform_ohpc
calling: ansible-playbook main.yml -i terraform_ohpc/ohpc_hosts in /home/centos/eiffel-ohpc
scale.py finished

--- slurmctld.log --
[2020-01-22T15:50:51.032] Power save mode: 6 nodes
[2020-01-22T15:53:29.994] Node ohpc-compute-3 now responding
[2020-01-22T15:53:30.045] Node ohpc-compute-2 now responding
[2020-01-22T15:53:40.641] job_time_limit: Configuration for JobId=6 complete
[2020-01-22T15:53:40.641] Resetting JobId=6 start time for node power up
[2020-01-22T15:53:43.527] _job_complete: JobId=6 WEXITSTATUS 0
[2020-01-22T15:53:43.546] _job_complete: JobId=6 done
[2020-01-22T15:54:40.828] Resending TERMINATE_JOB request JobId=6 Nodelist=ohpc-compute-[2-3]
[2020-01-22T15:55:40.025] error: Nodes ohpc-compute-[2-3] not responding
[2020-01-22T16:00:40.051] error: Nodes ohpc-compute-[2-3] not responding
[2020-01-22T16:00:54.240] Power save mode: 6 nodes
[2020-01-22T16:05:40.206] error: Nodes ohpc-compute-[2-3] not responding
[2020-01-22T16:10:40.256] error: Nodes ohpc-compute-[2-3] not responding
[2020-01-22T16:11:04.422] Power save mode: 6 nodes
[2020-01-22T16:15:11.250] cleanup_completing: JobId=6 completion process took 1288 seconds
[2020-01-22T16:15:11.250] error: Nodes ohpc-compute-[2-3] not responding, setting DOWN

--- slurmpwr.log ---
2020-01-22T16:15:12.562373: scale.py running in suspend mode to change ['ohpc-compute-2', 'ohpc-compute-3']
existing compute: ['ohpc-compute-0', 'ohpc-compute-1', 'ohpc-compute-2', 'ohpc-compute-3']
target compute: ['ohpc-compute-0', 'ohpc-compute-1']
calling: terraform destroy -auto-approve -var nodenames="ohpc-compute-0 ohpc-compute-1" -target='openstack_compute_instance_v2.compute["ohpc-compute-2"]' -target='openstack_compute_instance_v2.compute["ohpc-compute-3"]' -target='local_file.hosts' in /home/centos/eiffel-ohpc/terraform_ohpc
calling: terraform apply -var nodenames="ohpc-compute-0 ohpc-compute-1" -auto-approve -target=local_file.hosts in /home/centos/eiffel-ohpc/terraform_ohpc
scale.py finished
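
For anyone following the slurmpwr.log entries, here is a rough sketch of the node-set arithmetic and command lines they report. It is reconstructed from the log output only and is not the actual scale.py; the function name `plan_commands` is hypothetical, and the working directories shown in the log are left out.

```python
def plan_commands(mode, existing, change):
    """Return (target, commands) matching what slurmpwr.log reports.

    Hypothetical helper reconstructed from the log lines, not from scale.py.
    Commands are argv lists, so no shell quoting is needed.
    """
    if mode == "resume":
        # Add the nodes to resume, keeping order and avoiding duplicates.
        target = list(existing) + [n for n in change if n not in existing]
    elif mode == "suspend":
        # Drop the nodes to suspend from the existing set.
        target = [n for n in existing if n not in change]
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    nodenames = "nodenames=" + " ".join(target)
    node_targets = [
        f'-target=openstack_compute_instance_v2.compute["{n}"]' for n in change
    ]
    hosts_target = "-target=local_file.hosts"

    if mode == "resume":
        commands = [
            ["terraform", "apply", "-auto-approve", "-var", nodenames,
             *node_targets, hosts_target],
            ["ansible-playbook", "main.yml", "-i", "terraform_ohpc/ohpc_hosts"],
        ]
    else:
        commands = [
            ["terraform", "destroy", "-auto-approve", "-var", nodenames,
             *node_targets, hosts_target],
            # The log shows a second apply that targets only the hosts file.
            ["terraform", "apply", "-var", nodenames, "-auto-approve",
             hosts_target],
        ]
    return target, commands
```

Calling `plan_commands("suspend", existing, change)` with the four existing nodes and `ohpc-compute-2`/`ohpc-compute-3` as the change gives the two-node target and the destroy-then-apply pair seen in the 16:15:12 entry.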

sjpb commented 4 years ago

Some facts:

sjpb commented 4 years ago

See https://bugs.schedmd.com/show_bug.cgi?id=8380

sjpb commented 4 years ago

More non-causes: