praiskup / resalloc

Allocator and manager for (expensive) resources
GNU General Public License v2.0

Resources staying in STARTING/DELETING state #134

Open praiskup opened 9 months ago

praiskup commented 9 months ago
268847 - aws_aarch64_spot_prod_00268847_20231218_005905 pool=aws_aarch64_spot_prod tags= status=STARTING releases=0 ticket=NULL
401511 - aws_aarch64_normalreserved_prod_00401511_20231228_232914 pool=aws_aarch64_normalreserved_prod tags= status=STARTING releases=0 ticket=NULL

These machines have been in the STARTING state for multiple days. The allocator evidently failed, and that failure should be detected.
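
As an illustration of what such detection could look like from the outside, here is a minimal sketch that scans listing lines in the format shown above and flags resources that have been in a transitional state for too long. It is not part of resalloc: `find_stuck`, `max_age`, and the one-hour threshold are hypothetical, and the code assumes that the `_YYYYMMDD_HHMMSS` suffix of the resource name is the time the resource was requested.

```python
import re
from datetime import datetime, timedelta

# Matches listing lines such as:
#   268847 - <name> pool=<pool> tags= status=STARTING releases=0 ticket=NULL
LINE_RE = re.compile(
    r"^(?P<id>\d+) - (?P<name>\S+) pool=(?P<pool>\S+) .*?status=(?P<status>\S+)"
)
# Assumption: the resource name ends with _YYYYMMDD_HHMMSS (request time).
STAMP_RE = re.compile(r"_(\d{8}_\d{6})$")


def find_stuck(lines, now, max_age=timedelta(hours=1),
               states=("STARTING", "DELETING")):
    """Return (id, name, status, age) for resources stuck in `states`."""
    stuck = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match["status"] not in states:
            continue
        stamp = STAMP_RE.search(match["name"])
        if not stamp:
            continue  # no timestamp in the name, age cannot be judged
        started = datetime.strptime(stamp.group(1), "%Y%m%d_%H%M%S")
        age = now - started
        if age > max_age:
            stuck.append((int(match["id"]), match["name"], match["status"], age))
    return stuck
```

Feeding the two lines above to `find_stuck` with the current time as `now` would report both resources. A real fix would more likely live in the server itself, where the state change time is known and does not have to be guessed from the name.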

praiskup commented 8 months ago

This may happen in two situations:

praiskup commented 7 months ago

A similar thing happens from time to time when deleting OpenStack instances, after the following traceback (I'm not 100% sure this is what triggers the problem):

Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/resalloc_openstack/helpers.py", line 74, in best_effort_delete
    self.delete()
  File "/usr/lib/python3.12/site-packages/resalloc_openstack/helpers.py", line 184, in delete
    self.nova_o.detach()
  File "/usr/lib/python3.12/site-packages/cinderclient/v3/volumes_base.py", line 69, in detach
    return self.manager.detach(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/cinderclient/v3/volumes_base.py", line 285, in detach
    return self._action('os-detach', volume,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/cinderclient/v3/volumes_base.py", line 257, in _action
    resp, body = self.api.client.post(url, body=body)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/cinderclient/client.py", line 223, in post
    return self._cs_request(url, 'POST', **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/cinderclient/client.py", line 211, in _cs_request
    return self.request(url, method, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/cinderclient/client.py", line 197, in request
    raise exceptions.from_response(resp, body)
cinderclient.exceptions.ClientException: The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-1e419934-999b-4256-a07e-a6d5e369b9c5)
failed to delete in #1 attempt

praiskup commented 7 months ago

No, that would be a different issue. I'm not sure what happened, but starting the instance that is now in the DELETING state failed:

+ ansible-playbook init.yml -i 10.0.150.201,
ERROR! the playbook: init.yml could not be found
running cleanup
cleaning 05_copr_vm_production_psi_os_00544952_20240229_172705_1
cleaning 10_server
deleting server 9e95963e-642b-41f3-b771-82411eba2386
Traceback (most recent call last):
  File "/usr/bin/resalloc-openstack-new", line 22, in <module>
    main() 
  File "/usr/lib/python3.12/site-packages/resalloc_openstack/new/main.py", line 131, in main
    check_call(args.command, env=env, shell=True, stdin=DEVNULL)
  File "/usr/lib64/python3.12/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -x ; ansible-playbook init.yml -i "$RESALLOC_OS_IP," >&2 </dev/null' returned non-zero exit status 1.

... it probably stayed in STARTING because of this bug. Then I restarted resalloc, and after that it stayed in the DELETING state:

=== /var/log/resallocserver/hooks/544952_terminate ===
initializing <class 'resalloc_openstack.helpers.Server'>
vm copr_vm_production_psi_os_00544952_20240229_172705 not found
initializing <class 'resalloc_openstack.helpers.Server'>
vm copr_vm_production_psi_os_00544952_20240229_172705 not found
initializing <class 'resalloc_openstack.helpers.Server'>
vm copr_vm_production_psi_os_00544952_20240229_172705 not found
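
The log above does not show the hook's exit status, so the following is only a guess at one possible cause: if the terminate hook reports failure when the VM no longer exists, the resource can never leave DELETING. A minimal sketch of an idempotent terminate step, which treats "not found" as "already deleted", could look like this. `NotFound`, `terminate`, `lookup`, and `delete` are hypothetical names, not the resalloc_openstack API.

```python
class NotFound(Exception):
    """Hypothetical stand-in for a cloud client's 'no such server' error."""


def terminate(name, lookup, delete):
    """Delete the VM called `name`; return 0 once the VM is gone.

    `lookup(name)` returns a server handle or raises NotFound, and
    `delete(server)` removes it. Both are injected, hypothetical callables.
    """
    try:
        server = lookup(name)
    except NotFound:
        # The cleanup after a failed start may already have removed the VM,
        # so there is nothing left to do and that counts as success.
        print(f"vm {name} not found, nothing to delete")
        return 0
    delete(server)
    return 0
```

With this shape, a hook that runs after the cleanup has already removed the server exits with 0, so the caller has no reason to keep retrying.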