lsst-uk / somerville-operations

User issue reporting and tracking for the Somerville Cloud

terraform deployment issue - qserv testing #176

Closed GregBlow closed 4 months ago

GregBlow commented 4 months ago

Every attempt to deploy a 14-node qserv cluster fails because some nodes never become ready, e.g.:

│ Error: Error waiting for instance (7bc7a6b8-ae1a-49ab-bbc1-6d309e3e4323) to become ready: unexpected state 'ERROR', wanted target 'ACTIVE'. last error: %!s(<nil>)
│
│   with openstack_compute_instance_v2.worker[7],
│   on main.tf line 134, in resource "openstack_compute_instance_v2" "worker":
│  134: resource "openstack_compute_instance_v2" "worker" {
│
╵
╷
│ Error: Error waiting for instance (1be79feb-1a9a-4b47-baaf-05e762ddd697) to become ready: unexpected state 'ERROR', wanted target 'ACTIVE'. last error: %!s(<nil>)
│
│   with openstack_compute_instance_v2.worker[0],
│   on main.tf line 134, in resource "openstack_compute_instance_v2" "worker":
│  134: resource "openstack_compute_instance_v2" "worker" {
│

(On repeat attempts, different nodes fail; no single node fails every time.)
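To compare which workers fail from one attempt to the next, the Terraform error output can be reduced to a list of resource addresses and instance IDs. This is a minimal sketch, not part of the deployment code: the function name `failed_instances` is hypothetical, and it assumes every error block contains both the "Error waiting for instance" line and the "with ..." line, in the layout shown above.

```python
import re

# Instance UUID from Terraform's "Error waiting for instance (...)" line.
INSTANCE_RE = re.compile(r"Error waiting for instance \(([0-9a-f-]{36})\)")
# Resource address, e.g. openstack_compute_instance_v2.worker[7].
RESOURCE_RE = re.compile(r"with (openstack_compute_instance_v2\.\w+\[\d+\])")


def failed_instances(terraform_output: str) -> list[tuple[str, str]]:
    """Return (resource address, instance UUID) pairs in order of appearance.

    Assumption: each error block has exactly one UUID line and one
    "with <resource>" line, so the two lists can be paired positionally.
    """
    ids = INSTANCE_RE.findall(terraform_output)
    resources = RESOURCE_RE.findall(terraform_output)
    return list(zip(resources, ids))
```

Running this over the saved output of each `terraform apply` attempt and comparing the resource addresses would show whether the failing worker indices really change between runs.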

GregBlow commented 4 months ago

~2/3 nodes fail on every attempt.

GregBlow commented 4 months ago

The process appears to fail before the Ceph volumes are created.

GregBlow commented 4 months ago

Apparent placement issue:

2024-07-05 15:55:11.631 18 WARNING nova.scheduler.utils [None req-f2f0b68b-bd3b-46f3-927a-7fe62b85cb6d ea36706f7f188e8ed8d1ee96d8b6c26027dc6e102b651dab57498532dde7d642 9168c636eaec419f807c46f1454e87a9 - - default default] Failed to compute_task_build_instances: No valid host was found.
Traceback (most recent call last):

  File "/var/lib/kolla/venv/lib64/python3.9/site-packages/oslo_messaging/rpc/server.py", line 244, in inner
    return func(*args, **kwargs)

  File "/var/lib/kolla/venv/lib64/python3.9/site-packages/nova/scheduler/manager.py", line 210, in select_destinations
    raise exception.NoValidHost(reason="")

nova.exception.NoValidHost: No valid host was found.
: nova.exception_Remote.NoValidHost_Remote: No valid host was found.
2024-07-05 15:55:11.633 18 WARNING nova.scheduler.utils [None req-f2f0b68b-bd3b-46f3-927a-7fe62b85cb6d ea36706f7f188e8ed8d1ee96d8b6c26027dc6e102b651dab57498532dde7d642 9168c636eaec419f807c46f1454e87a9 - - default default] [instance: 8a588032-8cb0-4a8a-b3c6-a54013e13931] Setting instance to ERROR state.: nova.exception_Remote.NoValidHost_Remote: No valid host was found.
2024-07-05 15:55:11.816 19 WARNING nova.scheduler.utils [None req-0f801e14-1074-4934-8de5-94dc59390dc3 ea36706f7f188e8ed8d1ee96d8b6c26027dc6e102b651dab57498532dde7d642 9168c636eaec419f807c46f1454e87a9 - - default default] Failed to compute_task_build_instances: No valid host was found.
Traceback (most recent call last):

  File "/var/lib/kolla/venv/lib64/python3.9/site-packages/oslo_messaging/rpc/server.py", line 244, in inner
    return func(*args, **kwargs)

  File "/var/lib/kolla/venv/lib64/python3.9/site-packages/nova/scheduler/manager.py", line 210, in select_destinations
    raise exception.NoValidHost(reason="")

nova.exception.NoValidHost: No valid host was found.
: nova.exception_Remote.NoValidHost_Remote: No valid host was found.
2024-07-05 15:55:11.817 19 WARNING nova.scheduler.utils [None req-0f801e14-1074-4934-8de5-94dc59390dc3 ea36706f7f188e8ed8d1ee96d8b6c26027dc6e102b651dab57498532dde7d642 9168c636eaec419f807c46f1454e87a9 - - default default] [instance: 5d9e4278-622c-49ed-9c92-593626703faf] Setting instance to ERROR state.: nova.exception_Remote.NoValidHost_Remote: No valid host was found.
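
To see how many instances hit this in a single deployment attempt, the "Setting instance to ERROR state" lines can be pulled out of the log. This is a rough sketch that assumes the log line layout shown above; the function name `errored_instances` is hypothetical and the regular expression has only been matched against these excerpts.

```python
import re

# Captures the request ID, the instance UUID, and the failure reason from
# lines where nova marks an instance as failed.
ERROR_RE = re.compile(
    r"\[None (req-[0-9a-f-]+) .*?\] "
    r"\[instance: ([0-9a-f-]{36})\] "
    r"Setting instance to ERROR state\.: (.*)$"
)


def errored_instances(log_text: str) -> list[tuple[str, str, str]]:
    """Return (request ID, instance UUID, reason) for each errored instance."""
    results = []
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            results.append(match.groups())
    return results
```

The instance UUIDs returned here could then be matched against the IDs in the Terraform errors to confirm that each failed worker corresponds to a NoValidHost scheduling failure.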