metal3-io / baremetal-operator

Bare metal host provisioning integration for Kubernetes
Apache License 2.0

registration failing with host busy #930

Closed maelk closed 3 years ago

maelk commented 3 years ago

In some rare cases, registration of already-provisioned BMHs in Ironic fails with the following error message:

{"level":"info","ts":1625126381.8725624,"logger":"controllers.BareMetalHost","msg":"Retrying registration","baremetalhost":"<>","provisioningState":"provisioned"}
{"level":"info","ts":1625126381.8725815,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"<>","provisioningState":"provisioned","credentials":{"credentials":{"name":"<>","namespace":"<>"},"credentialsVersion":"109094"}}
{"level":"info","ts":1625126381.9132385,"logger":"provisioner.ironic","msg":"could not update node settings in ironic, busy","host":"<>"}
{"level":"info","ts":1625126381.9132583,"logger":"controllers.BareMetalHost","msg":"host not ready","baremetalhost":"<>","provisioningState":"provisioned","wait":10}
{"level":"info","ts":1625126381.913269,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"<>","provisioningState":"provisioned","requeue":true,"after":10}

The problem is difficult to reproduce, but we have already hit it multiple times, after Ironic comes up while BMHs are already provisioned (for example during a pivoting operation). When the issue happens, it affects many BMHs, but not all (38 out of 49 the last time we hit it). The nodes are all in `active` state:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: "2021-06-29T06:13:39Z"
  finalizers:
  - baremetalhost.metal3.io
  generation: 1
  labels:
    cluster.x-k8s.io/cluster-name: <>
  name: <>
  namespace: <>
  ownerReferences:
  - apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
    controller: true
    kind: Metal3Machine
    name: <>
    uid: 515c3398-bd49-43fa-a047-5ed37900e9a4
  resourceVersion: "6192099"
  selfLink: /apis/metal3.io/v1alpha1/namespaces/<>/baremetalhosts/<>
  uid: e4e08984-73da-4910-a966-1e20f745781f
spec:
  automatedCleaningMode: metadata
  bmc:
    address: redfish+https://<>
    credentialsName: <>
    disableCertificateVerification: true
  bootMACAddress: 78:ac:44:1b:20:20
  bootMode: UEFI
  consumerRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
    kind: Metal3Machine
    name: <>
    namespace: <>
  image:
    checksum: http://172.18.0.2:6181/<>.img.md5sum
    url: http://172.18.0.2:6181/<>.img
  metaData:
    name: <>-metadata
    namespace: <>
  online: true
  rootDeviceHints:
    deviceName: /dev/sda
  userData:
    name: <>
    namespace: <>
status:
  errorCount: 0
  errorMessage: ""
  goodCredentials:
    credentials:
      name: <>-secret
      namespace: <>
    credentialsVersion: "99257"
  hardware: <>
  hardwareProfile: unknown
  lastUpdated: "2021-06-29T06:53:02Z"
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: "2021-06-29T04:02:05Z"
      start: "2021-06-29T03:51:04Z"
    provision:
      end: "2021-06-29T05:21:30Z"
      start: "2021-06-29T05:13:23Z"
    register:
      end: null
      start: "2021-06-29T06:15:52Z"
  operationalStatus: OK
  poweredOn: true
  provisioning:
    ID: 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5
    bootMode: UEFI
    image:
      checksum: http://172.18.0.2:6181/<>.img.md5sum
      url: http://172.18.0.2:6181/<>.img
    rootDeviceHints:
      deviceName: /dev/sda
    state: provisioned
  triedCredentials:
    credentials:
      name: <>-secret
      namespace: <>
    credentialsVersion: "107911"

The corresponding Ironic node:

{
  "uuid": "03c7da6d-5a23-454d-a03f-ca25fdbfa5c5",
  "created_at": "2021-06-29T06:11:59+00:00",
  "updated_at": "2021-07-01T09:31:12+00:00",
  "console_enabled": false,
  "driver": "redfish",
  "driver_info": {
    "deploy_kernel": "http://172.18.0.2:6180/images/ironic-python-agent.kernel",
    "deploy_ramdisk": "http://172.18.0.2:6180/images/ironic-python-agent.initramfs",
    "redfish_address": "https://<address>",
    "redfish_password": "******",
    "redfish_system_id": "<uri>",
    "redfish_username": "<username>",
    "redfish_verify_ca": false
  },
  "extra": {},
  "instance_info": {
    "capabilities": {},
    "image_source": "http://172.18.0.2:6181/<image-name>.img",
    "image_os_hash_algo": "md5",
    "image_os_hash_value": "http://172.18.0.2:6181/<image-name>.img.md5sum",
    "image_checksum": "http://172.18.0.2:6181/<image-name>.img.md5sum"
  },
  "instance_uuid": "deb4342f-5e18-43e9-b27c-24954da9d95e",
  "last_error": null,
  "maintenance": false,
  "maintenance_reason": null,
  "power_state": "power on",
  "properties": {
    "capabilities": "boot_mode:uefi",
    "vendor": "DELL"
  },
  "provision_state": "active",
  "provision_updated_at": "2021-06-29T06:12:30+00:00",
  "reservation": null,
  "target_power_state": null,
  "target_provision_state": null,
  "links": [
    {
      "href": "http://172.18.0.2:6385/v1/nodes/03c7da6d-5a23-454d-a03f-ca25fdbfa5c5",
      "rel": "self"
    },
    {
      "href": "http://172.18.0.2:6385/nodes/03c7da6d-5a23-454d-a03f-ca25fdbfa5c5",
      "rel": "bookmark"
    }
  ],
  "chassis_uuid": null,
  "ports": [
    {
      "href": "http://172.18.0.2:6385/v1/nodes/03c7da6d-5a23-454d-a03f-ca25fdbfa5c5/ports",
      "rel": "self"
    },
    {
      "href": "http://172.18.0.2:6385/nodes/03c7da6d-5a23-454d-a03f-ca25fdbfa5c5/ports",
      "rel": "bookmark"
    }
  ]
}

For comparison, a node that does not have the problem:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: "2021-06-29T06:13:17Z"
  finalizers:
  - baremetalhost.metal3.io
  generation: 5
  labels:
    cluster.x-k8s.io/cluster-name: <>
  name: <>
  namespace: <>
  ownerReferences:
  - apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
    controller: true
    kind: Metal3Machine
    name: <>
    uid: cd1b203f-b807-4ae4-94ac-d38ce0b47cfe
  resourceVersion: "167781"
  selfLink: /apis/metal3.io/v1alpha1/namespaces/<>/baremetalhosts/<>
  uid: f755d123-c47e-4779-bc90-5d6144bb08a0
spec:
  automatedCleaningMode: metadata
  bmc:
    address: redfish+https://<>
    credentialsName: <>-secret
    disableCertificateVerification: true
  bootMACAddress: 90:38:09:46:a6:1c
  bootMode: UEFI
  consumerRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
    kind: Metal3Machine
    name: <>
    namespace: <>
  image:
    checksum: http://172.18.0.2:6181/<>.img.md5sum
    url: http://172.18.0.2:6181/<>.img
  metaData:
    name: <>-metadata
    namespace: <>
  online: true
  rootDeviceHints:
    deviceName: /dev/sda
  userData:
    name: <>
    namespace: <>
status:
  errorCount: 0
  errorMessage: ""
  goodCredentials:
    credentials:
      name: <>-secret
      namespace: <>
    credentialsVersion: "109534"
  hardware: <>
  hardwareProfile: unknown
  lastUpdated: "2021-06-29T06:53:41Z"
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: "2021-06-29T04:20:48Z"
      start: "2021-06-29T04:11:21Z"
    provision:
      end: "2021-06-29T06:30:06Z"
      start: "2021-06-29T06:17:16Z"
    register:
      end: "2021-06-29T06:53:41Z"
      start: "2021-06-29T06:53:40Z"
  operationalStatus: OK
  poweredOn: true
  provisioning:
    ID: 8676e5ae-eda1-45be-a440-d9bc42d18d05
    bootMode: UEFI
    image:
      checksum: http://172.18.0.2:6181/<>.img.md5sum
      url: http://172.18.0.2:6181/<>.img
    rootDeviceHints:
      deviceName: /dev/sda
    state: provisioned
  triedCredentials:
    credentials:
      name: <>-secret
      namespace: <>
    credentialsVersion: "109534"

and the corresponding Ironic node:

{
  "uuid": "8676e5ae-eda1-45be-a440-d9bc42d18d05",
  "created_at": "2021-06-29T06:12:15+00:00",
  "updated_at": "2021-06-29T07:21:06+00:00",
  "console_enabled": false,
  "driver": "redfish",
  "driver_info": {
    "deploy_kernel": "http://172.18.0.2:6180/images/ironic-python-agent.kernel",
    "deploy_ramdisk": "http://172.18.0.2:6180/images/ironic-python-agent.initramfs",
    "redfish_address": "https://<address>",
    "redfish_password": "******",
    "redfish_system_id": "<uri>",
    "redfish_username": "<username>",
    "redfish_verify_ca": false
  },
  "extra": {},
  "instance_info": {
    "capabilities": {},
    "image_source": "http://172.18.0.2:6181/<image-name>.img",
    "image_os_hash_algo": "md5",
    "image_os_hash_value": "http://172.18.0.2:6181/<image-name>.img.md5sum",
    "image_checksum": "http://172.18.0.2:6181/<image-name>.img.md5sum",
    "configdrive": "******",
    "image_url": "******",
    "image_type": "whole-disk-image"
  },
  "instance_uuid": "f755d123-c47e-4779-bc90-5d6144bb08a0",
  "last_error": null,
  "maintenance": false,
  "maintenance_reason": null,
  "power_state": "power on",
  "properties": {
    "capabilities": "boot_mode:uefi",
    "vendor": "Ericsson",
    "local_gb": 50,
    "root_device": {
      "name": "s== /dev/sda"
    },
    "cpu_arch": "x86_64"
  },
  "provision_state": "active",
  "provision_updated_at": "2021-06-29T06:30:00+00:00",
  "reservation": null,
  "target_power_state": null,
  "target_provision_state": null,
  "links": [
    {
      "href": "http://172.18.0.2:6385/v1/nodes/8676e5ae-eda1-45be-a440-d9bc42d18d05",
      "rel": "self"
    },
    {
      "href": "http://172.18.0.2:6385/nodes/8676e5ae-eda1-45be-a440-d9bc42d18d05",
      "rel": "bookmark"
    }
  ],
  "chassis_uuid": null,
  "ports": [
    {
      "href": "http://172.18.0.2:6385/v1/nodes/8676e5ae-eda1-45be-a440-d9bc42d18d05/ports",
      "rel": "self"
    },
    {
      "href": "http://172.18.0.2:6385/nodes/8676e5ae-eda1-45be-a440-d9bc42d18d05/ports",
      "rel": "bookmark"
    }
  ]
}

The main difference seems to be in the node properties: on the healthy node the root device hints (and the other inspected properties) were applied, while on the affected node they are missing.
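To make the difference concrete, here is a small illustrative sketch (not BMO code) that diffs the `properties` maps taken from the two Ironic node dumps above; the keys it reports are the ones that were never written back to the affected node:

```python
# Properties copied from the two Ironic node dumps above.
affected = {
    "capabilities": "boot_mode:uefi",
    "vendor": "DELL",
}
healthy = {
    "capabilities": "boot_mode:uefi",
    "vendor": "Ericsson",
    "local_gb": 50,
    "root_device": {"name": "s== /dev/sda"},
    "cpu_arch": "x86_64",
}

# Keys present on the healthy node but missing on the affected one.
missing = sorted(set(healthy) - set(affected))
print(missing)  # → ['cpu_arch', 'local_gb', 'root_device']
```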

maelk commented 3 years ago

Restarting Ironic with a DB wipe is enough to solve the issue.

maelk commented 3 years ago

In the Ironic logs, we can see the following:

2021-07-01 10:13:15.363 1 DEBUG ironic.conductor.manager [req-2ae1e8e9-b19c-4d20-bf5f-11b6228907fd - - - - -] RPC update_node called for node 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5. update_node /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:187
2021-07-01 10:13:15.375 1 DEBUG ironic.conductor.task_manager [req-2ae1e8e9-b19c-4d20-bf5f-11b6228907fd - - - - -] Attempting to get exclusive lock on node 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5 (for node update) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:233
2021-07-01 10:13:15.481 1 DEBUG ironic.conductor.task_manager [req-2ae1e8e9-b19c-4d20-bf5f-11b6228907fd - - - - -] Node 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5 successfully reserved for node update (took 0.11 seconds) reserve_node /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:350
2021-07-01 10:13:15.490 1 DEBUG ironic.conductor.task_manager [req-2ae1e8e9-b19c-4d20-bf5f-11b6228907fd - - - - -] Successfully released exclusive lock for node update on node 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5 (lock was held 0.01 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:447
2021-07-01 10:13:15.490 1 DEBUG ironic_lib.json_rpc.server [req-2ae1e8e9-b19c-4d20-bf5f-11b6228907fd - - - - -] RPC error NodeAssociated: Node 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5 is associated with instance deb4342f-5e18-43e9-b27c-24954da9d95e. _handle_error /usr/lib/python3.6/site-packages/ironic_lib/json_rpc/server.py:179
2021-07-01 10:13:25.937 1 DEBUG ironic_lib.json_rpc.server [req-f1727bee-3f34-4d8d-af6c-cb915763dbb7 - - - - -] RPC update_node with {'node_obj': {'ironic_object.name': 'Node', 'ironic_object.namespace': 'ironic', 'ironic_object.version': '1.35', 'ironic_object.data': {'id': 11, 'uuid': '03c7da6d-5a23-454d-a03f-ca25fdbfa5c5', 'name': '<>', 'chassis_id': None, 'instance_uuid': 'e4e08984-73da-4910-a966-1e20f745781f', 'driver': 'redfish', 'driver_info': {'deploy_kernel': 'http://172.18.0.2:6180/images/ironic-python-agent.kernel', 'deploy_ramdisk': 'http://172.18.0.2:6180/images/ironic-python-agent.initramfs', 'redfish_address': 'https://<>', 'redfish_password': '***', 'redfish_system_id': '<>', 'redfish_username': '<>', 'redfish_verify_ca': False}, 'driver_internal_info': {'is_whole_disk_image': True}, 'clean_step': {}, 'deploy_step': {}, 'raid_config': {}, 'target_raid_config': {}, 'instance_info': {'capabilities': {}, 'image_source': 'http://172.18.0.2:6181/<>.img', 'image_os_hash_algo': 'md5', 'image_os_hash_value': 'http://172.18.0.2:6181/<>.img.md5sum', 'image_checksum': 'http://172.18.0.2:6181/<>.img.md5sum'}, 'properties': {'capabilities': 'boot_mode:uefi', 'vendor': 'DELL'}, 'reservation': None, 'conductor_affinity': 1, 'conductor_group': '', 'power_state': 'power on', 'target_power_state': None, 'provision_state': 'active', 'provision_updated_at': '2021-06-29T06:12:30Z', 'target_provision_state': None, 'maintenance': False, 'maintenance_reason': None, 'fault': None, 'console_enabled': False, 'last_error': None, 'resource_class': None, 'inspection_finished_at': None, 'inspection_started_at': None, 'extra': {}, 'automated_clean': True, 'protected': False, 'protected_reason': None, 'allocation_id': None, 'bios_interface': 'no-bios', 'boot_interface': 'ipxe', 'console_interface': 'no-console', 'deploy_interface': 'direct', 'inspect_interface': 'inspector', 'management_interface': 'redfish', 'network_interface': 'noop', 'power_interface': 'redfish', 
'raid_interface': 'no-raid', 'rescue_interface': 'no-rescue', 'storage_interface': 'noop', 'vendor_interface': 'no-vendor', 'traits': {'ironic_object.name': 'TraitList', 'ironic_object.namespace': 'ironic', 'ironic_object.version': '1.0', 'ironic_object.data': {'objects': []}}, 'owner': None, 'lessee': None, 'description': None, 'retired': False, 'retired_reason': None, 'network_data': {}, 'created_at': '2021-06-29T06:11:59Z', 'updated_at': '2021-07-01T10:13:15Z'}, 'ironic_object.changes': ['instance_uuid']}, 'reset_interfaces': None, 'context': {'user': None, 'tenant': None, 'system_scope': None, 'project': None, 'domain': None, 'user_domain': None, 'project_domain': None, 'is_admin': False, 'read_only': False, 'show_deleted': False, 'auth_token': '***', 'request_id': 'req-f1727bee-3f34-4d8d-af6c-cb915763dbb7', 'global_request_id': None, 'resource_uuid': None, 'roles': [], 'user_identity': '- - - - -', 'is_admin_project': True}} _handle_requests /usr/lib/python3.6/site-packages/ironic_lib/json_rpc/server.py:279

The relevant error seems to be `RPC error NodeAssociated: Node 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5 is associated with instance deb4342f-5e18-43e9-b27c-24954da9d95e.`
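As a simplified illustration (this is not the actual Ironic code), the guard that produces the `NodeAssociated` error behaves roughly like this: an update that would change `instance_uuid` while the node is still associated with another instance is rejected.

```python
class NodeAssociated(Exception):
    """Raised when a node is already associated with another instance."""


def update_instance_uuid(node, new_uuid):
    """Set instance_uuid on a node dict, refusing to overwrite an
    existing association with a different instance (simplified sketch)."""
    current = node.get("instance_uuid")
    if current is not None and current != new_uuid:
        raise NodeAssociated(
            f"Node {node['uuid']} is associated with instance {current}."
        )
    node["instance_uuid"] = new_uuid
    return node


# The situation from the logs: BMO tries to set the BMH uid as
# instance_uuid, but the node still carries the old association.
node = {"uuid": "03c7da6d", "instance_uuid": "deb4342f"}
try:
    update_instance_uuid(node, "e4e08984")
except NodeAssociated as exc:
    print(exc)  # → Node 03c7da6d is associated with instance deb4342f.
```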

maelk commented 3 years ago

After a BMO restart, we can see one additional line in the BMO logs:

{"level":"info","ts":1625140464.6687908,"logger":"controllers.BareMetalHost","msg":"Retrying registration","baremetalhost":"<>","provisioningState":"provisioned"}
{"level":"info","ts":1625140464.6688247,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"<>","provisioningState":"provisioned","credentials":{"credentials":{"name":"<>","namespace":"<>"},"credentialsVersion":"107911"}}
{"level":"info","ts":1625140464.72969,"logger":"provisioner.ironic","msg":"updating option data","host":"<>","option":"instance_uuid","value":"e4e08984-73da-4910-a966-1e20f745781f","old_value":"deb4342f-5e18-43e9-b27c-24954da9d95e"}
{"level":"info","ts":1625140464.7298136,"logger":"provisioner.ironic","msg":"updating node settings in ironic","host":"<>"}
{"level":"info","ts":1625140465.0130422,"logger":"provisioner.ironic","msg":"could not update node settings in ironic, busy","host":"<>"}
{"level":"info","ts":1625140465.013073,"logger":"controllers.BareMetalHost","msg":"host not ready","baremetalhost":"<>","provisioningState":"provisioned","wait":10}
{"level":"info","ts":1625140465.0130858,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"<>","provisioningState":"provisioned","requeue":true,"after":10}

So it looks like BMO is trying to update the instance_uuid, and Ironic does not allow that while the node is still associated with the old instance.
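A hypothetical client-side sketch (not BMO's actual code) of how such an update could be avoided: when building the update for an Ironic node, skip any `instance_uuid` change if the node already carries a different association, since that is exactly the patch Ironic rejects above.

```python
def build_patch(node, desired):
    """Build a JSON-patch-style update list from desired settings,
    dropping instance_uuid changes on an already-associated node
    (hypothetical sketch, for illustration only)."""
    patch = []
    for key, value in desired.items():
        if key == "instance_uuid" and node.get("instance_uuid") not in (None, value):
            # Changing instance_uuid on an associated node raises
            # NodeAssociated, so leave it for a separate reconciliation.
            continue
        if node.get(key) != value:
            patch.append({"op": "add", "path": f"/{key}", "value": value})
    return patch


node = {"uuid": "03c7da6d", "instance_uuid": "deb4342f", "maintenance": False}
print(build_patch(node, {"instance_uuid": "e4e08984", "maintenance": False}))  # → []
```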

maelk commented 3 years ago

This seems to be an error coming from the pivoting flow. Closing this issue for now. I might open another one regarding the instance_uuid handling in case of mismatch.

/close

metal3-io-bot commented 3 years ago

@maelk: Closing this issue.

In response to [this](https://github.com/metal3-io/baremetal-operator/issues/930#issuecomment-872807176):

> This seems to be an error coming from the pivoting flow. Closing this issue for now. I might open another one regarding the instance_uuid handling in case of mismatch
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.