maelk closed this issue 3 years ago
Restarting Ironic with a DB wipe is enough to solve the issue.
In the Ironic logs, we can see the following:
2021-07-01 10:13:15.363 1 DEBUG ironic.conductor.manager [req-2ae1e8e9-b19c-4d20-bf5f-11b6228907fd - - - - -] RPC update_node called for node 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5. update_node /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:187
2021-07-01 10:13:15.375 1 DEBUG ironic.conductor.task_manager [req-2ae1e8e9-b19c-4d20-bf5f-11b6228907fd - - - - -] Attempting to get exclusive lock on node 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5 (for node update) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:233
2021-07-01 10:13:15.481 1 DEBUG ironic.conductor.task_manager [req-2ae1e8e9-b19c-4d20-bf5f-11b6228907fd - - - - -] Node 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5 successfully reserved for node update (took 0.11 seconds) reserve_node /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:350
2021-07-01 10:13:15.490 1 DEBUG ironic.conductor.task_manager [req-2ae1e8e9-b19c-4d20-bf5f-11b6228907fd - - - - -] Successfully released exclusive lock for node update on node 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5 (lock was held 0.01 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:447
2021-07-01 10:13:15.490 1 DEBUG ironic_lib.json_rpc.server [req-2ae1e8e9-b19c-4d20-bf5f-11b6228907fd - - - - -] RPC error NodeAssociated: Node 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5 is associated with instance deb4342f-5e18-43e9-b27c-24954da9d95e. _handle_error /usr/lib/python3.6/site-packages/ironic_lib/json_rpc/server.py:179
2021-07-01 10:13:25.937 1 DEBUG ironic_lib.json_rpc.server [req-f1727bee-3f34-4d8d-af6c-cb915763dbb7 - - - - -] RPC update_node with {'node_obj': {'ironic_object.name': 'Node', 'ironic_object.namespace': 'ironic', 'ironic_object.version': '1.35', 'ironic_object.data': {'id': 11, 'uuid': '03c7da6d-5a23-454d-a03f-ca25fdbfa5c5', 'name': '<>', 'chassis_id': None, 'instance_uuid': 'e4e08984-73da-4910-a966-1e20f745781f', 'driver': 'redfish', 'driver_info': {'deploy_kernel': 'http://172.18.0.2:6180/images/ironic-python-agent.kernel', 'deploy_ramdisk': 'http://172.18.0.2:6180/images/ironic-python-agent.initramfs', 'redfish_address': 'https://<>', 'redfish_password': '***', 'redfish_system_id': '<>', 'redfish_username': '<>', 'redfish_verify_ca': False}, 'driver_internal_info': {'is_whole_disk_image': True}, 'clean_step': {}, 'deploy_step': {}, 'raid_config': {}, 'target_raid_config': {}, 'instance_info': {'capabilities': {}, 'image_source': 'http://172.18.0.2:6181/<>.img', 'image_os_hash_algo': 'md5', 'image_os_hash_value': 'http://172.18.0.2:6181/<>.img.md5sum', 'image_checksum': 'http://172.18.0.2:6181/<>.img.md5sum'}, 'properties': {'capabilities': 'boot_mode:uefi', 'vendor': 'DELL'}, 'reservation': None, 'conductor_affinity': 1, 'conductor_group': '', 'power_state': 'power on', 'target_power_state': None, 'provision_state': 'active', 'provision_updated_at': '2021-06-29T06:12:30Z', 'target_provision_state': None, 'maintenance': False, 'maintenance_reason': None, 'fault': None, 'console_enabled': False, 'last_error': None, 'resource_class': None, 'inspection_finished_at': None, 'inspection_started_at': None, 'extra': {}, 'automated_clean': True, 'protected': False, 'protected_reason': None, 'allocation_id': None, 'bios_interface': 'no-bios', 'boot_interface': 'ipxe', 'console_interface': 'no-console', 'deploy_interface': 'direct', 'inspect_interface': 'inspector', 'management_interface': 'redfish', 'network_interface': 'noop', 'power_interface': 'redfish', 
'raid_interface': 'no-raid', 'rescue_interface': 'no-rescue', 'storage_interface': 'noop', 'vendor_interface': 'no-vendor', 'traits': {'ironic_object.name': 'TraitList', 'ironic_object.namespace': 'ironic', 'ironic_object.version': '1.0', 'ironic_object.data': {'objects': []}}, 'owner': None, 'lessee': None, 'description': None, 'retired': False, 'retired_reason': None, 'network_data': {}, 'created_at': '2021-06-29T06:11:59Z', 'updated_at': '2021-07-01T10:13:15Z'}, 'ironic_object.changes': ['instance_uuid']}, 'reset_interfaces': None, 'context': {'user': None, 'tenant': None, 'system_scope': None, 'project': None, 'domain': None, 'user_domain': None, 'project_domain': None, 'is_admin': False, 'read_only': False, 'show_deleted': False, 'auth_token': '***', 'request_id': 'req-f1727bee-3f34-4d8d-af6c-cb915763dbb7', 'global_request_id': None, 'resource_uuid': None, 'roles': [], 'user_identity': '- - - - -', 'is_admin_project': True}} _handle_requests /usr/lib/python3.6/site-packages/ironic_lib/json_rpc/server.py:279
The error seems to be RPC error NodeAssociated: Node 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5 is associated with instance deb4342f-5e18-43e9-b27c-24954da9d95e.
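To check how many nodes hit this error in a larger log, a small script can pull the node and instance UUIDs out of each NodeAssociated line. This is only a sketch: the regular expression is modelled on the single log line quoted above, and the function name `find_node_associated` is made up for illustration.

```python
import re

# Pattern modelled on the NodeAssociated message quoted in the log above.
# Assumption: other Ironic versions may word the message differently.
NODE_ASSOCIATED_RE = re.compile(
    r"NodeAssociated: Node (?P<node>[0-9a-f-]{36}) "
    r"is associated with instance (?P<instance>[0-9a-f-]{36})"
)


def find_node_associated(lines):
    """Return (node_uuid, instance_uuid) pairs, one per NodeAssociated error."""
    pairs = []
    for line in lines:
        match = NODE_ASSOCIATED_RE.search(line)
        if match:
            pairs.append((match.group("node"), match.group("instance")))
    return pairs


# Sample input: the error line from the Ironic log, shortened to the relevant part.
sample = [
    "2021-07-01 10:13:15.490 1 DEBUG ironic_lib.json_rpc.server "
    "RPC error NodeAssociated: Node 03c7da6d-5a23-454d-a03f-ca25fdbfa5c5 "
    "is associated with instance deb4342f-5e18-43e9-b27c-24954da9d95e. "
    "_handle_error",
]

for node, instance in find_node_associated(sample):
    print(f"node {node} is still associated with instance {instance}")
```

Feeding it the full conductor log (for example via `open(path)`) gives one pair per failed update, which makes it easy to count the affected nodes.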
After a BMO restart, we can see one additional line in the BMO logs:
{"level":"info","ts":1625140464.6687908,"logger":"controllers.BareMetalHost","msg":"Retrying registration","baremetalhost":"<>","provisioningState":"provisioned"}
{"level":"info","ts":1625140464.6688247,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"<>","provisioningState":"provisioned","credentials":{"credentials":{"name":"<>","namespace":"<>"},"credentialsVersion":"107911"}}
{"level":"info","ts":1625140464.72969,"logger":"provisioner.ironic","msg":"updating option data","host":"<>","option":"instance_uuid","value":"e4e08984-73da-4910-a966-1e20f745781f","old_value":"deb4342f-5e18-43e9-b27c-24954da9d95e"}
{"level":"info","ts":1625140464.7298136,"logger":"provisioner.ironic","msg":"updating node settings in ironic","host":"<>"}
{"level":"info","ts":1625140465.0130422,"logger":"provisioner.ironic","msg":"could not update node settings in ironic, busy","host":"<>"}
{"level":"info","ts":1625140465.013073,"logger":"controllers.BareMetalHost","msg":"host not ready","baremetalhost":"<>","provisioningState":"provisioned","wait":10}
{"level":"info","ts":1625140465.0130858,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"<>","provisioningState":"provisioned","requeue":true,"after":10}
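So it looks like BMO is trying to change the instance_uuid from deb4342f-5e18-43e9-b27c-24954da9d95e to e4e08984-73da-4910-a966-1e20f745781f, and Ironic rejects the update because the node is already associated with an instance.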
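The mismatch shown in the BMO log can be spotted across many hosts with a short script. This is a sketch that assumes the JSON log format shown above (one JSON object per line, with `msg`, `option`, `value` and `old_value` fields); the function name `find_instance_uuid_changes` is hypothetical.

```python
import json


def find_instance_uuid_changes(log_lines):
    """Collect BMO log entries where instance_uuid is being changed.

    Assumes one JSON object per line, shaped like the "updating option data"
    entry quoted above. Non-JSON lines are skipped.
    """
    changes = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(entry, dict):
            continue
        if entry.get("msg") != "updating option data":
            continue
        if entry.get("option") != "instance_uuid":
            continue
        old, new = entry.get("old_value"), entry.get("value")
        if old != new:
            changes.append({"host": entry.get("host"), "old": old, "new": new})
    return changes


# Sample input: the relevant entry from the BMO log above.
sample_log = [
    '{"level":"info","ts":1625140464.72969,"logger":"provisioner.ironic",'
    '"msg":"updating option data","host":"<>","option":"instance_uuid",'
    '"value":"e4e08984-73da-4910-a966-1e20f745781f",'
    '"old_value":"deb4342f-5e18-43e9-b27c-24954da9d95e"}',
    "plain text line that is not JSON",
]

for change in find_instance_uuid_changes(sample_log):
    print(f'{change["host"]}: {change["old"]} -> {change["new"]}')
```

Each reported entry is a host for which BMO holds a different instance_uuid than the one stored in Ironic.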
This seems to be an error coming from the pivoting flow. Closing this issue for now. I might open another one regarding the instance_uuid handling in case of a mismatch.
/close
@maelk: Closing this issue.
In some rare cases, the registration of provisioned BMHs in Ironic fails with the following error message:
The problem is difficult to reproduce, but we have already hit it multiple times after Ironic comes up while BMHs are already provisioned (for example during a pivoting operation). When it happens, it affects many BMHs but not all of them (38 out of 49 the last time we hit it). The nodes are all in the active state:
A node that does not have the problem in Ironic:
and the Ironic node:
The main difference seems to be in the root device hints, and whether or not they were applied properly.