metal3-io / ironic-image

Container image to run OpenStack Ironic as part of Metal³
Apache License 2.0
56 stars 116 forks source link

Issue with provisioning ILO6(HPE) Server using Ironic v24.1.1 #528

Open kmg1905 opened 1 month ago

kmg1905 commented 1 month ago

I'm trying to provision ILO6(HPE) servers using ironic/BMO deployment but the servers seems to be stuck in inspecting phase. From BMH: BMH1 - Normal InspectionError 5m12s metal3-baremetal-controller timeout reached while inspecting the node. BMH2 - Failed to inspect hardware. Reason: unable to start inspection: Failed to set node power state to power on. From https://docs.openstack.org/ironic/latest/admin/drivers/ilo.html and looks like redfish driver is already present in the ironic-config.

From ironic.conf.j2

[DEFAULT]
auth_strategy = noauth
debug = true
default_deploy_interface = direct
default_inspect_interface = {% if env.USE_IRONIC_INSPECTOR == "true" %}inspector{% else %}agent{% endif %}
default_network_interface = noop
enabled_bios_interfaces = no-bios,redfish,idrac-redfish,irmc,ilo
enabled_boot_interfaces = ipxe,ilo-ipxe,pxe,ilo-pxe,fake,redfish-virtual-media,idrac-redfish-virtual-media,ilo-virtual-media
enabled_deploy_interfaces = direct,fake,ramdisk,custom-agent
# NOTE(dtantsur): when changing this, make sure to update the driver
# dependencies in Dockerfile.
enabled_hardware_types = ipmi,idrac,irmc,fake-hardware,redfish,manual-management,ilo,ilo5
enabled_inspect_interfaces = {% if env.USE_IRONIC_INSPECTOR == "true" %}inspector{% else %}agent{% endif %},irmc,fake,redfish,ilo
enabled_management_interfaces = ipmitool,irmc,fake,redfish,idrac-redfish,ilo,ilo5,noop
enabled_power_interfaces = ipmitool,irmc,fake,redfish,idrac-redfish,ilo
enabled_raid_interfaces = no-raid,irmc,agent,fake,redfish,idrac-redfish,ilo5
enabled_vendor_interfaces = no-vendor,ipmitool,idrac-redfish,redfish,ilo,fake
enabled_firmware_interfaces = no-firmware,fake,redfish

Note that I did not notice any issue using the same deployment in ILO5(HPE) Servers.

Did I miss anything?

metal3-io-bot commented 1 month ago

This issue is currently awaiting triage. If Metal3.io contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
elfosardo commented 1 month ago

@kmg1905 thanks for reporting this, please be aware that ilo6 is not officially supported, and it's quite different from ilo5 both are redfish based, but while ilo5 requires a dedicated driver, ilo6 should work with the normal redfish driver, which makes me wonder if we're starting to see a divergence in standard redfish implementation as happened for ilo5 it would be interesting to get the BMO and ironic logs if possible to see what's happening, plus the BMH configuration used to deploy the node

Rozzii commented 1 month ago

/triage needs-information

kmg1905 commented 1 month ago

BMO logs

{"level":"info","ts":1720626563.805038,"logger":"provisioner.ironic","msg":"current provision state","host":"test-system~yz-ab-bm02","lastError":"","current":"inspect wait","target":"manageable"}
{"level":"info","ts":1720626563.805154,"logger":"controllers.BareMetalHost","msg":"using hardware profile","baremetalhost":{"name":"yz-ab-bm02","namespace":"test-system"},"provisioningState":"inspecting","profile":"unknown"}
{"level":"info","ts":1720626563.8053367,"logger":"controllers.BareMetalHost","msg":"verified access to the BMC","baremetalhost":{"name":"yz-ab-bm02","namespace":"test-system"},"provisioningState":"inspecting"}
{"level":"info","ts":1720626563.8053594,"logger":"controllers.BareMetalHost","msg":"inspecting hardware","baremetalhost":{"name":"yz-ab-bm02","namespace":"test-system"},"provisioningState":"inspecting"}
{"level":"info","ts":1720626563.8053648,"logger":"controllers.BareMetalHost","msg":"inspecting hardware","baremetalhost":{"name":"yz-ab-bm02","namespace":"test-system"},"provisioningState":"inspecting"}
{"level":"info","ts":1720626563.8053703,"logger":"provisioner.ironic","msg":"inspecting hardware","host":"test-system~yz-ab-bm02"}
{"level":"info","ts":1720626563.8118093,"logger":"provisioner.ironic","msg":"inspection in progress","host":"test-system~yz-ab-bm02"}
{"level":"info","ts":1720626563.8118536,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":{"name":"yz-ab-bm02","namespace":"test-system"},"provisioningState":"inspecting","requeue":true,"after":15}

The above logs are repeated continuously until inspection is timed out.

kmg1905 commented 1 month ago

BMH object

{
    "apiVersion": "metal3.io/v1alpha1",
    "kind": "BareMetalHost",
    "metadata": {
        "creationTimestamp": "2024-07-09T22:37:48Z",
        "finalizers": [
            "baremetalhost.metal3.io"
        ],
        "generation": 1,
        "name": "yz-ab-bm02",
        "namespace": "test-system",
        "ownerReferences": [
            {
                "apiVersion": "system.private.xxx.xxxx/v1alpha1",
                "blockOwnerDeletion": true,
                "controller": true,
                "kind": "Server",
                "name": "yz-ab-bm02",
                "uid": "88458b0f-8b8f-47f2-96fc-f5a4bc1b01b2"
            }
        ],
        "resourceVersion": "5499648",
        "uid": "481a8af3-455f-4c24-970a-9b032529d2a4"
    },
    "spec": {
        "automatedCleaningMode": "metadata",
        "bmc": {
            "address": "redfish://10.128.136.67/redfish/v1/Systems/1",
            "credentialsName": "bmc-credentials-yz-ab-bm02",
            "disableCertificateVerification": true
        },
        "bootMACAddress": "04:32:01:5f:c6:34",
        "bootMode": "UEFISecureBoot",
        "networkData": {
            "name": "yz-ab-bm02-networkdata",
            "namespace": "test-system"
        },
        "online": true,
        "userData": {
            "name": "yz-ab-bm02-userdata",
            "namespace": "test-system"
        }
    },
    "status": {
        "errorCount": 10,
        "errorMessage": "timeout reached while inspecting the node",
        "errorType": "inspection error",
        "goodCredentials": {
            "credentials": {
                "name": "bmc-credentials-yz-ab-bm02",
                "namespace": "test-system"
            },
            "credentialsVersion": "5241084"
        },
        "hardwareProfile": "unknown",
        "lastUpdated": "2024-07-10T12:50:29Z",
        "operationHistory": {
            "deprovision": {
                "end": null,
                "start": null
            },
            "inspect": {
                "end": null,
                "start": "2024-07-09T22:37:59Z"
            },
            "provision": {
                "end": null,
                "start": null
            },
            "register": {
                "end": "2024-07-09T22:37:59Z",
                "start": "2024-07-09T22:37:48Z"
            }
        },
        "operationalStatus": "error",
        "poweredOn": false,
        "provisioning": {
            "ID": "0791a34f-4b19-49a6-b500-b42f4800265d",
            "bootMode": "UEFISecureBoot",
            "image": {
                "url": ""
            },
            "rootDeviceHints": {
                "deviceName": "/dev/sda"
            },
            "state": "inspecting"
        },
        "triedCredentials": {
            "credentials": {
                "name": "bmc-credentials-yz-ab-bm02",
                "namespace": "test-system"
            },
            "credentialsVersion": "5241084"
        }
    }
}
kmg1905 commented 1 month ago

Ironic logs

2024-07-10 15:42:47.205 1 INFO ironic.conductor.utils [None req-60c3b4b1-30a2-40d5-ad91-be7bc1e650d3 - - - - - -] Detected vendor HPE for node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a
2024-07-10 15:42:47.944 1 INFO ironic.conductor.utils [None req-60c3b4b1-30a2-40d5-ad91-be7bc1e650d3 - - - - - -] Updated boot_mode uefi, secure_boot Falsefor node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a
2024-07-10 15:42:47.944 1 DEBUG ironic.conductor.utils [None req-60c3b4b1-30a2-40d5-ad91-be7bc1e650d3 - - - - - -] Getting BIOS info for node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a node_cache_bios_settings /usr/lib/python3.9/site-packages/ironic/conductor/utils.py:1440
2024-07-10 15:42:48.534 1 DEBUG ironic.drivers.modules.redfish.bios [None req-60c3b4b1-30a2-40d5-ad91-be7bc1e650d3 - - - - - -] Cache BIOS settings for node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a cache_bios_settings /usr/lib/python3.9/site-packages/ironic/drivers/modules/redfish/bios.py:127
2024-07-10 15:42:48.563 1 DEBUG ironic.conductor.utils [None req-60c3b4b1-30a2-40d5-ad91-be7bc1e650d3 - - - - - -] Getting Firmware Components for node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a node_cache_firmware_components /usr/lib/python3.9/site-packages/ironic/conductor/utils.py:1829
2024-07-10 15:42:49.098 1 INFO ironic.conductor.task_manager [None req-60c3b4b1-30a2-40d5-ad91-be7bc1e650d3 - - - - - -] Node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a moved to provision state "manageable" from state "verifying"; target provision state is "None"
2024-07-10 15:42:49.099 1 INFO ironic.conductor.verify [None req-60c3b4b1-30a2-40d5-ad91-be7bc1e650d3 - - - - - -] Successfully verified node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a
2024-07-10 15:42:49.101 1 DEBUG ironic.conductor.task_manager [None req-60c3b4b1-30a2-40d5-ad91-be7bc1e650d3 - - - - - -] Successfully released exclusive lock for caching boot_mode or secure_boot state on node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a (lock was held 2.49 sec) release_resources /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:448
2024-07-10 15:42:56.646 1 INFO eventlet.wsgi.server [None req-3c8ca90a-745f-4aa4-990e-a198719304de - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3284 time: 0.0052984
2024-07-10 15:42:56.655 1 INFO eventlet.wsgi.server [None req-12fca5d8-45a2-41b6-997d-8e920feed68b - - - - - -] ::ffff:172.19.0.1 "GET /v1/ports?fields=node_uuid&node_uuid=33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 584 time: 0.0068839
2024-07-10 15:42:56.673 1 DEBUG ironic.conductor.manager [None req-8fcf3a77-72c6-42a2-8484-8f244d05eba3 - - - - - -] RPC update_node called for node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a. update_node /usr/lib/python3.9/site-packages/ironic/conductor/manager.py:201
2024-07-10 15:42:56.676 1 DEBUG ironic.conductor.task_manager [None req-8fcf3a77-72c6-42a2-8484-8f244d05eba3 - - - - - -] Attempting to get exclusive lock on node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a (for node update) __init__ /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:235
2024-07-10 15:42:56.681 1 DEBUG ironic.conductor.task_manager [None req-8fcf3a77-72c6-42a2-8484-8f244d05eba3 - - - - - -] Node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a successfully reserved for node update (took 0.00 seconds) reserve_node /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:352
2024-07-10 15:42:56.689 1 DEBUG ironic.conductor.task_manager [None req-8fcf3a77-72c6-42a2-8484-8f244d05eba3 - - - - - -] Successfully released exclusive lock for node update on node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a (lock was held 0.01 sec) release_resources /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:448
2024-07-10 15:42:56.690 1 INFO eventlet.wsgi.server [None req-8fcf3a77-72c6-42a2-8484-8f244d05eba3 - - - - - -] ::ffff:172.19.0.1 "PATCH /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3462 time: 0.0341187
2024-07-10 15:42:56.722 1 INFO eventlet.wsgi.server [None req-e5345ba0-f81f-4bd8-bcb0-74ea31889c0d - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3450 time: 0.0045698
2024-07-10 15:42:56.730 1 INFO eventlet.wsgi.server [None req-a824d246-9b2c-4cdf-86b0-a229249f6d0c - - - - - -] ::ffff:172.19.0.1 "GET /v1/ports?fields=node_uuid&node_uuid=33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 584 time: 0.0066807
2024-07-10 15:42:56.787 1 INFO eventlet.wsgi.server [None req-574ab850-9bf4-470b-bafa-c0b85972395c - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3450 time: 0.0375009
2024-07-10 15:42:56.874 1 INFO eventlet.wsgi.server [None req-d7383c0b-8fab-4cdb-9025-c05c26483402 - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a/bios?detail=true HTTP/1.1" status: 200  len: 163033 time: 0.0799267
2024-07-10 15:42:56.876 1 INFO eventlet.wsgi.server [None req-be04b874-22b6-49eb-8e0d-5ced5ea443b5 - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3450 time: 0.0715086
2024-07-10 15:42:56.889 1 INFO eventlet.wsgi.server [None req-ed490a7b-1ce0-4ae6-abbd-90dc0036b7fb - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a/firmware HTTP/1.1" status: 200  len: 719 time: 0.0119681
2024-07-10 15:42:56.918 1 INFO eventlet.wsgi.server [None req-c67da639-d41b-4f99-a9c9-e2d8cdaab40b - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3450 time: 0.0044971
2024-07-10 15:42:56.972 1 INFO eventlet.wsgi.server [None req-4a03d6a6-34c0-4919-a28f-5a757cd0abd1 - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3450 time: 0.0536215
2024-07-10 15:42:56.986 1 INFO eventlet.wsgi.server [None req-6b1220df-90f2-4721-81a5-87a2f253c1c0 - - - - - -] ::ffff:172.19.0.1 "GET /v1/ports?fields=node_uuid&node_uuid=33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 584 time: 0.0136805
2024-07-10 15:42:56.999 1 INFO eventlet.wsgi.server [None req-171dbaee-960a-4c06-b4e6-ed660bfa4a3d - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a/firmware HTTP/1.1" status: 200  len: 719 time: 0.0181811
2024-07-10 15:42:57.001 1 INFO eventlet.wsgi.server [None req-ea6000c3-bd97-4e9d-8088-b4ccfd917044 - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3450 time: 0.0066490
2024-07-10 15:42:57.082 1 INFO eventlet.wsgi.server [None req-4a2f2658-9144-438c-b70f-a5401b1057f7 - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a/inventory HTTP/1.1" status: 404  len: 454 time: 0.0645373
2024-07-10 15:42:57.089 1 DEBUG ironic.conductor.manager [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] RPC inspect_hardware called for node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a inspect_hardware /usr/lib/python3.9/site-packages/ironic/conductor/manager.py:3041
2024-07-10 15:42:57.098 1 DEBUG ironic.conductor.task_manager [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Attempting to get exclusive lock on node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a (for hardware inspection) __init__ /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:235
2024-07-10 15:42:57.100 1 INFO eventlet.wsgi.server [None req-9d0a9577-a04f-43ac-8357-ddf1f2234e87 - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3450 time: 0.0094213
2024-07-10 15:42:57.118 1 INFO eventlet.wsgi.server [None req-10de2c6f-5a37-44ac-a795-7740a59501a0 - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a/firmware HTTP/1.1" status: 200  len: 719 time: 0.0144954
2024-07-10 15:42:57.175 1 DEBUG ironic.conductor.task_manager [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a successfully reserved for hardware inspection (took 0.08 seconds) reserve_node /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:352
2024-07-10 15:42:57.187 1 INFO ironic.conductor.task_manager [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a moved to provision state "inspecting" from state "manageable"; target provision state is "manageable"
2024-07-10 15:42:57.188 1 INFO eventlet.wsgi.server [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] ::ffff:172.19.0.1 "PUT /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a/states/provision HTTP/1.1" status: 202  len: 366 time: 0.1031435
<200 Srary>
2024-07-10 15:43:04.636 1 INFO ironic.drivers.modules.inspect_utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Port already exists for MAC address 04:32:01:5f:c6:34 for node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a
2024-07-10 15:43:04.641 1 INFO ironic.drivers.modules.inspect_utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Port created for MAC address 88:e9:a4:59:48:3c for node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a
2024-07-10 15:43:04.645 1 INFO ironic.drivers.modules.inspect_utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Port created for MAC address 88:e9:a4:59:48:3d for node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a
2024-07-10 15:43:04.648 1 DEBUG ironic.drivers.modules.inspect_utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Will use the following BMC addresses for inspection lookup of node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a: {'10.128.136.67'} cache_lookup_addresses /usr/lib/python3.9/site-packages/ironic/drivers/modules/inspect_utils.py:448
2024-07-10 15:43:04.654 1 DEBUG ironic.drivers.modules.inspector.interface [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Starting inspection for node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a using ironic-inspector, booting is managed by ironic inspect_hardware /usr/lib/python3.9/site-packages/ironic/drivers/modules/inspector/interface.py:217
2024-07-10 15:43:04.892 1 INFO ironic.conductor.utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a current power state is 'power on', requested state is 'power off'.
2024-07-10 15:43:06.570 1 INFO eventlet.wsgi.server [None req-8a8182ec-d3e4-490c-87d9-26f05a4218e6 - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3636 time: 0.0056274
2024-07-10 15:43:06.595 1 INFO eventlet.wsgi.server [None req-c46bf02b-5907-47de-8e82-177129c1c4b6 - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a/bios?detail=true HTTP/1.1" status: 200  len: 163033 time: 0.0230024
2024-07-10 15:43:07.233 1 INFO eventlet.wsgi.server [None req-44adb092-850b-41b9-86a7-c31209b61478 - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3636 time: 0.0053651
2024-07-10 15:43:07.242 1 INFO eventlet.wsgi.server [None req-b1d7ce4c-7aa8-4454-a0d7-953849be3954 - - - - - -] ::ffff:172.19.0.1 "GET /v1/ports?fields=node_uuid&node_uuid=33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 1126 time: 0.0072458
2024-07-10 15:43:07.249 1 INFO eventlet.wsgi.server [None req-87229175-578d-4834-b1c9-e008fa657320 - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3636 time: 0.0046976
2024-07-10 15:43:07.332 1 INFO ironic.conductor.utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Successfully set node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a power state to power off by power off.
2024-07-10 15:43:07.334 1 DEBUG ironic.drivers.modules.boot_mode_utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Deploy boot mode is uefi for 33054d8d-8ba6-41f5-96d9-3faebdb07b6a. get_boot_mode_for_deploy /usr/lib/python3.9/site-packages/ironic/drivers/modules/boot_mode_utils.py:250
2024-07-10 15:43:07.334 1 DEBUG ironic.drivers.modules.boot_mode_utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Deploy boot mode is uefi for 33054d8d-8ba6-41f5-96d9-3faebdb07b6a. get_boot_mode_for_deploy /usr/lib/python3.9/site-packages/ironic/drivers/modules/boot_mode_utils.py:250
2024-07-10 15:43:07.334 1 WARNING ironic.common.pxe_utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] IPv6 is enabled and the DHCP driver appears set to a plugin aside from "neutron". Node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a may not receive proper DHCPv6 provided boot parameters.
2024-07-10 15:43:07.563 1 DEBUG ironic.drivers.modules.boot_mode_utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Deploy boot mode is uefi for 33054d8d-8ba6-41f5-96d9-3faebdb07b6a. get_boot_mode_for_deploy /usr/lib/python3.9/site-packages/ironic/drivers/modules/boot_mode_utils.py:250
2024-07-10 15:43:07.563 1 DEBUG ironic.common.pxe_utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Building PXE config for node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a create_pxe_config /usr/lib/python3.9/site-packages/ironic/common/pxe_utils.py:328
2024-07-10 15:43:07.563 1 DEBUG ironic.drivers.modules.boot_mode_utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Deploy boot mode is uefi for 33054d8d-8ba6-41f5-96d9-3faebdb07b6a. get_boot_mode_for_deploy /usr/lib/python3.9/site-packages/ironic/drivers/modules/boot_mode_utils.py:250
2024-07-10 15:43:07.642 1 INFO eventlet.wsgi.server [None req-147ba859-d6c1-4cd5-b246-0c61cae62eaa - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3630 time: 0.0049291
2024-07-10 15:43:07.650 1 INFO eventlet.wsgi.server [None req-9bc91057-c057-483f-a313-797770cc3d9a - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a/firmware HTTP/1.1" status: 200  len: 719 time: 0.0068102
2024-07-10 15:43:08.461 1 DEBUG ironic.common.pxe_utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Fetching necessary kernel and ramdisk for node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a cache_ramdisk_kernel /usr/lib/python3.9/site-packages/ironic/common/pxe_utils.py:1317
2024-07-10 15:43:08.472 1 DEBUG ironic.drivers.modules.pxe_base [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Ramdisk (i)PXE boot for node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a has been prepared with kernel params {'deployment_aki_path': 'http://10.128.152.130:6180/33054d8d-8ba6-41f5-96d9-3faebdb07b6a/deploy_kernel', 'deployment_ari_path': 'http://10.128.152.130:6180/33054d8d-8ba6-41f5-96d9-3faebdb07b6a/deploy_ramdisk', 'initrd_filename': 'deploy_ramdisk', 'aki_path': 'no_kernel', 'ari_path': 'no_ramdisk', 'pxe_append_params': 'nofb nomodeset vga=normal ipa-insecure=1 sshkey="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDpywCBE2JPEbLfFRDYdKPmWkqonAnD37PtpGxurFWMN/Pp5xlo453EATCnRsw5GJzllqEdrcSqPoOxfgxYS12oGIU1/kjgdKJJV5rWXTYR0LNb7vwaWljVF1PMObxfNfRZvK9k+DNASsZKHyWhShDv7yAjYQxq7KEvRxRJ4DMSkDxdbCEb7e4pCJfBHlGG9HbppXVbi/bsq91fflz2S0tD4eJumW1LMlSmrB8cm5sLz+szu8U5Nu4ey3EpbcHN/o/QcopMf7gcuPiPdKHvKPkasYjgzquCCvlHcFkDMhR0LL5GfOHgDDBuv0Ugf+86MGT3as702FQmHB7MJa7tqWwB root@bootstrapper"  systemd.journald.forward_to_console=yes ipa-debug=1 ipa-inspection-collectors=default,logs ipa-enable-vlan-interfaces=all ipa-inspection-dhcp-all-interfaces=1 ipa-collect-lldp=1 ipa-inspection-callback-url=https://10.128.152.130:6385/v1/continue_inspection ipa-api-url=https://10.128.152.130:6385 ipa-global-request-id=req-64f3ce2d-3b2d-49dc-8513-8882a8036404', 'tftp_server': '10.128.152.130', 'ipxe_timeout': 0, 'ipa-api-url': 'https://10.128.152.130:6385'} prepare_ramdisk /usr/lib/python3.9/site-packages/ironic/drivers/modules/pxe_base.py:230
2024-07-10 15:43:08.706 1 INFO ironic.conductor.utils [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a current power state is 'power off', requested state is 'power on'.
2024-07-10 15:43:38.965 1 INFO ironic.conductor.task_manager [None req-64f3ce2d-3b2d-49dc-8513-8882a8036404 - - - - - -] Node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a moved to provision state "inspect wait" from state "inspecting"; target provision state is "manageable"
2024-07-10 15:47:38.376 1 INFO eventlet.wsgi.server [None req-44f61d61-0806-404a-90ee-e3fdfba8fd5c - - - - - -] ::ffff:172.19.0.1 "GET /v1/nodes/33054d8d-8ba6-41f5-96d9-3faebdb07b6a HTTP/1.1" status: 200  len: 3619 time: 0.0049572
2024-07-10 15:47:44.162 1 DEBUG ironic.conductor.task_manager [None req-5ab4fe6c-3ae6-4f31-afcf-a335d02c6606 - - - - - -] Attempting to get shared lock on node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a (for power state sync) __init__ /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:235
2024-07-10 15:47:45.116 1 DEBUG ironic.conductor.task_manager [None req-5ab4fe6c-3ae6-4f31-afcf-a335d02c6606 - - - - - -] Successfully released shared lock for power state sync on node 33054d8d-8ba6-41f5-96d9-3faebdb07b6a (lock was held 0.95 sec) release_resources /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:448

Same with Ironic as well. Above logs are repeated until inspection times out.

kmg1905 commented 1 month ago

From https://docs.openstack.org/ironic/latest/admin/drivers/redfish.html, configuration of redfish driver is

[DEFAULT]
...
enabled_hardware_types = ipmi,redfish
enabled_boot_interfaces = ipxe,redfish-virtual-media,redfish-https
enabled_power_interfaces = ipmitool,redfish
enabled_management_interfaces = ipmitool,redfish
enabled_inspect_interfaces = inspector,redfish

Since redfish requires inspector in the Ironic Config and we removed support for inspector see - #483. Could this be the issue?

Note that Server provisioning works for ILO6 Servers using Ironic v1.1.1

elfosardo commented 1 month ago

@kmg1905 I'm not sure what Ironic v1.1.1 is. I highly doubt #483 is the cause since that was removed starting from ironic-image 25.0.0 and your issue is happening with 24.1.1, also we removed the ironic-inspector support, but we haven't removed inspection, it just happens in a different way. From what I can see there are no errors in the ironic logs, nor in the BMO, so I suspect something goes wrong during the inspection process on the IPA ramdisk. The message about the "power on" can indicate an issue with the communication with the BMC, or even a bug in the ILO6 firmware. I suggest to try to get the ramdisk or console logs to see what's happening on the host during inspection.

elfosardo commented 1 month ago

/triage needs-information

kmg1905 commented 1 month ago

@elfosardo my bad! I was referring to Ironic capm3-v1.1.1 I used this version to bootstrap 25 Gen11 Servers and the bootstrap was successful on all of them. Then I upgraded ironic to v24.1.1 and all the servers got stuck in inspecting. Hence, I thought the issue might be with Ironic. From v24.1.1 we use agent instead of inspector in Ironic config #479 - disabled Inspector by default. Curious if we change USE_IRONIC_INSPECTOR:-true fixes the issue since https://docs.openstack.org/ironic/latest/admin/drivers/redfish.html recommends using inspector in the ironic-config. Regarding console logs I'll try to capture them and attach it to this issue, thanks!

elfosardo commented 1 month ago

@kmg1905 more clear now, thanks I don't think capm3 v1.1.1 is still supported though, and you may be experiencing some conflicts between components. I strongly suggest moving to a more recent version according to https://book.metal3.io/version_support

kmg1905 commented 1 month ago

@elfosardo yes, I have upgraded our Ironic stack based on recommendation from here - https://book.metal3.io/version_support I had this issue (Servers stuck in inspecting) only in ILO6 Servers (ILO5 provisioning passed) after I upgraded the Ironic stack. As a workaround, I created a local patch which is revert of #479 and then tested provisioning on ILO6 servers. After this patch, provisioning passed on all 24 ILO6 servers. Hence, enabling inspector worked for ILO5 and ILO6 servers.

elfosardo commented 1 month ago

@kmg1905 in https://docs.openstack.org/ironic/latest/admin/drivers/redfish.html we don't recommend using inspector in the ironic-config, that is just an example of configuration, also please consider that the redfish and inspector interfaces are mutually exclusive as you can read here https://docs.openstack.org/ironic/latest/admin/drivers/redfish.html#out-of-band-inspection, redfish being out-of-band, while inspector being in-band. What you're doing re-enabling inspector is just using a different inspection method, which is not supported anymore by the way, and the fact that it works for you it means that something not right in what you have in your deployment in terms of containers, since we're not installing inspector anymore and its functionality has been integrated directly into ironic with the new "agent" interface which is now the default sinece stable release 25, as you can see from here https://github.com/metal3-io/ironic-image/commit/4fe398f0e0e256564c626c64833a64bcb41b9c16