ironcore-dev / metal-operator

Kubernetes operator for automating bare metal server discovery and provisioning
Apache License 2.0

Dell IDRAC9: unreliable boot override #80

Closed: defo89 closed this issue 1 week ago

defo89 commented 1 month ago

Describe the bug: Despite the one-time boot parameters being set via Redfish, the server still boots from the hard drive.

2024-07-12T07:50:33.409624088Z 2024-07-12T07:50:33Z DEBUG   Booted Server in PXE    {"controller": "server", "controllerGroup": "metal.ironcore.dev", "controllerKind": "Server", "Server": {"name":"compute-0-bmc-node003-ap052"}, "namespace": "", "name": "compute-0-bmc-node003-ap052", "reconcileID": "5f7131f8-89a0-4000-949a-35201cb7b18c"}
2024-07-12T07:50:33.419884725Z 2024-07-12T07:50:33Z DEBUG   Extracted Server details    {"controller": "server", "controllerGroup": "metal.ironcore.dev", "controllerKind": "Server", "Server": {"name":"compute-0-bmc-node003-ap052"}, "namespace": "", "name": "compute-0-bmc-node003-ap052", "reconcileID": "5f7131f8-89a0-4000-949a-35201cb7b18c"}
2024-07-12T07:50:33.420478758Z 2024-07-12T07:50:33Z DEBUG   Reconciling BMC {"controller": "bmc", "controllerGroup": "metal.ironcore.dev", "controllerKind": "BMC", "BMC": {"name":"bmc-node003-ap052"}, "namespace": "", "name": "bmc-node003-ap052", "reconcileID": "2f25026e-723f-4b56-9a9f-a25de0282b00"}
2024-07-12T07:50:33.420499572Z 2024-07-12T07:50:33Z DEBUG   Got Endpoints for BMC   {"controller": "bmc", "controllerGroup": "metal.ironcore.dev", "controllerKind": "BMC", "BMC": {"name":"bmc-node003-ap052"}, "namespace": "", "name": "bmc-node003-ap052", "reconcileID": "2f25026e-723f-4b56-9a9f-a25de0282b00", "Endpoints": "node003-ap052"}
2024-07-12T07:50:33.440756258Z 2024-07-12T07:50:33Z DEBUG   Updated Server power state  {"controller": "server", "controllerGroup": "metal.ironcore.dev", "controllerKind": "Server", "Server": {"name":"compute-0-bmc-node003-ap052"}, "namespace": "", "name": "compute-0-bmc-node003-ap052", "reconcileID": "5f7131f8-89a0-4000-949a-35201cb7b18c", "PowerState": "Off"}

› curl -k https://<bmc>/redfish/v1/Systems/System.Embedded.1 -u support | jq .Boot
{
  "BootOptions": {
    "@odata.id": "/redfish/v1/Systems/System.Embedded.1/BootOptions"
  },
  "Certificates": {
    "@odata.id": "/redfish/v1/Systems/System.Embedded.1/Boot/Certificates"
  },
  "BootOrder": [
    "Boot0000",
    "Boot0005",
    "Boot0002",
    "Boot0001",
    "Boot0003",
    "Boot0004"
  ],
  "BootOrder@odata.count": 6,
  "BootSourceOverrideEnabled": "Once",       <<<<
  "BootSourceOverrideMode": "UEFI",            <<<<
  "BootSourceOverrideTarget": "Pxe",             <<<<
  "UefiTargetBootSourceOverride": null,
  "BootSourceOverrideTarget@Redfish.AllowableValues": [
    "None",
    "Pxe",
    "Floppy",
    "Cd",
    "Hdd",
    "BiosSetup",
    "Utilities",
    "UefiTarget",
    "SDCard",
    "UefiHttp"
  ],
  "StopBootOnFault": "Never"
}
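
For reference, the override fields marked above correspond to a PATCH body on the ComputerSystem resource. The following is a minimal Python sketch of how such a body could be built and how the reported state could be checked. The function names `build_boot_override_patch` and `override_is_armed` are hypothetical and not part of metal-operator. Note that the check only confirms what the BMC reports; as this issue shows, a reported "Once"/"Pxe" state does not guarantee that the server actually boots from PXE.

```python
import json


def build_boot_override_patch(target="Pxe", enabled="Once", mode="UEFI", allowable=None):
    """Build the JSON body for a Redfish PATCH that sets a boot override.

    `allowable` is the optional list taken from
    BootSourceOverrideTarget@Redfish.AllowableValues.
    """
    if allowable is not None and target not in allowable:
        raise ValueError(f"target {target!r} is not in the allowable values")
    return {
        "Boot": {
            "BootSourceOverrideEnabled": enabled,
            "BootSourceOverrideTarget": target,
            "BootSourceOverrideMode": mode,
        }
    }


def override_is_armed(boot, target="Pxe"):
    """Return True if the reported Boot object shows a one-time override for `target`."""
    return (
        boot.get("BootSourceOverrideEnabled") == "Once"
        and boot.get("BootSourceOverrideTarget") == target
    )


if __name__ == "__main__":
    # Boot object values as reported in this issue (trimmed).
    reported = {
        "BootSourceOverrideEnabled": "Once",
        "BootSourceOverrideMode": "UEFI",
        "BootSourceOverrideTarget": "Pxe",
    }
    print(json.dumps(build_boot_override_patch(allowable=["None", "Pxe", "Hdd"]), indent=2))
    print("override armed:", override_is_armed(reported))
```

The body would be sent with an HTTP PATCH to the same `/redfish/v1/Systems/System.Embedded.1` resource queried above.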

Expected behavior: Investigate options for reliably overriding the one-time boot parameters on Dell servers.

Additional context: Encountered this on two available lab servers:

Model: PowerEdge R660
BIOS Version: 1.6.6
iDRAC Firmware Version: 7.10.30.05

afritzler commented 1 month ago

I guess we should start separating the BMC implementation into a Lenovo and a Dell version to reflect those different behaviors, similar to what we are doing for redfish_local (https://github.com/ironcore-dev/metal-operator/tree/main/bmc).
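
A rough illustration of that idea, written in Python for brevity rather than in the operator's own language: vendor-specific classes share a generic Redfish base and are selected by the manufacturer string the system reports. All class and function names here (`GenericBMC`, `DellBMC`, `LenovoBMC`, `bmc_for`) are hypothetical, and the Dell override is only a placeholder. It does not describe the actual fix.

```python
class GenericBMC:
    """Default Redfish behavior shared by all vendors."""

    vendor = "generic"

    def pxe_boot_once_payload(self):
        # Standard Redfish one-time PXE boot override body.
        return {
            "Boot": {
                "BootSourceOverrideEnabled": "Once",
                "BootSourceOverrideTarget": "Pxe",
            }
        }


class DellBMC(GenericBMC):
    """Hook for iDRAC-specific behavior."""

    vendor = "dell"

    def pxe_boot_once_payload(self):
        payload = super().pxe_boot_once_payload()
        # Placeholder: Dell-specific adjustments would be applied here.
        # None are assumed in this sketch.
        return payload


class LenovoBMC(GenericBMC):
    """Hook for Lenovo-specific behavior."""

    vendor = "lenovo"


# Ordered (substring, class) pairs matched against the Manufacturer field.
_BY_MANUFACTURER = (("dell", DellBMC), ("lenovo", LenovoBMC))


def bmc_for(manufacturer):
    """Pick an implementation from the reported manufacturer string."""
    name = (manufacturer or "").lower()
    for key, cls in _BY_MANUFACTURER:
        if key in name:
            return cls()
    return GenericBMC()


if __name__ == "__main__":
    for reported in ("Dell Inc.", "Lenovo", "Unknown Vendor"):
        print(reported, "->", bmc_for(reported).vendor)
```

Keeping the generic class as the fallback means unknown vendors still get the standard Redfish behavior, while each vendor class overrides only the calls where its firmware deviates.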

stefanhipfel commented 1 month ago

Added it to our internal GCS Compute DELL board.

defo89 commented 1 week ago

Should be fixed by https://github.com/ironcore-dev/metal-operator/pull/107