harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

[BUG] Out-of-Band Access on Host (harvester-seeder addon) fails to retrieve updated SSL cert #4629

Open irishgordo opened 9 months ago

irishgordo commented 9 months ago

Describe the bug This seems to occur when the underlying bare-metal infrastructure has an expired SSL certificate. The user enables the harvester-seeder addon and configures out-of-band access on the host with credentials, which yields:

failed to open connection to BMC: 5 errors occurred:
* provider: gofish: Get "https://192.168.9.118/redfish/v1/": x509: certificate has expired or is not yet valid: current time 2023-10-17T21:26:10Z is after 2021-06-04T23:54:19Z
* provider: ipmitool: Error: Unable to establish IPMI v2 / RMCP+ session: exit status 1
* provider: *asrockrack.ASRockRack: Error logging in: Post "https://192.168.9.118/api/session": x509: certificate has expired or is not yet valid: current time 2023-10-17T21:27:02Z is after 2021-06-04T23:54:19Z
* provider: IntelAMT: Unable to perform digest auth with http://192.168.9.118:443/wsman: Post "http://192.168.9.118:443/wsman": EOF
* no Opener implementations found

The user then renews (rolls) the certificate on the bare-metal device (the node that provides Harvester), and disables and re-enables out-of-band access on the host. Although the certificate may no longer be expired on the bare-metal device, the Harvester host still reports it as expired. Adjusting the polling interval, or enabling or disabling out-of-band access on the host, seems to have no effect.
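One way to confirm which certificate the BMC is actually serving after the roll is to read it straight from the TLS handshake and look at its validity window. The following is a minimal Go sketch under stated assumptions, not part of Harvester or seeder: the helper names (`servedCert`, `expiredAt`, `localTLSAddr`) are hypothetical, and a local `httptest` TLS server stands in for the BMC so the example runs without hardware.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// servedCert dials addr ("host:port") and returns the leaf certificate the
// server presents. Verification is skipped on purpose: the goal here is to
// read the certificate, not to trust it.
func servedCert(addr string) (*x509.Certificate, error) {
	conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return nil, fmt.Errorf("no certificate presented by %s", addr)
	}
	return certs[0], nil
}

// expiredAt reports whether cert is outside its validity window at time t.
func expiredAt(cert *x509.Certificate, t time.Time) bool {
	return t.Before(cert.NotBefore) || t.After(cert.NotAfter)
}

// localTLSAddr starts a throwaway local TLS server (an assumption standing in
// for the BMC) and returns its "host:port" plus a function to stop it.
func localTLSAddr() (addr string, stop func()) {
	srv := httptest.NewTLSServer(http.NotFoundHandler())
	return strings.TrimPrefix(srv.URL, "https://"), srv.Close
}

func main() {
	// Against real hardware, replace this with the BMC address,
	// for example "192.168.9.118:443".
	addr, stop := localTLSAddr()
	defer stop()

	cert, err := servedCert(addr)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("NotBefore:", cert.NotBefore)
	fmt.Println("NotAfter: ", cert.NotAfter)
	fmt.Println("outside validity window now:", expiredAt(cert, time.Now()))
}
```

If the `NotAfter` printed here still matches the date in the error message, the BMC is still serving the old certificate; if it is newer, the stale value is more likely on the Harvester side.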

To Reproduce Pre-Reqs:

Expected behavior There should be some way to "refresh" or discard any saved certificates for the server/host when out-of-band access is enabled or disabled on that host.

Support bundle supportbundle_09c663e9-9569-4362-9a3e-cc3a2d703f01_2023-10-17T21-37-08Z.zip

Environment

Additional context Screenshot from 2023-10-17 12-45-52 Screenshot from 2023-10-17 12-45-20 Screenshot from 2023-10-17 12-29-53 Screenshot from 2023-10-17 12-04-21 Screenshot from 2023-10-17 11-40-09

╭─mike at suse-workstation-team-harvester in ~/Programs/dell/iDRACTools/racadm/RHEL8/x86_64
╰─○ sudo racadm -r 192.168.9.118 -u root -p root set iDRAC.Webserver.HttpsRedirection Disabled
Security Alert: Certificate is invalid - EE certificate key too weak
Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.
[Key=iDRAC.Embedded.1#WebServer.1]                                           
Object value modified successfully

╭─mike at suse-workstation-team-harvester in ~/Programs/dell/iDRACTools/racadm/RHEL8/x86_64
╰─○ sudo racadm -r 192.168.9.118 -u root -p root sslresetcfg                                  
[sudo] password for mike: 
Security Alert: Certificate is invalid - EE certificate key too weak
Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.
Certificate regenerated successfully and webserver restarted        
irishgordo commented 9 months ago

Also, as a workaround I tried disabling out-of-band access on the host and disabling the harvester-seeder addon, then re-enabling the harvester-seeder addon and re-enabling out-of-band access on the host (reusing the already created secret), but I am still hitting the same issue:

failed to open connection to BMC: 5 errors occurred:
* provider: gofish: Get "https://192.168.9.118/redfish/v1/": x509: certificate has expired or is not yet valid: current time 2023-10-18T19:52:58Z is after 2021-06-04T23:54:19Z
* provider: ipmitool: Error: Unable to establish IPMI v2 / RMCP+ session: exit status 1
* provider: *asrockrack.ASRockRack: Error logging in: Post "https://192.168.9.118/api/session": x509: certificate has expired or is not yet valid: current time 2023-10-18T19:53:51Z is after 2021-06-04T23:54:19Z
* provider: IntelAMT: Unable to perform digest auth with http://192.168.9.118:443/wsman: Post "http://192.168.9.118:443/wsman": EOF
* no Opener implementations found
irishgordo commented 1 month ago

So to offer more context on this. :thread:

With Harvester v1.2.2, it is noticeable that IPMI-based alerts/events can come across on a Dell PowerEdge R720 with the settings configured correctly in iDRAC 7.

Though, to be noted:

Redfish fails entirely, even when the service is up and the "insecure TLS" box is selected on the front-end. Redfish uses port 443 by default with iDRAC, at least with iDRAC 7 as far as I can tell, though mileage might vary on that.

What seems to happen is indeed the same thing: there is an x509 error from the gofish library.

A plain GET to https://192.168.11.118/redfish/v1, however, does return data:

{
  "@odata.context": "/redfish/v1/$metadata#ServiceRoot.ServiceRoot",
  "@odata.id": "/redfish/v1",
  "@odata.type": "#ServiceRoot.v1_3_0.ServiceRoot",
  "AccountService": {
    "@odata.id": "/redfish/v1/Managers/iDRAC.Embedded.1/AccountService"
  },
  "Chassis": {
    "@odata.id": "/redfish/v1/Chassis"
  },
  "Description": "Root Service",
  "EventService": {
    "@odata.id": "/redfish/v1/EventService"
  },
  "Fabrics": {
    "@odata.id": "/redfish/v1/Fabrics"
  },
  "Id": "RootService",
  "JsonSchemas": {
    "@odata.id": "/redfish/v1/JSONSchemas"
  },
  "Links": {
    "Sessions": {
      "@odata.id": "/redfish/v1/Sessions"
    }
  },
  "Managers": {
    "@odata.id": "/redfish/v1/Managers"
  },
  "Name": "Root Service",
  "Oem": {
    "Dell": {
      "@odata.type": "#DellServiceRoot.v1_0_0.ServiceRootSummary",
      "IsBranded": 1,
      "ManagerMACAddress": "C8:1F:66:B7:B2:12",
      "ServiceTag": "6PGZDZ1"
    }
  },
  "Product": "Integrated Remote Access Controller",
  "ProtocolFeaturesSupported": {
    "ExpandQuery": {
      "ExpandAll": true,
      "Levels": true,
      "Links": true,
      "MaxLevels": 1,
      "NoLinks": true
    },
    "FilterQuery": true,
    "SelectQuery": true
  },
  "RedfishVersion": "1.4.0",
  "Registries": {
    "@odata.id": "/redfish/v1/Registries"
  },
  "SessionService": {
    "@odata.id": "/redfish/v1/SessionService"
  },
  "Systems": {
    "@odata.id": "/redfish/v1/Systems"
  },
  "Tasks": {
    "@odata.id": "/redfish/v1/TaskService"
  },
  "UpdateService": {
    "@odata.id": "/redfish/v1/UpdateService"
  }
}

Additionally, checking our event service with a GET out to https://192.168.11.118/redfish/v1/EventService yields:

{
    "@odata.context": "/redfish/v1/$metadata#EventService.EventService",
    "@odata.id": "/redfish/v1/EventService",
    "@odata.type": "#EventService.v1_0_6.EventService",
    "Actions": {
        "#EventService.SubmitTestEvent": {
            "EventType@Redfish.AllowableValues": [
                "StatusChange",
                "ResourceUpdated",
                "ResourceAdded",
                "ResourceRemoved",
                "Alert"
            ],
            "target": "/redfish/v1/EventService/Actions/EventService.SubmitTestEvent"
        }
    },
    "DeliveryRetryAttempts": 5,
    "DeliveryRetryIntervalSeconds": 30,
    "Description": "Event Service represents the properties for the service",
    "EventTypesForSubscription": [
        "StatusChange",
        "ResourceUpdated",
        "ResourceAdded",
        "ResourceRemoved",
        "Alert"
    ],
    "EventTypesForSubscription@odata.count": 5,
    "Id": "EventService",
    "Name": "Event Service",
    "ServiceEnabled": true,
    "Status": {
        "Health": "OK",
        "HealthRollup": "OK",
        "State": "Enabled"
    },
    "Subscriptions": {
        "@odata.id": "/redfish/v1/EventService/Subscriptions"
    }
}

To note, though: IPMI over port 623 with Dell iDRAC does not pose any issues. The only issues seem to be in the gofish library, which does not appear to respect the "ignore certs" / "insecure TLS" checkbox on the front-end; if it did, it would not return x509 certificate errors, since in this case the cert is indeed self-signed. The self-signed cert is even noticeable in the racadm output:

╭─mike at suse-workstation-team-harvester in ~/Projects/moritz-baremetal
╰─○ sudo racadm -r 192.168.11.118 -u root -p root get iDRAC.RedfishEventing.IgnoreCertificateErrors             
Security Alert: Certificate is invalid - self-signed certificate
Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.
[Key=iDRAC.Embedded.1#RedfishEventing.1]                                     
IgnoreCertificateErrors=Yes

╭─mike at suse-workstation-team-harvester in ~/Projects/moritz-baremetal
╰─○ sudo racadm -r 192.168.11.118 -u root -p root get iDRAC.RedfishEventing                        
Security Alert: Certificate is invalid - self-signed certificate
Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.
[Key=iDRAC.Embedded.1#RedfishEventing.1]                                     
DeliveryRetryAttempts=5
DeliveryRetryIntervalInSeconds=30
#IgnoreCertificateErrors=Yes

Screenshot from 2024-06-03 13-55-03 Screenshot from 2024-06-03 13-54-24 Screenshot from 2024-06-03 13-53-40 Screenshot from 2024-06-03 13-51-23 Screenshot from 2024-06-03 13-48-13 Screenshot from 2024-06-03 13-47-47

Also to note, the Inventory (on port 443, of course) does complain. From kubectl get inventories -A, with a describe on the one:

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: metal.harvesterhci.io/v1alpha1
kind: Inventory
metadata:
  annotations:
    metal.harvesterhci.io/local-inventory: "true"
    metal.harvesterhci.io/local-node-name: dell-r720-node
  creationTimestamp: "2024-06-03T19:49:43Z"
  finalizers:
  - finalizer.inventory.metal.harvesterhci.io
  generation: 6
  name: dell-r720-node
  namespace: harvester-system
  resourceVersion: "73832"
  uid: 7dd4e2d9-4a2d-4ee5-8cf5-e4f6e9eec7a1
spec:
  baseboardSpec:
    connection:
      authSecretRef:
        name: idrac
        namespace: default
      host: 192.168.11.118
      insecureTLS: true
      port: 443
  events:
    enabled: true
    pollingInterval: 1h
  managementInterfaceMacAddress: ""
  primaryDisk: ""
status:
  conditions:
  - lastUpdateTime: "2024-06-03T20:52:19Z"
    status: "True"
    type: bmcObjectCreated
  - lastUpdateTime: "2024-06-03T19:49:44Z"
    status: "False"
    type: bmcJobSubmitted
  - lastUpdateTime: "2024-06-03T19:49:44Z"
    status: "False"
    type: bmcJobCompleted
  - lastUpdateTime: "2024-06-03T19:49:50Z"
    status: "True"
    type: inventoryAllocatedToCluster
  - lastUpdateTime: "2024-06-03T20:53:12Z"
    message: "failed to open connection to BMC: 5 errors occurred:\n\t* provider:
      gofish: Get \"https://192.168.11.118/redfish/v1/\": tls: failed to verify certificate:
      x509: cannot validate certificate for 192.168.11.118 because it doesn't contain
      any IP SANs\n\t* provider: ipmitool: Error: Unable to establish IPMI v2 / RMCP+
      session: exit status 1\n\t* provider: *asrockrack.ASRockRack: Error logging
      in: Post \"https://192.168.11.118/api/session\": tls: failed to verify certificate:
      x509: cannot validate certificate for 192.168.11.118 because it doesn't contain
      any IP SANs\n\t* provider: IntelAMT: Unable to perform digest auth with http://192.168.11.118:443/wsman:
      Post \"http://192.168.11.118:443/wsman\": EOF\n\t* no Opener implementations
      found\n\n"
    reason: Error
    status: "True"
    type: machineNotContactable
  hardwareID: 2b37edae-21eb-11ef-9f92-ead8a4811903
  ownerCluster:
    name: ""
    namespace: ""
  powerAction: {}
  pxeBootConfig:
    address: 192.168.104.169
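The "doesn't contain any IP SANs" message in the condition above is what Go's standard hostname check reports when a client connects by IP address and the certificate lists no IP subject alternative names. A minimal sketch, assuming an in-memory self-signed certificate rather than the real iDRAC one; `checkHost` is a hypothetical helper written only for this illustration:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// checkHost builds a throwaway self-signed certificate whose IP SANs are
// sanIPs, then checks it against host the way a TLS client would.
func checkHost(host string, sanIPs ...string) error {
	var ips []net.IP
	for _, s := range sanIPs {
		ip := net.ParseIP(s)
		if ip == nil {
			return fmt.Errorf("invalid IP SAN %q", s)
		}
		ips = append(ips, ip)
	}

	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "bmc-example"}, // illustrative name
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		IPAddresses:  ips,
	}
	// Self-signed: the template is both subject and issuer.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return err
	}
	// For an IP host, only the certificate's IP SANs are considered.
	return cert.VerifyHostname(host)
}

func main() {
	host := "192.168.11.118"
	fmt.Println("without IP SAN:", checkHost(host))
	fmt.Println("with IP SAN:   ", checkHost(host, host))
}
```

This check is part of normal certificate verification, so it would not be expected to fire on a client that genuinely has verification disabled.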

Wondering if something with gofish / bmclib is the issue: https://github.com/harvester/seeder/blob/6f07c186f9fe0732c2d3d921ab0a78a3dd6ce907/pkg/controllers/setup.go#L192-L210

The default transport for the HTTP client in bmclib seems to ensure that insecure skip verify is enabled: https://github.com/bmc-toolbox/bmclib/blob/77eee83ecf866d895be464887ca76598a9645a87/internal/httpclient/httpclient.go#L34-L45
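For comparison, here is what "insecure skip verify" means at the plain `net/http` level. This is a rough sketch using only the Go standard library, not the bmclib or gofish code; `newClient`, `probe` and `fakeBMC` are hypothetical names, and a local `httptest` server with an untrusted certificate stands in for the iDRAC:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newClient returns an HTTP client. When insecure is true, TLS certificate
// verification is disabled, so expired, self-signed and SAN-less
// certificates are all accepted.
func newClient(insecure bool) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: insecure},
		},
	}
}

// probe issues a GET and returns the HTTP status code, or the transport error.
func probe(c *http.Client, url string) (int, error) {
	resp, err := c.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// fakeBMC starts a local HTTPS server whose certificate is not trusted by the
// system roots (an assumption standing in for a self-signed BMC certificate).
func fakeBMC() (url string, stop func()) {
	srv := httptest.NewTLSServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, `{"@odata.id": "/redfish/v1"}`)
		}))
	return srv.URL + "/redfish/v1/", srv.Close
}

func main() {
	url, stop := fakeBMC()
	defer stop()

	// Verification on: expected to fail with an x509 error.
	_, err := probe(newClient(false), url)
	fmt.Println("verify on: ", err)

	// Verification off: expected to reach the handler.
	code, err := probe(newClient(true), url)
	fmt.Println("verify off:", code, err)
}
```

If the client that actually talks to the BMC were built like `newClient(true)`, x509 errors such as the ones in the Inventory status should not appear, which suggests the insecureTLS setting may not be reaching whichever client makes the request.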

cc: @ibrokethecloud

IASN-CCC commented 1 month ago

Hi, did you get this working in the end? We also have a Dell server, and when enabling the BMC we get the same red error box you show above.