metal3-io / baremetal-operator

Bare metal host provisioning integration for Kubernetes
Apache License 2.0

BMH is stuck in inspecting state #1706

Closed · ss2901 closed this issue 1 month ago

ss2901 commented 6 months ago

I am using a Dell server for deployment and it is getting stuck in the inspecting state. A SUSE image (SLES 15) is used for booting; I also tried Ubuntu 22.04. The deployment fails with an inspection error saying that the timeout was reached while inspecting the node.

Events:
  Normal  InspectionStarted  34m    metal3-baremetal-controller  Hardware inspection started
  Normal  InspectionError    4m35s  metal3-baremetal-controller  timeout reached while inspecting the node
  Normal  InspectionStarted  4m33s  metal3-baremetal-controller  Hardware inspection started

On the iDRAC virtual console, it is stuck on "unable to access console, root account is locked", even though I have checked the root credentials and they work fine.

Also, below is the YAML of the BMH:

$ kubectl get bmh -A -o yaml
apiVersion: v1
items:
- apiVersion: metal3.io/v1alpha1
  kind: BareMetalHost
  metadata:
    annotations:
      meta.helm.sh/release-name: cluster-bmh
      meta.helm.sh/release-namespace: my-rke2-capm3
      sylvaproject.org/baremetal-host-name: my-server
      sylvaproject.org/cluster-name: my-rke2-capm3
      sylvaproject.org/default-longhorn-disks-config: '[{ "path":"/var/longhorn/disks/sdb","storageReserved":0,"allowScheduling":true,"tags":[
        "ssd", "fast" ] },{ "path":"/var/longhorn/disks/sdc","storageReserved":0,"allowScheduling":true,"tags":[
        "ssd", "fast" ] } ]'
    creationTimestamp: "2024-04-29T20:54:34Z"
    finalizers:
    - baremetalhost.metal3.io
    generation: 2
    labels:
      app.kubernetes.io/instance: cluster-bmh
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: sylva-capi-cluster
      app.kubernetes.io/version: 0.0.0
      cluster-role: control-plane
      helm.sh/chart: sylva-capi-cluster-0.0.0_ab1e5edb7f30
      helm.toolkit.fluxcd.io/name: cluster-bmh
      helm.toolkit.fluxcd.io/namespace: my-rke2-capm3
      host-type: generic
    name: my-rke2-capm3-my-server
    namespace: my-rke2-capm3
    resourceVersion: "1180427"
    uid: 76e9268e-b807-42b6-ac07-0feb8013e18d
  spec:
    automatedCleaningMode: metadata
    bmc:
      address: redfish://<bmc-address>/redfish/v1/Systems/System.Embedded.1
      credentialsName: my-rke2-capm3-my-server-secret
      disableCertificateVerification: true
    bootMACAddress: <mac-address>
    bootMode: UEFI
    description: Dell M640 Blade Server
    online: true
    rootDeviceHints:
      hctl: "0:0:0:0"
  status:
    errorCount: 12
    errorMessage: ""
    goodCredentials:
      credentials:
        name: my-rke2-capm3-my-server-secret
        namespace: my-rke2-capm3
      credentialsVersion: "238295"
    hardwareProfile: unknown
    lastUpdated: "2024-04-30T06:07:58Z"
    operationHistory:
      deprovision:
        end: null
        start: null
      inspect:
        end: null
        start: "2024-04-29T20:54:53Z"
      provision:
        end: null
        start: null
      register:
        end: "2024-04-29T20:54:53Z"
        start: "2024-04-29T20:54:36Z"
    operationalStatus: OK
    poweredOn: false
    provisioning:
      ID: 647bb414-2678-4bfb-9782-66b99adcdd6f
      bootMode: UEFI
      image:
        url: ""
      rootDeviceHints:
        hctl: "0:0:0:0"
      state: inspecting
    triedCredentials:
      credentials:
        name: my-rke2-capm3-my-server-secret
        namespace: my-rke2-capm3
      credentialsVersion: "238295"
kind: List
metadata:
  resourceVersion: ""

/kind bug

metal3-io-bot commented 6 months ago

This issue is currently awaiting triage. If Metal3.io contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
dtantsur commented 6 months ago

Hi! The SSH public key for the root account on the inspection/deployment ramdisk can be passed to the Ironic image (the IRONIC_RAMDISK_SSH_KEY variable). Not sure if that's what you did; mentioning it for completeness.
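
For reference, setting that variable might look roughly like the sketch below; the namespace and deployment names are assumptions and will differ depending on how Metal3 was installed:

    # Assumed names: namespace "baremetal-operator-system", deployment "ironic".
    # IRONIC_RAMDISK_SSH_KEY injects the public key so you can SSH into the ramdisk as root.
    $ kubectl -n baremetal-operator-system set env deployment/ironic \
        IRONIC_RAMDISK_SSH_KEY="$(cat ~/.ssh/id_ed25519.pub)"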

Once you get in, the first thing to check is networking. In the vast majority of cases, what you observe is caused by the ramdisk's inability to reach back to Ironic on the provisioning network. The other end of it is the dnsmasq container in the metal3 pod - you can even start by checking its logs. If they are empty or do not mention the provided bootMACAddress, chances are high that the DHCP traffic is not reaching Metal3 on the provisioning network.
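
A rough illustration of those checks (the namespace, pod, and container names are assumptions; in many Metal3 deployments the dnsmasq container is called ironic-dnsmasq):

    # Find the Metal3/Ironic pod, then check whether dnsmasq ever saw DHCP
    # traffic from the host's bootMACAddress.
    $ kubectl -n baremetal-operator-system get pods
    $ kubectl -n baremetal-operator-system logs <ironic-pod> -c ironic-dnsmasq | grep -i '<mac-address>'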

I hope these hints help.

matthewei commented 5 months ago

Could you log in to the BMC to double-check the console log?

Rozzii commented 5 months ago

I would also like to ask for the logs of the Ironic container of the Ironic pod, @ss2901, please.
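
Something along these lines should capture them (the namespace, pod, and container names are assumptions; adjust to your deployment):

    $ kubectl -n baremetal-operator-system logs <ironic-pod> -c ironic > ironic.log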

Rozzii commented 5 months ago

/triage needs-information

metal3-io-bot commented 2 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

metal3-io-bot commented 1 month ago

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

metal3-io-bot commented 1 month ago

@metal3-io-bot: Closing this issue.

In response to [this](https://github.com/metal3-io/baremetal-operator/issues/1706#issuecomment-2346503227):

> Stale issues close after 30d of inactivity. Reopen the issue with `/reopen`. Mark the issue as fresh with `/remove-lifecycle stale`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.