metal3-io / cluster-api-provider-metal3

Metal³ integration with https://github.com/kubernetes-sigs/cluster-api
Apache License 2.0

Machine stuck in Failed state - Failed to get a BaremetalHost for the Metal3Machine #2055

Open lmarti1505 opened 1 month ago

lmarti1505 commented 1 month ago

What steps did you take and what happened: During cluster creation, two of our Machines got stuck in the Failed state, although the corresponding Metal3Machine in the backend is associated with a BMH and works just fine. The node was properly joined to the CAPI cluster and is also used for scheduling pods.

Example status of one Machine:

status:
  bootstrapReady: true
  conditions:
  - lastTransitionTime: "2024-09-28T03:40:36Z"
    message: 'FailureReason: CreateError'
    reason: MachineHasFailure
    severity: Warning
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-07-22T16:05:49Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2024-09-28T03:40:36Z"
    message: 'FailureReason: CreateError'
    reason: MachineHasFailure
    severity: Warning
    status: "False"
    type: HealthCheckSucceeded
  - lastTransitionTime: "2024-09-28T03:42:08Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2024-10-17T11:54:55Z"
    status: "True"
    type: NodeHealthy
  failureMessage: 'Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1beta1,
    Kind=Metal3Machine with name "roaming-1-pool-default-4s9d7-4fd6g": Failed to get
    a BaremetalHost for the Metal3Machine'
  failureReason: CreateError
  infrastructureReady: true

But the Metal3Machine below is provisioned and associated with a BareMetalHost. Example status of the Metal3Machine:

status:
  conditions:
  - lastTransitionTime: "2024-09-28T03:42:07Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-28T03:42:07Z"
    status: "True"
    type: AssociateBMH
  - lastTransitionTime: "2024-07-22T16:24:07Z"
    status: "True"
    type: KubernetesNodeReady
  - lastTransitionTime: "2024-07-22T16:06:31Z"
    status: "True"
    type: Metal3DataReady

What did you expect to happen: It seems like the current Ready status of the Metal3Machine is not synchronized back to the Machine; I expected the Machine to move out of the Failed state once the Metal3Machine became Ready.

Maybe the Metal3Machine took too long to get ready and the Machine status went to Failed, but the status was never updated once the Metal3Machine became ready.

Anything else you would like to add: Currently, remediation is not allowed because of our maxUnhealthy value for the MachineDeployment the stuck Machines are part of. That is why the Failed nodes won't be replaced with new ones, but it would be great if a node did not need to be replaced at all when it is actually healthy and already running workloads.
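For reference, a minimal sketch of the kind of MachineHealthCheck whose maxUnhealthy value gates remediation; all names and values below are placeholders for illustration, not our actual configuration:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: roaming-1-pool-default-unhealthy   # placeholder name
  namespace: metal3                        # placeholder namespace
spec:
  clusterName: roaming-1                   # placeholder cluster name
  # Remediation stops as soon as more than 40% of the selected Machines
  # are unhealthy, which is why the Failed Machines are not replaced.
  maxUnhealthy: 40%
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: roaming-1-pool-default   # placeholder
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s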

Environment: Reference Environment

/kind bug

metal3-io-bot commented 1 month ago

This issue is currently awaiting triage. If Metal3.io contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

dtantsur commented 1 week ago

Hi! Could you please attach the relevant parts of the CAPM3 logs?

/cc @lentzi90

lentzi90 commented 1 week ago

This behavior has been the source of a lot of confusion for both developers and users. CAPI implemented it so that failures are permanent; in all other cases we should just use Conditions. So this is an error on our part: setting the FailureMessage for something that isn't permanent.

Fortunately, CAPI is now getting rid of these altogether. Going forward, only Conditions will be used. That will solve this issue, since CAPM3 will no longer be able to set FailureMessages.
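Roughly, that means a transient problem like this one would only surface as a condition on the Machine and clear itself on a later reconcile, along the lines of the sketch below (condition type, reason and timestamp are made up for illustration, not the exact shape CAPI will use):

status:
  conditions:
  - lastTransitionTime: "2024-09-28T03:40:36Z"   # illustrative timestamp
    message: Failed to get a BaremetalHost for the Metal3Machine
    reason: WaitingForBMHAssociation              # illustrative reason
    severity: Warning
    status: "False"
    type: InfrastructureReady
  # No failureReason/failureMessage fields, so the Machine recovers on its own
  # once the Metal3Machine is associated with a BMH and reports Ready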

If it is important to fix this particular case we can try to do that, but I would suggest that we instead focus on moving forward; then it will go away by itself in the next minor release.

eumel8 commented 1 day ago

Hi @lentzi90 ,

Sounds reasonable. Any timeline for the next minor release?

lentzi90 commented 1 day ago

Well, CAPI will probably do the v1.9 release this week. After that I think it will take a few weeks for us to adapt to it. If all goes well we may have it out before the holidays, otherwise Q1 2025. No promises :wink: