Open lmarti1505 opened 1 month ago
This issue is currently awaiting triage.
If Metal3.io contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Hi! Could you please attach the relevant parts of the CAPM3 logs?
/cc @lentzi90
This behavior has been the source of a lot of confusion both for developers and users. CAPI implemented it so that Failures are permanent; in all other cases we should just use Conditions. So this is an error on our part: setting the FailureMessage for something that isn't permanent.
Fortunately, CAPI is now getting rid of these altogether. Going forward, only Conditions will be used. That will solve this issue since CAPM3 will no longer be able to use FailureMessages.
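To sketch the difference (illustrative only; the condition type, reason, and message below are made up and not taken from this cluster), a transient provisioning problem would then surface as a condition that can flip back, instead of a terminal failureReason/failureMessage:

```yaml
# Illustrative Machine status shape -- not real output from this issue.
status:
  phase: Provisioning                 # rather than the terminal Failed phase
  conditions:
    - type: InfrastructureReady       # example condition type; exact types may differ
      status: "False"                 # can flip back to "True" on a later reconcile
      reason: WaitingForBMHProvisioning   # made-up reason
      message: Metal3Machine is waiting for a BareMetalHost to become available
  # failureReason / failureMessage stay unset, so the Machine never enters
  # the permanent Failed state for a recoverable error.
```

The key difference is that a condition is re-evaluated on every reconcile, while CAPI treats failureReason/failureMessage as terminal.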
If it is important to fix this particular case we can try to do that, but I would suggest that we instead focus on moving forward; the problem will go away by itself in the next minor release.
Hi @lentzi90,
Sounds reasonable. Any timeline for the next minor release?
Well, CAPI will probably do the v1.9 release this week. After that I think it will take a few weeks for us to adapt to it. If all goes well we may have it out before the holidays, otherwise Q1 2025. No promises :wink:
What steps did you take and what happened: During cluster creation, two of our Machines got stuck in the Failed state, although the Metal3Machine in the backend is associated with a BMH and works just fine. The node properly joined the CAPI cluster and is also used for scheduling pods.
Example status of one Machine:
But the Metal3Machine below is provisioned and associated with a BareMetalHost. Example status of the Metal3Machine:
What did you expect to happen: I expected the ready status of the Metal3Machine to be synchronized to the Machine, so that the Machine gets out of the Failed state.
Maybe the Metal3Machine took too long to get ready and the Machine went to Failed, but the status was never updated when the Metal3Machine became ready.
Anything else you would like to add: One thing to add: currently, remediation is not allowed due to our maxUnhealthy value for the MachineDeployment the stuck Machines are part of (see the illustrative MachineHealthCheck below). That is why the Failed nodes won't be replaced with new ones, but it would be great if a node didn't need to be replaced when it is actually healthy and already runs some workload.
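For context, a MachineHealthCheck roughly like the following is what gates remediation here; the names and values are placeholders, not our actual configuration. Once more Machines in the target pool are unhealthy than maxUnhealthy allows, CAPI skips remediation, which is why the Failed Machines are not replaced:

```yaml
# Hypothetical example, not our real MachineHealthCheck.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: workers-unhealthy            # placeholder name
  namespace: example-namespace       # placeholder namespace
spec:
  clusterName: example-cluster       # placeholder cluster name
  maxUnhealthy: 40%                  # remediation is skipped once this budget is exceeded
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: example-workers   # placeholder MachineDeployment
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```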
Environment: Reference Environment
Kubernetes version (use kubectl version): 1.29.4
/kind bug