kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

KCP Doesn't Remediate Faulty Machines During Cluster Formation #7496

Closed jweite-amazon closed 1 year ago

jweite-amazon commented 1 year ago

What steps did you take and what happened:

What did you expect to happen: The KCP to remediate the bad machine by deleting it.

Anything else you would like to add:

From my read of the code, reconcileUnhealthyMachines() in controlPlane/kubeadm/internal/controller/remediation.go insists that the cluster be fully formed (provisioned machines == desired replicas) before it will act. But the cluster can never fully form if a machine that started successfully cannot join the cluster because of an external issue such as the one I simulated. IMO remediation would be an appropriate response to this situation.
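For illustration, here is a minimal Go sketch of the kind of guard being described (not the actual reconcileUnhealthyMachines code; the function and parameter names are placeholders):

```go
// Sketch only: approximates the guard described above, not the actual
// reconcileUnhealthyMachines implementation.
package main

import "fmt"

// shouldRemediate reports whether an unhealthy machine may be remediated,
// given how many machines exist vs. how many are desired.
func shouldRemediate(provisionedMachines, desiredReplicas int) bool {
	// Remediation is skipped until the control plane is fully formed.
	return provisionedMachines >= desiredReplicas
}

func main() {
	// A 3-replica control plane stuck at 2 machines because one machine
	// started but can never join: remediation is never triggered.
	fmt.Println(shouldRemediate(2, 3)) // false
}
```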

Environment:

/kind bug

killianmuldoon commented 1 year ago

This is as-designed right now: KCP will not remediate based on MHC until at least the desired number of healthy KCP machines are running. This is to ensure stability when a cluster is coming up. For the unhealthy machine you should see a log like:

KCP waiting for having at least 3 control plane machines before triggering remediation

If that's there, then MHC is correctly labelling the machine for remediation, but KCP is specifically deciding not to remediate until there is a stable control plane.

That said, if there's a safe, stable way to do this it could be interesting. One option today is to implement externalRemediation to manage this outside of core Cluster API. It's a hard problem: when the underlying infrastructure isn't working, it's likely another Control Plane Machine will also fail, as there's a real environment issue (in your case, the network being cut off for one of the KCP nodes).
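For readers less familiar with the division of labour here, a rough Go sketch of the handshake (the type and field names are stand-ins, not the actual Cluster API objects; roughly speaking, an external remediation controller would take the place of the owner-side decision shown below):

```go
// Sketch only: illustrates the "MHC marks, owner decides" split described above.
package main

import "fmt"

// Machine is a stand-in for the real Machine object.
type Machine struct {
	Name             string
	NeedsRemediation bool // set by the MachineHealthCheck controller
}

// machineHealthCheck only marks an unhealthy machine; it never deletes it.
func machineHealthCheck(m *Machine, healthy bool) {
	if !healthy {
		m.NeedsRemediation = true
	}
}

// kcpReconcile is where the owning controller decides whether to act on the mark.
func kcpReconcile(m *Machine, currentReplicas, desiredReplicas int) {
	if m.NeedsRemediation && currentReplicas >= desiredReplicas {
		fmt.Printf("remediating %s\n", m.Name)
		return
	}
	fmt.Printf("%s is marked for remediation, but KCP waits for a stable control plane\n", m.Name)
}

func main() {
	m := &Machine{Name: "cp-1"}
	machineHealthCheck(m, false)
	kcpReconcile(m, 2, 3) // during formation: marked, but not remediated
}
```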

jweite-amazon commented 1 year ago

Thanks for that feedback @killianmuldoon. I certainly don't know the basis behind the design decision here (i.e., why remediating during CP formation is risky). Its downside, as demonstrated, is that the partially provisioned CP will remain stuck in that state: the new CP machine can never join the cluster, and CAPI keeps waiting for it to. Stable, yes, but not in a useful way. I'd like CAPI to be able to recover from provisioning problems occurring during cluster formation that it "knows how to" recover from after cluster formation completes.

Can you or anyone shed more light on the risk of remediating during CP formation?

killianmuldoon commented 1 year ago

The major risk at this point is that the etcd cluster is knocked into a state that it can't automatically recover from, e.g. losing the leader or losing the majority.
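To make the majority concern concrete, here is the standard etcd quorum arithmetic (a generic illustration, not code from this repository):

```go
// Sketch only: etcd-style majority arithmetic illustrating the risk above.
package main

import "fmt"

// quorum returns how many members an etcd cluster of size n needs to keep
// committing writes (a strict majority).
func quorum(n int) int { return n/2 + 1 }

func main() {
	// A forming 3-replica control plane that currently has 2 etcd members:
	fmt.Println(quorum(2)) // 2, i.e. both members are required
	// Losing one member at this point leaves 1 of 2 registered members,
	// i.e. no majority, a state the cluster cannot recover from automatically.
}
```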

Given that this is happening at bootstrap time, it's probably easier and faster to just restart automatically if you're confident the KCP machine failure is something flaky, rather than something clearly wrong with the underlying infrastructure.

fabriziopandini commented 1 year ago

/triage accepted

I agree this is an interesting new use case to cover if we can find a safe, stable way to do this.

Some context that I hope can help in shaping the discussion:

Now, as reported above, the last condition prevents remediation during cluster formation; before relaxing this check in this new iteration, IMO we should address at least the following questions:

/area control-plane
/remove-kind bug
/kind feature

fabriziopandini commented 1 year ago

/assign

I'm working on some ideas to solve this problem; I will follow up with more details here or in a PR with an amendment to the KCP proposal.

fabriziopandini commented 1 year ago

https://github.com/kubernetes-sigs/cluster-api/pull/7855 proposes an amendment to the KCP proposal so it will be possible to remediate failures that happen while provisioning the CP (both the first CP and additional CP machines while current replicas < desired replicas).

In order to make this more robust and not aggressive on the infrastructure (e.g. to avoid infinite remediation if the first machine fails consistently), I have added optional support for controlling the number of retries and a delay between each retry. I'm working on a PR that implements the proposed changes.
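As a rough illustration of what such knobs could look like (the type and field names below are assumptions for this sketch; the authoritative shape is defined in the proposal amendment and the follow-up PR):

```go
// Sketch only: hypothetical shape of the retry controls mentioned above,
// not the actual KubeadmControlPlane API.
package main

import (
	"fmt"
	"time"
)

// RemediationRetryConfig sketches the optional knobs described above.
type RemediationRetryConfig struct {
	MaxRetry    int           // how many times the same machine may be remediated
	RetryPeriod time.Duration // minimum delay between two remediation attempts
}

func main() {
	cfg := RemediationRetryConfig{MaxRetry: 3, RetryPeriod: 10 * time.Minute}
	fmt.Printf("retry up to %d times, at most once every %s\n", cfg.MaxRetry, cfg.RetryPeriod)
}
```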