kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

KCP Doesn't Remediate Faulty Machines During Cluster Formation #7496

Closed jweite-amazon closed 1 year ago

jweite-amazon commented 1 year ago

What steps did you take and what happened:

What did you expect to happen: The KCP to remediate the bad machine by deleting it.

Anything else you would like to add:

From my read of the code, reconcileUnhealthyMachines() in controlPlane/kubeadm/internal/controller/remediation.go insists that the cluster be fully formed (provisioned machines == desired replicas) before it will act. But the cluster can never fully form if a machine that started successfully cannot join the cluster because of an external issue such as the one I simulated. IMO remediation would be an appropriate response to this situation.
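For illustration, here is a minimal Go sketch of the kind of guard being described (not the actual reconcileUnhealthyMachines code; the function and parameter names are placeholders):

```go
// Sketch only: approximates the guard described above, not the actual
// reconcileUnhealthyMachines implementation.
package main

import "fmt"

// shouldRemediate reports whether an unhealthy machine may be remediated,
// given how many machines exist vs. how many are desired.
func shouldRemediate(provisionedMachines, desiredReplicas int) bool {
	// Remediation is skipped until the control plane is fully formed.
	return provisionedMachines >= desiredReplicas
}

func main() {
	// A 3-replica control plane stuck at 2 machines because one machine
	// started but can never join: remediation is never triggered.
	fmt.Println(shouldRemediate(2, 3)) // false
}
```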

Environment:

/kind bug

killianmuldoon commented 1 year ago

This is as-designed right now: KCP will not remediate based on MHC until at least the desired number of healthy KCP machines are running. This is to ensure stability when a cluster is coming up. For the unhealthy machine you should see a log like:

KCP waiting for having at least 3 control plane machines before triggering remediation

If that's there, then MHC is correctly labelling the machine for remediation, but KCP is specifically deciding not to remediate until there is a stable control plane.

That said, if there's a safe, stable way to do this it could be interesting. One option today is to implement externalRemediation to manage this outside of core Cluster API. It's a hard problem: when the underlying infrastructure isn't working, it's likely another Control Plane Machine will also fail, as there's a real environment issue (in your case, the network being cut off for one of the KCP nodes).
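For readers less familiar with the division of labour here, a rough Go sketch of the handshake (the type and field names are stand-ins, not the actual Cluster API objects; roughly speaking, an external remediation controller would take the place of the owner-side decision shown below):

```go
// Sketch only: illustrates the "MHC marks, owner decides" split described above.
package main

import "fmt"

// Machine is a stand-in for the real Machine object.
type Machine struct {
	Name             string
	NeedsRemediation bool // set by the MachineHealthCheck controller
}

// machineHealthCheck only marks an unhealthy machine; it never deletes it.
func machineHealthCheck(m *Machine, healthy bool) {
	if !healthy {
		m.NeedsRemediation = true
	}
}

// kcpReconcile is where the owning controller decides whether to act on the mark.
func kcpReconcile(m *Machine, currentReplicas, desiredReplicas int) {
	if m.NeedsRemediation && currentReplicas >= desiredReplicas {
		fmt.Printf("remediating %s\n", m.Name)
		return
	}
	fmt.Printf("%s is marked for remediation, but KCP waits for a stable control plane\n", m.Name)
}

func main() {
	m := &Machine{Name: "cp-1"}
	machineHealthCheck(m, false)
	kcpReconcile(m, 2, 3) // during formation: marked, but not remediated
}
```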

jweite-amazon commented 1 year ago

Thanks for that feedback @killianmuldoon. I certainly don't know the basis behind the design decision here (i.e., why remediating during CP formation is risky). Its downside, as demonstrated, is that the partially provisioned CP will remain stuck in that state: the new CP machine can never join the cluster, and CAPI keeps waiting for it to. Stable, yes, but not in a useful way. I'd like CAPI to be able to recover from provisioning problems occurring during cluster formation that it "knows how to" recover from after cluster formation completes.

Can you or anyone shed more light on the risk of remediating during CP formation?

killianmuldoon commented 1 year ago

The major risk at this point is that the etcd cluster is knocked into a state that it can't automatically recover from, e.g. losing the leader or losing the majority.
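To make the majority concern concrete, here is the standard etcd quorum arithmetic (a generic illustration, not code from this repository):

```go
// Sketch only: etcd-style majority arithmetic illustrating the risk above.
package main

import "fmt"

// quorum returns how many members an etcd cluster of size n needs to keep
// committing writes (a strict majority).
func quorum(n int) int { return n/2 + 1 }

func main() {
	// A forming 3-replica control plane that currently has 2 etcd members:
	fmt.Println(quorum(2)) // 2, i.e. both members are required
	// Losing one member at this point leaves 1 of 2 registered members,
	// i.e. no majority, a state the cluster cannot recover from automatically.
}
```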

Given that this is happening at bootstrap time, it's probably easier and faster to just restart automatically if you're confident the KCP machine failure is something flaky, rather than something clearly wrong with the underlying infrastructure.

fabriziopandini commented 1 year ago

/triage accepted

I agree this is an interesting new use case to cover if we can find a safe, stable way to do this.

Some context that I hope can help in shaping the discussion:

Now, as reported above, the last condition prevents remediation during cluster formation; before relaxing this check in this new iteration, IMO we should address at least the following questions:

/area control-plane
/remove-kind bug
/kind feature

fabriziopandini commented 1 year ago

/assign

I'm working on some ideas to solve this problem; I will follow up with more details here or in a PR with an amendment to the KCP proposal.

fabriziopandini commented 1 year ago

https://github.com/kubernetes-sigs/cluster-api/pull/7855 proposes an amendment to the KCP proposal so it will be possible to remediate failures that happen while provisioning the CP (both the first CP and additional CP machines while current replicas < desired replicas).

In order to make this more robust and not aggressive on the infrastructure (e.g. to avoid infinite remediation if the first machine fails consistently), I have added optional support for controlling the number of retries and a delay between each retry. I'm working on a PR that implements the proposed changes.
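As a rough illustration of what such knobs could look like (the type and field names below are assumptions for this sketch; the authoritative shape is defined in the proposal amendment and the follow-up PR):

```go
// Sketch only: hypothetical shape of the retry controls mentioned above,
// not the actual KubeadmControlPlane API.
package main

import (
	"fmt"
	"time"
)

// RemediationRetryConfig sketches the optional knobs described above.
type RemediationRetryConfig struct {
	MaxRetry    int           // how many times the same machine may be remediated
	RetryPeriod time.Duration // minimum delay between two remediation attempts
}

func main() {
	cfg := RemediationRetryConfig{MaxRetry: 3, RetryPeriod: 10 * time.Minute}
	fmt.Printf("retry up to %d times, at most once every %s\n", cfg.MaxRetry, cfg.RetryPeriod)
}
```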