kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Incorrect etcd members remediation in KCP after controller failure. #4365

Closed: maelk closed this issue 3 years ago

maelk commented 3 years ago

What steps did you take and what happened: A control plane machine was marked as unhealthy for remediation. KCP started to process it, removed the associated etcd member, and removed the machine from the kubeadm ConfigMap. For a reason that is not completely clear yet, the machine deletion did not happen and the KCP controller was restarted at that point. The new instance then refuses to remediate the machine any further, since the etcd cluster only has two members left (the member associated with the machine marked for remediation was already removed).

A manual deletion of the machine fixed the problem.

What did you expect to happen: The KCP controller should be able to find out that the member was already removed and that no further etcd changes are needed, and should not block the deletion because of a 2-member etcd cluster. Before taking the path of trying to remove the etcd member, the KCP controller should check whether that member is still present. If not, it should skip all checks and actions related to etcd and proceed with the deletion of the machine.
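Roughly, the check I have in mind could look like this (a hypothetical sketch using the upstream etcd clientv3 API; the helper name and the way it would be wired into remediation are assumptions, not existing KCP code):

```go
package remediation

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// etcdMemberExists is a hypothetical helper: it returns true if an etcd member
// whose name matches the given node name is still part of the cluster. In a
// kubeadm-based control plane the member name normally matches the node name.
func etcdMemberExists(ctx context.Context, cli *clientv3.Client, nodeName string) (bool, error) {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return false, err
	}
	for _, m := range resp.Members {
		if m.Name == nodeName {
			return true, nil
		}
	}
	return false, nil
}
```

If the member is already gone, the quorum pre-checks and the member removal could be skipped and reconciliation could proceed straight to deleting the machine; if it is still present, the existing safety checks would apply unchanged.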

Environment:

/kind bug

@fabriziopandini Do you think we could add a check, listing the members of the etcd cluster, to find out whether it was already removed or not?

fabriziopandini commented 3 years ago

@maelk I would first like to understand why the machine did not get deleted and why KCP restarted.

The KCP controller should be able to find out that the member was already removed and that no further etcd changes are needed, and should not block the deletion because of a 2-member etcd cluster. Before taking the path of trying to remove the etcd member, the KCP controller should check whether that member is still present. If not, it should skip all checks and actions related to etcd and proceed with the deletion of the machine.

KCP tries to figure out the target etcd cluster, and this logic should already ignore the machine being deleted. However, it seems not to work in this case, but without logs/additional info it is really difficult to understand why and/or propose fixes.

maelk commented 3 years ago

I'm looking into why it failed in the first place. If I find the reason and it is a bug, I will open another issue. But I think it should not matter for the next round of reconciliation. We always hit this: https://github.com/kubernetes-sigs/cluster-api/blob/master/controlplane/kubeadm/controllers/remediation.go#L113 . I will reproduce the error and fetch the logs. There was, however, nothing more than that error message after the controller was restarted.

maelk commented 3 years ago

While I fetch the logs, would you mind explaining the logic you mention? By reading the code, I can see that the number of members is checked without even considering that the machine's member might have already been removed: https://github.com/kubernetes-sigs/cluster-api/blob/5cd1132aa75f6358d66cf977164ace68ec87efe0/controlplane/kubeadm/controllers/remediation.go#L197 . That check is always run, since it is called by https://github.com/kubernetes-sigs/cluster-api/blob/5cd1132aa75f6358d66cf977164ace68ec87efe0/controlplane/kubeadm/controllers/remediation.go#L107.

fabriziopandini commented 3 years ago

The cluster MUST have at least 3 members, because this is the smallest cluster size that allows any etcd failure tolerance.

This was one of the assumptions in the CAEP, because, as stated in the etcd documentation, clusters with fewer than 3 members provide a failure tolerance of 0.
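For reference, the arithmetic behind that rule (just the standard etcd quorum formula, not CAPI code):

```go
package main

import "fmt"

func main() {
	// For an etcd cluster of n members: quorum = n/2 + 1 (integer division),
	// and the number of member failures it can tolerate is n - quorum.
	for _, n := range []int{1, 2, 3, 4, 5} {
		quorum := n/2 + 1
		fmt.Printf("members=%d quorum=%d failure tolerance=%d\n", n, quorum, n-quorum)
	}
	// Tolerance is 0 for 1 or 2 members and only becomes 1 at 3 members,
	// which is why remediation refuses to act on clusters smaller than 3.
}
```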

Of course there is room for improvement, assuming we can find situations where we are 100% sure to preserve the operational state of the cluster.

This is potentially one of those cases; in fact, from a quick look at the code, we are already tolerating failures of the member being remediated, but we are not tolerating the total absence of the member being remediated.

However, at first sight, I'm not sure this is something easy to achieve, because in this case remediation would need to take special considerations when dealing with etcd (e.g. a different approach to calculating quorum/target).

What is really important IMO is that we stick to the principle that the source of truth is etcd itself (not the list of machines), given that this operation could be destructive.

/remove-kind bug
/kind feature

maelk commented 3 years ago

Since we have the list of members, we could verify that the machine we are remediating is not in the list of members, or, if it is, that we have at least 3 members, and then go further with the checks. Do you think that would be enough as a check, or would it not be sufficient to ensure resiliency?
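In other words, something along these lines (a hypothetical sketch of the guard I'm proposing; the names are made up and this is not the existing KCP logic):

```go
// canSkipOrPassEtcdCheck is a hypothetical guard combining the two conditions:
// if the unhealthy machine no longer has an etcd member, removing the machine
// cannot shrink the etcd cluster any further, so the size check can be skipped;
// otherwise the existing ">= 3 members" rule still applies.
func canSkipOrPassEtcdCheck(memberNames []string, unhealthyNodeName string) bool {
	memberStillPresent := false
	for _, name := range memberNames {
		if name == unhealthyNodeName {
			memberStillPresent = true
			break
		}
	}
	if !memberStillPresent {
		// Member was already removed by a previous, interrupted remediation.
		return true
	}
	return len(memberNames) >= 3
}
```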

fabriziopandini commented 3 years ago

I'm not sure that skipping the evaluation of etcd state when the machine we are remediating is not in the list of current members is a good idea.

If we proceed with the remediation without checking the etcd state in the case of "machine member already removed", then the machine will be deleted and a new one will be added, joining as an etcd member on a cluster which we are not sure is in a fully operational state, and this can lead to the cluster losing quorum. If I remember well, a more detailed explanation of the above risk is in the KCP document under the remediation paragraph.

This is why in my previous comment I was assuming that

I'm not sure this is something easy to achieve, because in this case remediation would need to take special considerations when dealing with etcd (e.g. a different approach to calculating quorum/target). What is really important IMO is that we stick to the principle that the source of truth is etcd itself (not the list of machines), given that this operation could be destructive.

fabriziopandini commented 3 years ago

/milestone v0.4
/area control-plane

maelk commented 3 years ago

I found the root cause of the failure, i.e. why the controller fails in the first place. We are using a VIP setup, and the machine marked unhealthy held the VIP, so when etcd died, the kube-apiserver stopped answering for a while (until the failover mechanism kicked in). We then get this kind of logs:

{"log":"I0325 07:00:40.914700       1 remediation.go:189] controllers/KubeadmControlPlane \"msg\"=\"etcd cluster before remediation\" \"cluster\"=\"eshaiis-cluster\" \"kubeadmControlPlane\"=\"cp\" \"namespace\"=\"eshaiis-cluster-capi\" \"currentMembers\"=[\"cp-baremetal-eshaiis-cluster-0\",\"cp-baremetal-eshaiis-cluster-1\",\"cp-baremetal-eshaiis-cluster-3\"] \"currentTotalMembers\"=3\n","stream":"stderr","time":"2021-03-25T07:00:40.918651066Z"}
{"log":"I0325 07:00:40.914826       1 remediation.go:244] controllers/KubeadmControlPlane \"msg\"=\"etcd cluster projected after remediation of cp-p2gpn\" \"cluster\"=\"eshaiis-cluster\" \"kubeadmControlPlane\"=\"cp\" \"namespace\"=\"eshaiis-cluster-capi\" \"healthyMembers\"=[\"cp-baremetal-eshaiis-cluster-0 (cp-p2gpn)\",\"cp-baremetal-eshaiis-cluster-1 (cp-jjxkj)\",\"cp-baremetal-eshaiis-cluster-3 (cp-v6nlb)\"] \"projectedQuorum\"=2 \"targetQuorum\"=2 \"targetTotalMembers\"=2 \"targetUnhealthyMembers\"=0 \"unhealthyMembers\"=[]\n","stream":"stderr","time":"2021-03-25T07:00:40.918679883Z"}
{"log":"I0325 07:00:53.415011       1 leaderelection.go:288] failed to renew lease eshaiis-cluster-capi/kubeadm-control-plane-manager-leader-election-capi: failed to tryAcquireOrRenew context deadline exceeded\n","stream":"stderr","time":"2021-03-25T07:00:53.415448627Z"}
{"log":"E0325 07:00:53.415140       1 main.go:149] setup \"msg\"=\"problem running manager\" \"error\"=\"leader election lost\"  \n","stream":"stderr","time":"2021-03-25T07:00:53.415510082Z"}

That explains why KCP is unable to delete the machine (the API server is not answering). For now we can work around the problem by ensuring that the VIP is not on that node when we remediate. I still think that resiliency could be improved by handling the case of remediating a machine whose etcd member has already been removed.

maelk commented 3 years ago

After the restart of the controller, the error messages are the following:

{"log":"I0325 08:49:27.927750       1 controller.go:244] controllers/KubeadmControlPlane \"msg\"=\"Reconcile KubeadmControlPlane\" \"cluster\"=\"eshaiis-cluster\" \"kubeadmControlPlane\"=\"cp\" \"namespace\"=\"eshaiis-cluster-capi\" \n","stream":"stderr","time":"2021-03-25T08:49:27.928404137Z"}
{"log":"I0325 08:49:37.642058       1 remediation.go:189] controllers/KubeadmControlPlane \"msg\"=\"etcd cluster before remediation\" \"cluster\"=\"eshaiis-cluster\" \"kubeadmControlPlane\"=\"cp\" \"namespace\"=\"eshaiis-cluster-capi\" \"currentMembers\"=[\"cp-baremetal-eshaiis-cluster-1\",\"cp-baremetal-eshaiis-cluster-3\"] \"currentTotalMembers\"=2\n","stream":"stderr","time":"2021-03-25T08:49:37.642492426Z"}
{"log":"I0325 08:49:37.642157       1 remediation.go:198] controllers/KubeadmControlPlane \"msg\"=\"etcd cluster with less of 3 members can't be safely remediated\" \"cluster\"=\"eshaiis-cluster\" \"kubeadmControlPlane\"=\"cp\" \"namespace\"=\"eshaiis-cluster-capi\" \n","stream":"stderr","time":"2021-03-25T08:49:37.642561182Z"}
{"log":"I0325 08:49:37.642197       1 remediation.go:113] controllers/KubeadmControlPlane \"msg\"=\"A control plane machine needs remediation, but removing this machine could result in etcd quorum loss. Skipping remediation\" \"cluster\"=\"eshaiis-cluster\" \"kubeadmControlPlane\"=\"cp\" \"namespace\"=\"eshaiis-cluster-capi\" \"UnhealthyMachine\"=\"cp-p2gpn\"\n","stream":"stderr","time":"2021-03-25T08:49:37.642571143Z"}

fabriziopandini commented 3 years ago

/lifecycle active