fabriziopandini opened 1 year ago
Is there anything we can do to improve the behavior instead of just documenting the limitation?
I assume we get the same error not only when manually stopping the kubelet but also when the kubelet is crashing.
Yeah, agree with @sbueringer that a better solution potentially needs to be found; we might need to revive other topics like external etcd (saw the other issue pop up) and/or having an agent that is separate from the workload cluster's API server.
In the meantime we should definitely improve the error message, as it might even be a critical failure in some cases, unless somehow we can move the leader forcibly?
Another scenario where KCP remediation does not work should be documented; it can be fixed once the above issue is handled. When the node that is the etcd leader of a 3-CP cluster is deleted, the following error is raised:
Last Transition Time: 2023-08-09T19:53:04Z
Message: failed to get etcdStatus for workload cluster cc265-35ns2-c2:
failed to create etcd client: etcd leader is reported as 208da8f7e46408c1 with name
"cc265-35ns2-c2-m5bhq-2bzkd", but we couldn't find a corresponding Node in the cluster
Reason: RemediationFailed @ /cc265-35ns2-c2-m5bhq-2bzkd
Severity: Error
Status: False
Type: ControlPlaneReady
From a quick search we are requiring access to the etcd leader member in two places:
If we fix 1., then 2. is the only place where we use etcdClientFor.forLeader, and we can consider changing the implementation to fall back on any etcd member when the leader cannot be reached.
@sbueringer @vincepri opinions?
We discovered that fixing 2. is tricky due to a limitation of the etcd API (moving the leader cannot be requested from a member that is not the current leader). The only viable option is to accept remediating without moving the leader, which means going through a leader election plus a few seconds of instability before this process is triggered.
The tricky part is finding a way to determine when moving the leader is failing due to temporary issues (so we can retry in a few seconds) and when things are instead broken permanently.
I can dig through a bit more but the current logic in https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/etcd_client_generator.go tries to find the leader through the list of members and connect to it via the node; kubelet here is technically required because of the proxy.
In this case there are a few options we could consider:
[ CAPI ]
=> [ Healthy ETCD Node Pod ]
=> [ Unhealthy ETCD Node Pod ]
/priority important-longterm
/assign
I have opened a PR to document the current state, but I would like to dig into this again, so /retitle Allow KCP remediation when access to the etcd leader is not possible
What steps did you take and what happened?
If you stop the kubelet on the etcd leader member, this prevents KCP from doing some checks it expects to perform on the leader (and specifically on the leader). This prevents remediation from happening.
What did you expect to happen?
The KCP remediation limitation should be documented. Eventually the error message could also be improved.
Cluster API version
main, 1.4.0, older releases
Kubernetes version
No response
Anything else you would like to add?
No response
Label(s) to be applied
/kind documentation
/area control-plane
/triage accepted