fabriziopandini opened 1 year ago
Is there anything we can do to improve the behavior instead of just documenting the limitation?
I assume we get the same error not only when manually stopping the kubelet but also when the kubelet is crashing.
Yeah, agree with @sbueringer that a better solution potentially needs to be found; we might need to revive other topics like external etcd (saw the other issue pop up) and/or having an agent that is separate from the workload cluster's API server.
In the meantime we should definitely improve the error message, as it might even be a critical failure in some cases, unless somehow we can move the leader forcibly?
Another scenario where KCP remediation does not work should be documented; it can be fixed once the above issue is handled. When the node that is the etcd leader of a 3-CP cluster is deleted, the following error is raised:
Last Transition Time: 2023-08-09T19:53:04Z
Message: failed to get etcdStatus for workload cluster cc265-35ns2-c2:
failed to create etcd client: etcd leader is reported as 208da8f7e46408c1 with name
"cc265-35ns2-c2-m5bhq-2bzkd", but we couldn't find a corresponding Node in the cluster
Reason: RemediationFailed @ /cc265-35ns2-c2-m5bhq-2bzkd
Severity: Error
Status: False
Type: ControlPlaneReady
From a quick search we are requiring access to the etcd leader member in two places:
If we fix 1., then 2. is the only place where we use etcdClientFor.forLeader, and we can consider changing the implementation to fall back on any etcd member when the leader cannot be reached.
@sbueringer @vincepri opinions?
We discovered that fixing 2. is tricky due to a limitation of the etcd API (moving the leader cannot be requested from a member that is not the current leader). The only viable option is to accept remediating without moving the leader, which means going through a leader election plus a few seconds of instability before this process is triggered.
The tricky part is finding a way to determine when moving the leader is failing due to temporary issues (so we can retry in a few seconds) and when things are instead broken permanently.
I can dig through a bit more but the current logic in https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/etcd_client_generator.go tries to find the leader through the list of members and connect to it via the node; kubelet here is technically required because of the proxy.
In this case there are a few options we could consider:
[ CAPI ]
=> [ Healthy ETCD Node Pod ]
=> [ Unhealthy ETCD Node Pod ]
/priority important-longterm
/assign
I have opened a PR to document the current state, but I would like to dig into this again, so /retitle Allow KCP remediation when access to the etcd leader is not possible
What steps did you take and what happened?
If you stop the kubelet on the etcd leader member, this prevents KCP from doing some checks it expects to perform on the leader (and specifically on the leader). This prevents remediation from happening.
What did you expect to happen?
The KCP remediation limitation should be documented. Eventually the error message could also be improved.
Cluster API version
main, 1.4.0, older releases
Kubernetes version
No response
Anything else you would like to add?
No response
Label(s) to be applied
/kind documentation
/area control-plane
/triage accepted