rancher / cluster-api-provider-rke2

RKE2 bootstrap and control-plane Cluster API providers.
https://rancher.github.io/cluster-api-provider-rke2/
Apache License 2.0

Rolling upgrades are blocked by nodes that are not properly drained #431

Closed (zioc closed this 1 month ago)

zioc commented 1 month ago

What happened:

The issue has been described in https://gitlab.com/sylva-projects/sylva-core/-/issues/1595

During rolling upgrades, a control-plane machine's etcd member is removed from the etcd cluster as soon as the machine starts being rolled out, here.

The issue is that in RKE2 deployments the kubelet is configured to use the local API server (127.0.0.1:443), which in turn relies on the local etcd pod. Once this node is removed from the etcd cluster, the kubelet can no longer reach the API, so the node cannot be drained properly: from the Kubernetes perspective, all pods remain stuck in the Terminating state.
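To make the failure sequence concrete, here is a minimal, hypothetical Go sketch of the ordering described above. None of these names come from the provider's code base; the point is only that the etcd member is removed before the drain runs, which is what makes the drain hang:

```go
// Hypothetical model of the current ordering (illustrative names only).
package main

import (
	"errors"
	"fmt"
)

type machine struct{ name string }

// etcdMembers tracks which control-plane machines still back an etcd member.
var etcdMembers = map[string]bool{"cp-0": true, "cp-1": true, "cp-2": true}

// removeEtcdMember models the early removal that happens as soon as a machine
// is selected for rollout.
func removeEtcdMember(m machine) { delete(etcdMembers, m.name) }

// drainNode models the drain: without the local etcd member, the kubelet's
// local API endpoint (127.0.0.1) has no working backend, so pod terminations
// are never observed and the drain never completes.
func drainNode(m machine) error {
	if !etcdMembers[m.name] {
		return errors.New("drain stuck: pods remain Terminating because the kubelet cannot reach the API")
	}
	return nil
}

func main() {
	old := machine{name: "cp-0"}

	// Current ordering: etcd member removed first, then drain -> drain hangs.
	removeEtcdMember(old)
	if err := drainNode(old); err != nil {
		fmt.Println("rolling upgrade blocked:", err)
	}
}
```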

We should probably avoid removing the etcd member so early during rolling upgrades. Instead, we could rely on the periodic reconcileEtcdMembers, which ensures the number of etcd members stays in sync with the number of machines/nodes. That way, the etcd member would be removed only after the node has been properly drained and removed from the cluster by the CAPI controller.
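Below is a hedged sketch, again with illustrative in-memory maps rather than the provider's real etcd and Kubernetes clients, of what a periodic reconcileEtcdMembers-style pass could look like: etcd members are removed only once they no longer have a corresponding machine, i.e. after the node has been drained and the machine deleted by the CAPI controller.

```go
// Hypothetical sketch of the proposed reconciliation (illustrative names only).
package main

import "fmt"

// reconcileEtcdMembers removes etcd members whose machine no longer exists.
// members and machines map a control-plane name to its presence.
func reconcileEtcdMembers(members, machines map[string]bool) []string {
	var removed []string
	for name := range members {
		if !machines[name] {
			delete(members, name)
			removed = append(removed, name)
		}
	}
	return removed
}

func main() {
	members := map[string]bool{"cp-0": true, "cp-1": true, "cp-2": true}

	// While the old machine still exists (drain in progress), nothing is
	// removed, so the kubelet keeps a working local API endpoint and the
	// drain can finish.
	machines := map[string]bool{"cp-0": true, "cp-1": true, "cp-2": true, "cp-3": true}
	fmt.Println("removed during drain:", reconcileEtcdMembers(members, machines))

	// After the machine has been drained and deleted by the CAPI controller,
	// the periodic pass cleans up the now-orphaned etcd member.
	delete(machines, "cp-0")
	fmt.Println("removed after deletion:", reconcileEtcdMembers(members, machines))
}
```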

tmmorin commented 1 month ago

@hardys @Danil-Grigorev @alexander-demicev, as I said today on Slack:

- It seems to us that the longer the drain takes, the more likely these occurrences are.
- Failures due to this behavior (which looks like a bug to us) are a frequent cause of CI failures in Sylva project pipelines, so this problem is quite "hot" for us.