Closed zioc closed 1 month ago
@hardys @Danil-Grigorev @alexander-demicev, as I said today on Slack:

- it seems to us that the longer the drain takes, the more likely these failures are to occur
- failures due to this behavior (which looks like a bug to us) are a frequent cause of CI failures in Sylva project pipelines, so this problem is quite "hot" for us
What happened:
The issue has been described in https://gitlab.com/sylva-projects/sylva-core/-/issues/1595
During rolling upgrades, a control-plane machine is removed from the etcd cluster as soon as the rollout of that machine begins, here.
The issue is that in RKE2 deployments the kubelet is configured to use the local API server (127.0.0.1:443), which in turn relies on the local etcd pod. Once the node is removed from the etcd cluster, the kubelet can no longer reach the API, so the drain fails to complete: from the Kubernetes perspective, all pods remain stuck in the Terminating state.
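The ordering problem can be sketched as a tiny model. This is not the actual controller or kubelet code; `Node`, `apiReachable`, and `drain` are illustrative names, and the model just encodes the dependency chain described above (drain needs the API, the local API needs the local etcd member):

```go
package main

import "fmt"

// Node models a control-plane node in an RKE2 deployment where the
// kubelet talks to the local API server, which depends on the local
// etcd pod. Illustrative types only, not actual CAPI/RKE2 types.
type Node struct {
	Name       string
	EtcdMember bool // still part of the etcd cluster?
}

// apiReachable: in this model the local API server only works while
// the node's etcd member is still in the cluster.
func apiReachable(n Node) bool { return n.EtcdMember }

// drain succeeds only if the kubelet can reach the API to confirm pod
// termination; otherwise pods stay Terminating indefinitely.
func drain(n Node) string {
	if !apiReachable(n) {
		return "stuck: pods remain Terminating"
	}
	return "drained"
}

func main() {
	// Current behavior: etcd member removed first, then drain -> stuck.
	a := Node{Name: "cp-1", EtcdMember: true}
	a.EtcdMember = false
	fmt.Println("remove member first:", drain(a))

	// Desired ordering: drain first, remove the member afterwards.
	b := Node{Name: "cp-2", EtcdMember: true}
	fmt.Println("drain first:", drain(b))
	b.EtcdMember = false
}
```

Running it prints `stuck: pods remain Terminating` for the current ordering and `drained` for the proposed one.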
We should probably avoid removing the etcd member so early during rolling upgrades. Instead, we could rely on the periodic reconcileEtcdMembers, which ensures that the number of etcd members stays in sync with the number of machines/nodes. That way, an etcd member would be removed only after its node has been properly drained and removed from the cluster by the CAPI controller.
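The reconcile-based approach boils down to a set difference: remove only those etcd members that no longer have a matching machine/node. A minimal sketch, assuming a hypothetical `membersToRemove` helper (the real reconcileEtcdMembers in the controller works on richer objects, not plain strings):

```go
package main

import (
	"fmt"
	"sort"
)

// membersToRemove sketches the reconcileEtcdMembers idea: a periodic
// reconcile compares the etcd member list against the current
// machines/nodes and flags only members whose node is already gone,
// i.e. already drained and deleted by the CAPI controller.
// Names are illustrative, not the actual controller API.
func membersToRemove(etcdMembers, nodes []string) []string {
	nodeSet := make(map[string]bool, len(nodes))
	for _, n := range nodes {
		nodeSet[n] = true
	}
	var stale []string
	for _, m := range etcdMembers {
		if !nodeSet[m] {
			stale = append(stale, m)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// During the rollout the node still exists, so its member is kept.
	fmt.Println(membersToRemove(
		[]string{"cp-1", "cp-2", "cp-3"},
		[]string{"cp-1", "cp-2", "cp-3"}))

	// After drain + node deletion, the member is stale and can go.
	fmt.Println(membersToRemove(
		[]string{"cp-1", "cp-2", "cp-3"},
		[]string{"cp-2", "cp-3"}))
}
```

With this ordering the member removal is a consequence of node deletion rather than a precondition of the rollout, so the kubelet keeps API access for the whole drain.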