kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

[CAS] Critical bug in cloudprovider/rancher implementation of node scale down #7474

Open pvlkov opened 2 weeks ago

pvlkov commented 2 weeks ago

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: 1.30.0

What k8s version are you using (kubectl version)?:

Rancher management cluster: v1.30.6+rke2r1 (RKE2)
Downstream cluster: v1.30.6+rke2r1

What environment is this in?:

Self-hosted, Rancher-provisioned RKE2 clusters on VMware vSphere

What did you expect to happen?:

CAS should remove the correct number of nodes from the cluster during scale down.

What happened instead?:

The CAS running in a downstream RKE2 cluster tried to delete a single node due to low utilization. The deletion was stuck in Rancher (for reasons currently unknown to us), which caused CAS to keep reducing the corresponding node group size until it reached its minimum value. In our specific case this meant that a node group with 25 nodes was suddenly reduced to a size of 1, deleting all but one of its nodes and rendering all workloads unavailable instantly.

Analysis: We did a root cause analysis of the issue, since in our case a production environment with customer workloads was affected. During this analysis we found what we consider a critical bug in the scale-down process of the Rancher implementation of CAS. The logs of the CAS pod right before the outage show the following:

2024-11-06 07:30:56.523 I1106 06:30:56.523011       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:31:06.933 I1106 06:31:06.933853       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:31:17.388 I1106 06:31:17.388816       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:31:27.851 I1106 06:31:27.851784       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:31:38.474 I1106 06:31:38.474876       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:31:48.889 I1106 06:31:48.889420       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:31:59.360 I1106 06:31:59.360176       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:32:09.967 I1106 06:32:09.967698       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:32:20.457 I1106 06:32:20.457806       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:32:31.068 I1106 06:32:31.068193       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:32:41.542 I1106 06:32:41.542084       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:32:52.057 I1106 06:32:52.057269       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:33:02.541 I1106 06:33:02.540962       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:33:12.997 I1106 06:33:12.996980       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:33:23.419 I1106 06:33:23.419709       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:33:33.858 I1106 06:33:33.858363       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:33:44.439 I1106 06:33:44.439246       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:33:54.848 I1106 06:33:54.848388       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:34:05.303 I1106 06:34:05.303319       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:34:15.889 I1106 06:34:15.889135       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:34:26.363 I1106 06:34:26.362880       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:34:37.018 I1106 06:34:37.018566       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:34:47.533 I1106 06:34:47.533476       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted

The redacted vSphere ID is always the same and corresponds to the vSphere ID of the node that was being scaled down. The logs contained plenty of other output, which we removed for brevity; in our opinion these lines are enough to understand what happened.

As visible from the log, CAS went through its main loop several times during a period of roughly 4 minutes, each time discovering an underutilized node and scheduling it for deletion in the Rancher management cluster. During this period Rancher was unable to reconcile the deletion, for reasons unknown to us but possibly related to high load on the Rancher manager. Whatever the case, each iteration of the loop calls the following code, which also produces the log output seen above: https://github.com/kubernetes/autoscaler/blob/55a18a301ee939a04980dd50a07064eda46aa537/cluster-autoscaler/cloudprovider/rancher/rancher_nodegroup.go#L109

which in turn calls the markMachineForDeletion function: https://github.com/kubernetes/autoscaler/blob/55a18a301ee939a04980dd50a07064eda46aa537/cluster-autoscaler/cloudprovider/rancher/rancher_nodegroup.go#L378
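Put together, the per-node flow is roughly the following (a simplified paraphrase of the linked code, not the verbatim upstream implementation; helper and field names such as machineByProviderID, replicas and minSize are abbreviations for illustration):

```go
// Simplified paraphrase of the Rancher provider's per-node scale-down flow
// (identifiers abbreviated; not the verbatim upstream code).
func (ng *nodeGroup) DeleteNodes(nodes []*corev1.Node) error {
	for _, node := range nodes {
		// This is the guard that eventually stopped the shrinking once the
		// minimum size was reached (see below).
		if ng.replicas-1 < ng.minSize {
			return fmt.Errorf("node group size would be below minimum: desired %d, min %d",
				ng.replicas-1, ng.minSize)
		}

		// Look up the machines.cluster.x-k8s.io object backing this node
		// (illustrative helper name).
		machine, err := ng.machineByProviderID(node.Spec.ProviderID)
		if err != nil {
			return err
		}

		klog.Infof("marking machine for deletion: %s", node.Spec.ProviderID)
		if err := ng.markMachineForDeletion(machine); err != nil {
			return err
		}

		// The pool size is decremented on every call, regardless of whether a
		// machine marked in an earlier iteration has actually been removed.
		if err := ng.setSize(ng.replicas - 1); err != nil {
			return err
		}
	}
	return nil
}
```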

markMachineForDeletion sets an annotation on the machines.cluster.x-k8s.io object in the Rancher management cluster that corresponds to the machine scheduled for deletion.
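Conceptually, that marking step looks roughly like the following sketch, assuming the usual dynamic-client pattern; the deleteAnnotation key, the machineGVR variable, and the provider/client wiring are placeholders here, not the provider's actual identifiers:

```go
// Sketch: mark a machine by setting a deletion annotation on its
// machines.cluster.x-k8s.io object in the management cluster.
// deleteAnnotation and machineGVR are placeholders for the provider's
// actual annotation key and GroupVersionResource.
func (ng *nodeGroup) markMachineForDeletion(machine *unstructured.Unstructured) error {
	annotations := machine.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[deleteAnnotation] = time.Now().UTC().Format(time.RFC3339)
	machine.SetAnnotations(annotations)

	// Persist the annotated machine object via the dynamic client.
	_, err := ng.provider.client.Resource(machineGVR).
		Namespace(machine.GetNamespace()).
		Update(context.TODO(), machine, metav1.UpdateOptions{})
	return err
}
```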

Afterwards, during each iteration the node group size is reduced by one via the setSize function: https://github.com/kubernetes/autoscaler/blob/55a18a301ee939a04980dd50a07064eda46aa537/cluster-autoscaler/cloudprovider/rancher/rancher_nodegroup.go#L251. The setSize function operates on the clusters.provisioning.cattle.io object in the management cluster, which contains a list of machine pools, each with a quantity field governing the number of machines in that pool: https://github.com/kubernetes/autoscaler/blob/55a18a301ee939a04980dd50a07064eda46aa537/cluster-autoscaler/cloudprovider/rancher/rancher_nodegroup.go#L260
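A similarly hedged sketch of what setSize effectively does, assuming the provisioning.cattle.io/v1 Cluster layout with spec.rkeConfig.machinePools[].quantity; clusterGVR, the config fields, and the error handling are illustrative:

```go
// Sketch: update the quantity of the matching machine pool on the
// clusters.provisioning.cattle.io object in the management cluster.
func (ng *nodeGroup) setSize(size int) error {
	cluster, err := ng.provider.client.Resource(clusterGVR).
		Namespace(ng.provider.config.ClusterNamespace).
		Get(context.TODO(), ng.provider.config.ClusterName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	pools, found, err := unstructured.NestedSlice(cluster.Object, "spec", "rkeConfig", "machinePools")
	if err != nil || !found {
		return fmt.Errorf("unable to read machine pools: %v", err)
	}

	// Find this node group's pool and overwrite its quantity.
	for i, p := range pools {
		pool, ok := p.(map[string]interface{})
		if !ok || pool["name"] != ng.name {
			continue
		}
		pool["quantity"] = int64(size)
		pools[i] = pool
	}

	if err := unstructured.SetNestedSlice(cluster.Object, pools, "spec", "rkeConfig", "machinePools"); err != nil {
		return err
	}

	if _, err := ng.provider.client.Resource(clusterGVR).
		Namespace(cluster.GetNamespace()).
		Update(context.TODO(), cluster, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// Track the new desired size locally.
	ng.replicas = size
	return nil
}
```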

In our case, since reconciliation of the clusters.provisioning.cattle.io object was stuck on the Rancher side for several minutes, the machine was not deleted and CAS gradually reduced the size of the machine pool until it reached its minimum size, at which point the check at https://github.com/kubernetes/autoscaler/blob/55a18a301ee939a04980dd50a07064eda46aa537/cluster-autoscaler/cloudprovider/rancher/rancher_nodegroup.go#L110 stopped it.

At some point, Rancher was able to reconcile its clusters.provisioning.cattle.io resource, which by then had been modified to the point where the quantity parameter was 1, and Rancher deleted all but one of the nodes in the machine pool.

In our opinion this is a critical oversight in the implementation of the scale-down process: even in a far less extreme scenario, where Rancher merely fails to delete the node within the time frame between two CAS loops (which run every 10s), CAS would delete more nodes than required. The interface definition of NodeGroup clearly states that DeleteNodes should wait for the node to be deleted.
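For reference, the relevant part of the contract (paraphrased from the cloudprovider.NodeGroup interface in cluster-autoscaler/cloudprovider/cloud_provider.go, not quoted verbatim):

```go
// Paraphrased excerpt of the cloudprovider.NodeGroup interface.
type NodeGroup interface {
	// DeleteNodes deletes nodes from this node group. The implementation
	// is expected to wait until the node group size is updated.
	DeleteNodes([]*apiv1.Node) error

	// ... other methods omitted ...
}
```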

We are aware that setting the minimum size of the node pool to a value closer to the actual number of nodes would have mitigated the impact somewhat (and for now we have adapted our node groups accordingly as a temporary workaround). However, this should not be required; instead, the DeleteNodes function should be made idempotent so that repeated calls for the same node cannot keep shrinking the pool.

Suggestion: A possible fix would be to add a check to the markMachineForDeletion function that determines whether the deletion annotation is already set on the machine object; if so, it would simply update the timestamp and return a custom error, which could in turn be used to skip the setSize call. If this suggestion is acceptable, we would be ready to provide a corresponding PR.
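To make the suggestion concrete, here is a minimal sketch of what this could look like; the sentinel error name, the placeholder deleteAnnotation/machineGVR identifiers, and the exact wiring are ours, not an agreed-upon design:

```go
// errMachineAlreadyMarked signals that an earlier DeleteNodes call already
// requested deletion of this machine (hypothetical sentinel error).
var errMachineAlreadyMarked = errors.New("machine already marked for deletion")

func (ng *nodeGroup) markMachineForDeletion(machine *unstructured.Unstructured) error {
	annotations := machine.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}

	_, alreadyMarked := annotations[deleteAnnotation]

	// Set (or refresh) the deletion annotation either way.
	annotations[deleteAnnotation] = time.Now().UTC().Format(time.RFC3339)
	machine.SetAnnotations(annotations)

	if _, err := ng.provider.client.Resource(machineGVR).
		Namespace(machine.GetNamespace()).
		Update(context.TODO(), machine, metav1.UpdateOptions{}); err != nil {
		return err
	}

	if alreadyMarked {
		// Tell the caller that nothing new was marked so it can skip setSize.
		return errMachineAlreadyMarked
	}
	return nil
}
```

DeleteNodes would then treat errMachineAlreadyMarked as "deletion already in flight" (e.g. via errors.Is) and skip the setSize call for that node, so a deletion that is stuck on the Rancher side decrements the machine pool at most once.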

How to reproduce it (as minimally and precisely as possible):

Reproducing this error is tricky, since it would require somehow artificially preventing Rancher from reconciling its clusters.provisioning.cattle.io resource. Nevertheless, from a logical standpoint the code clearly shows that this can happen.

Anything else we need to know?:

pvlkov commented 2 weeks ago

/area cluster-autoscaler

pvlkov commented 2 weeks ago

/area provider/rancher