kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

GKE cluster autoscaler doesn't handle virtual kubelet nodes #6704

Open marwanad opened 4 months ago

marwanad commented 4 months ago

Which component are you using?: cluster-autoscaler

What version of the component are you using?:

Component version: 1.27 - managed GKE

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version

What environment is this in?: GKE

What did you expect to happen?: Cluster autoscaler should ignore virtual nodes in the cluster and initialize successfully.

What happened instead?: Cluster autoscaler is stuck in the initializing state.

Name:         cluster-autoscaler-status
Namespace:    kube-system
Labels:       <none>
Annotations:  cluster-autoscaler.kubernetes.io/last-updated: 2024-04-12 02:39:18.837676205 +0000 UTC

Data
====
status:
----
Cluster-autoscaler status at 2024-04-12 02:39:18.837676205 +0000 UTC:
Initializing

BinaryData
====

Events:  <none>

I don't have access to the control plane logs, but these symptoms align with the observations.

How to reproduce it (as minimally and precisely as possible): Run some VK workload, or create a node with no ProviderId.
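To check whether a cluster is in the state described above, it can help to list the nodes that have no provider ID. The following is a minimal sketch, not part of the autoscaler: it assumes the output of `kubectl get nodes -o json` is passed in as bytes, and the function name `nodesWithoutProviderID` is hypothetical. It relies only on the standard `metadata.name` and `spec.providerID` fields of a Node object.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeList mirrors only the fields of a Kubernetes NodeList that this
// sketch needs; everything else in the JSON is ignored.
type nodeList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			ProviderID string `json:"providerID"`
		} `json:"spec"`
	} `json:"items"`
}

// nodesWithoutProviderID (hypothetical helper) returns the names of all
// nodes whose spec.providerID is missing or empty.
func nodesWithoutProviderID(data []byte) ([]string, error) {
	var list nodeList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, fmt.Errorf("decoding node list: %w", err)
	}
	var names []string
	for _, item := range list.Items {
		if item.Spec.ProviderID == "" {
			names = append(names, item.Metadata.Name)
		}
	}
	return names, nil
}
```

Any node name this returns is a candidate for the behavior reported in this issue, for example a virtual kubelet node.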

Anything else we need to know?:

It seems that the GCE (and potentially GKE) provider doesn't handle nil provider IDs and returns an error:

https://github.com/kubernetes/autoscaler/blob/8273c9ce0b8d556554a34e927fd430e6558547fa/cluster-autoscaler/cloudprovider/gce/gce_cloud_provider.go#L101

I don't have access to the control plane logs for autoscaler but deleting the VK node seems to get it "unstuck". A lot of the main StaticAutoscaler routines call NodeGroupForNode.
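One possible shape for the expected behavior is to treat a node without a provider ID as "not managed by any node group" rather than as an error. The sketch below is illustrative only and is not the actual provider code: the `Node` and `instanceRef` types, the `gce://<project>/<zone>/<name>` ID format, and the `nodeGroupForNode` signature are all simplified assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical, trimmed-down stand-in for a Kubernetes Node.
type Node struct {
	Name       string
	ProviderID string // empty for the virtual kubelet nodes in this issue
}

// instanceRef is a hypothetical parsed provider ID.
type instanceRef struct {
	Project, Zone, Name string
}

const providerPrefix = "gce://"

// parseProviderID parses IDs of the assumed form gce://<project>/<zone>/<name>.
func parseProviderID(id string) (instanceRef, error) {
	if !strings.HasPrefix(id, providerPrefix) {
		return instanceRef{}, fmt.Errorf("unexpected provider ID %q", id)
	}
	parts := strings.Split(strings.TrimPrefix(id, providerPrefix), "/")
	if len(parts) != 3 {
		return instanceRef{}, fmt.Errorf("unexpected provider ID %q", id)
	}
	return instanceRef{Project: parts[0], Zone: parts[1], Name: parts[2]}, nil
}

// nodeGroupForNode looks up the node group that owns a node.
// For a node with no provider ID it returns ok=false and a nil error,
// so callers can skip the node instead of failing the whole loop.
func nodeGroupForNode(node Node, groups map[instanceRef]string) (group string, ok bool, err error) {
	if node.ProviderID == "" {
		return "", false, nil
	}
	ref, err := parseProviderID(node.ProviderID)
	if err != nil {
		return "", false, err
	}
	group, ok = groups[ref]
	return group, ok, nil
}
```

With this guard, a malformed but non-empty provider ID still surfaces as an error, while a virtual node is simply reported as unmanaged.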

/area/cluster-autoscaler /area/provider/gke

marwanad commented 4 months ago

Seems like deleting any one VK node unblocks it, at least. It doesn't have to be all of them, which is interesting.

adrianmoisey commented 1 month ago

/area cluster-autoscaler