kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Cluster Autoscaler not backing off exhausted node group #6240

Closed elohmeier closed 1 month ago

elohmeier commented 10 months ago

Which component are you using?:

Cluster Autoscaler

What version of the component are you using?:

registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.4

Component version:

What k8s version are you using (kubectl version)?:

kubectl version output:
$ kubectl version
Client Version: v1.26.6+k3s1
Kustomize Version: v4.5.7
Server Version: v1.26.6+k3s1

What environment is this in?:

Hetzner Cloud

What did you expect to happen?:

When the cluster autoscaler is configured with the priority expander and multiple node groups of differing priorities are provided, it should back off after some time if the cloud provider fails to provision nodes in the high-priority node group due to resource unavailability, and proceed to lower-priority node groups.
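
As an illustration of the expected behaviour (a simplified sketch, not the actual Cluster Autoscaler implementation; all names here are hypothetical): a failed scale-up should put the group into backoff so that the next iteration skips it and a lower-priority group is selected.

package main

import "fmt"

// Illustrative sketch only: node groups are tried in descending priority,
// a failed scale-up backs the group off, and backed-off groups are skipped.
type nodeGroup struct {
	name      string
	priority  int
	backedOff bool
	hasStock  bool
}

// scaleUp simulates a provisioning attempt against the cloud provider.
func scaleUp(g *nodeGroup) error {
	if !g.hasStock {
		return fmt.Errorf("resource_unavailable in %s", g.name)
	}
	return nil
}

// pickAndScale selects the highest-priority group that is not backed off and
// tries to scale it up, backing it off on failure.
func pickAndScale(groups []*nodeGroup) {
	var best *nodeGroup
	for _, g := range groups {
		if g.backedOff {
			continue // exhausted groups are skipped for a while
		}
		if best == nil || g.priority > best.priority {
			best = g
		}
	}
	if best == nil {
		fmt.Println("no eligible node group")
		return
	}
	if err := scaleUp(best); err != nil {
		fmt.Printf("scale-up of %s failed (%v), backing off\n", best.name, err)
		best.backedOff = true
		return
	}
	fmt.Printf("scaled up %s\n", best.name)
}

func main() {
	groups := []*nodeGroup{
		{name: "pool1", priority: 20, hasStock: false}, // exhausted region
		{name: "pool2", priority: 10, hasStock: true},
	}
	pickAndScale(groups) // pool1 fails and is backed off
	pickAndScale(groups) // pool2 is picked on the next iteration
}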

What happened instead?:

The high-priority node group (pool1 in the log below) currently has no resources available to provision the requested nodes. The cluster autoscaler is stuck in a loop trying to provision nodes in the high-priority group and never proceeds to pool2 (lower priority, resources available). I've also tried setting --max-node-group-backoff-duration=1m, with no effect.

W1101 05:15:05.399825       1 hetzner_servers_cache.go:94] Fetching servers from Hetzner API
I1101 05:15:15.806488       1 hetzner_node_group.go:438] Set node group draining-node-pool size from 0 to 0, expected delta 0
I1101 05:15:15.806519       1 hetzner_node_group.go:438] Set node group pool1 size from 1 to 1, expected delta 0
I1101 05:15:15.806525       1 hetzner_node_group.go:438] Set node group pool2 size from 0 to 0, expected delta 0
I1101 05:15:15.808727       1 scale_up.go:608] Scale-up: setting group pool1 size to 4
E1101 05:15:16.068533       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:16.079704       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:16.126786       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
W1101 05:15:16.126816       1 hetzner_servers_cache.go:94] Fetching servers from Hetzner API
I1101 05:15:26.655179       1 hetzner_node_group.go:438] Set node group pool1 size from 1 to 1, expected delta 0
I1101 05:15:26.655243       1 hetzner_node_group.go:438] Set node group pool2 size from 0 to 0, expected delta 0
I1101 05:15:26.655257       1 hetzner_node_group.go:438] Set node group draining-node-pool size from 0 to 0, expected delta 0
I1101 05:15:26.660093       1 scale_up.go:608] Scale-up: setting group pool1 size to 4
E1101 05:15:26.948368       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:26.981452       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:27.044150       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)

How to reproduce it (as minimally and precisely as possible):

apiVersion: v1
data:
  priorities: |
    10:
      - pool2
    20:
      - pool1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - command:
        - ./cluster-autoscaler
        - --scale-down-unneeded-time=5m
        - --cloud-provider=hetzner
        - --stderrthreshold=info
        - --nodes=0:4:CCX43:FSN1:pool1
        - --nodes=0:4:CCX43:NBG1:pool2
        - --expander=priority
        env:
        - name: HCLOUD_IMAGE
          value: debian-11
        - name: HCLOUD_TOKEN
          valueFrom:
            secretKeyRef:
              key: token
              name: hcloud
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.4
        name: cluster-autoscaler
      serviceAccountName: cluster-autoscaler

Anything else we need to know?:

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

elohmeier commented 6 months ago

/remove-lifecycle stale

apricote commented 4 months ago

Is there something we need to do as the cloudprovider to make this possible? Otherwise this looks like an area/core-autoscaler issue.

tallaxes commented 4 months ago

@apricote Looking at the relevant provider code, it seems possible that on failure it logs a message but neglects to return the error from IncreaseSize. This means the core autoscaler has no indication that the scale-up failed. (More generally, this could affect any provider that neglects to report an error from IncreaseSize ...)
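
For illustration, a minimal sketch of that pattern (hypothetical names, not the actual cloudprovider/hetzner code): the provisioning error is logged inside IncreaseSize but nil is returned, so the core autoscaler sees a successful scale-up and never backs the group off.

package main

import (
	"errors"
	"fmt"
	"log"
)

// hetznerNodeGroup is a stand-in for a provider's NodeGroup implementation;
// the names here are illustrative, not the real provider code.
type hetznerNodeGroup struct {
	name       string
	targetSize int
}

// createServer simulates a cloud API call that fails with a capacity error.
func createServer(group string) error {
	return errors.New("resource_unavailable: unable to provision servers for this location")
}

// IncreaseSize shows the problematic pattern: the provisioning error is only
// logged and nil is returned, so the caller assumes the scale-up succeeded
// and never puts the node group into backoff.
func (n *hetznerNodeGroup) IncreaseSize(delta int) error {
	for i := 0; i < delta; i++ {
		if err := createServer(n.name); err != nil {
			log.Printf("failed to create server: %v", err) // error swallowed here
		}
	}
	n.targetSize += delta
	return nil // core autoscaler never learns that provisioning failed
}

func main() {
	g := &hetznerNodeGroup{name: "pool1", targetSize: 1}
	if err := g.IncreaseSize(3); err == nil {
		fmt.Println("scale-up reported as successful despite provisioning failures")
	}
}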

apricote commented 4 months ago

Thanks for the hint @tallaxes! I opened a PR to properly return encountered errors.
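
For completeness, a hedged sketch of what such a fix looks like, reusing the hypothetical types from the sketch above (this is not the actual PR): the error is propagated out of IncreaseSize so the core autoscaler can register the failed scale-up and back the group off.

// Drop-in replacement for the IncreaseSize method in the sketch above: the
// provisioning error is wrapped and returned, so the core autoscaler can mark
// the scale-up as failed, put pool1 into backoff, and let the priority
// expander fall through to pool2.
func (n *hetznerNodeGroup) IncreaseSize(delta int) error {
	for i := 0; i < delta; i++ {
		if err := createServer(n.name); err != nil {
			return fmt.Errorf("could not increase size of node group %q: %w", n.name, err)
		}
		n.targetSize++
	}
	return nil
}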