kubermatic / kubeone

Kubermatic KubeOne automates cluster operations on all your cloud, on-prem, edge, and IoT environments.
https://kubeone.io
Apache License 2.0

Cluster-autoscaler addon does not create min number of nodes #1394

Closed: Tranquility closed this issue 3 years ago

Tranquility commented 3 years ago

What happened: Cluster-autoscaler logs `node group min size reached` instead of scaling up the worker nodes.

What is the expected behavior: New nodes should have been created.

How to reproduce the issue:

  1. Install cluster-autoscaler addon
  2. Install MachineDeployment:
```yaml
apiVersion: cluster.k8s.io/v1alpha1
kind: MachineDeployment
metadata:
  annotations:
    cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "6"
    cluster.k8s.io/cluster-api-autoscaler-node-group-max-size: "20"
  finalizers:
    - foregroundDeletion
  generation: 1
  name: {{ env "CLUSTER_NAME" }}-pool2
  namespace: kube-system
spec:
  minReadySeconds: 0
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      workerset: {{ env "CLUSTER_NAME" }}-pool2
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        workerset: {{ env "CLUSTER_NAME" }}-pool2
      namespace: kube-system
    spec:
      metadata:
        creationTimestamp: null
        labels:
          workerset: {{ env "CLUSTER_NAME" }}-pool2
      providerSpec:
        value:
          cloudProvider: hetzner
          cloudProviderSpec:
            image: ubuntu-20.04
            labels:
              {{ env "CLUSTER_NAME" }}-workers: pool2
            location: nbg1
            networks:
              - {{ .Config.CloudProvider.Hetzner.NetworkID }}
            serverType: ccx12
          operatingSystem: ubuntu
          operatingSystemSpec:
            distUpgradeOnBoot: false
          sshPublicKeys:
            - {{ env "HCLOUD_SSH_PUBLIC_KEY" }}
          versions:
            kubelet: {{ .Config.Versions.Kubernetes }}
```

I installed a new MachineDeployment because I couldn't get the annotations applied via the Terraform integration (I even ran `kubeone upgrade`).
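If the Terraform integration doesn't apply them, the annotations can also be set directly on the MachineDeployment with kubectl. This is only a rough sketch; the pool name is assumed from the logs below and needs to be adjusted for your cluster:

```bash
# Assumption: the MachineDeployment is named stage-001-pool2 in kube-system,
# matching the node group shown in the autoscaler logs. Adjust the name as needed.
kubectl -n kube-system annotate machinedeployments.cluster.k8s.io stage-001-pool2 \
  cluster.k8s.io/cluster-api-autoscaler-node-group-min-size="6" \
  cluster.k8s.io/cluster-api-autoscaler-node-group-max-size="20" \
  --overwrite
```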

The logs show that the min size is 6 and that only 1 node is found, yet the autoscaler still reports that the min size has been reached. Here is the relevant log output:

```
I0621 15:01:16.864337       1 clusterapi_provider.go:67] discovered node group: MachineDeployment/kube-system/stage-001-pool2 (min: 6, max: 20, replicas: 1)
I0621 15:01:17.056274       1 request.go:600] Waited for 191.712326ms due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/cluster.k8s.io/v1alpha1/namespaces/kube-system/machinedeployments/stage-001-pool1/scale
I0621 15:01:17.056380       1 round_trippers.go:432] GET https://10.96.0.1:443/apis/cluster.k8s.io/v1alpha1/namespaces/kube-system/machinedeployments/stage-001-pool1/scale
I0621 15:01:17.056402       1 round_trippers.go:438] Request Headers:
I0621 15:01:17.056416       1 round_trippers.go:442]     Accept: application/json, */*
I0621 15:01:17.056429       1 round_trippers.go:442]     User-Agent: cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format
I0621 15:01:17.056443       1 round_trippers.go:442]     Authorization: Bearer <masked>
I0621 15:01:17.063740       1 round_trippers.go:457] Response Status: 200 OK in 7 milliseconds
I0621 15:01:17.256282       1 request.go:600] Waited for 192.2447ms due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/cluster.k8s.io/v1alpha1/namespaces/kube-system/machinedeployments/stage-001-pool2/scale
I0621 15:01:17.256388       1 round_trippers.go:432] GET https://10.96.0.1:443/apis/cluster.k8s.io/v1alpha1/namespaces/kube-system/machinedeployments/stage-001-pool2/scale
I0621 15:01:17.256414       1 round_trippers.go:438] Request Headers:
I0621 15:01:17.256427       1 round_trippers.go:442]     User-Agent: cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format
I0621 15:01:17.256448       1 round_trippers.go:442]     Accept: application/json, */*
I0621 15:01:17.256462       1 round_trippers.go:442]     Authorization: Bearer <masked>
I0621 15:01:17.265289       1 round_trippers.go:457] Response Status: 200 OK in 8 milliseconds
I0621 15:01:17.266568       1 clusterapi_controller.go:556] node "stage-001-pool2-6c885b68dd-tfr9s" is in nodegroup "MachineDeployment/kube-system/stage-001-pool2"
I0621 15:01:17.266720       1 pre_filtering_processor.go:66] Skipping stage-001-pool2-6c885b68dd-tfr9s - node group min size reached
I0621 15:01:17.266906       1 pre_filtering_processor.go:57] Skipping stage-001-control-plane-1 - no node group config
I0621 15:01:17.267047       1 pre_filtering_processor.go:57] Skipping stage-001-control-plane-2 - no node group config
I0621 15:01:17.267200       1 pre_filtering_processor.go:57] Skipping stage-001-control-plane-3 - no node group config
I0621 15:01:17.267729       1 clusterapi_controller.go:556] node "stage-001-pool1-6bc7696c6d-ntcdm" is in nodegroup "MachineDeployment/kube-system/stage-001-pool1"
I0621 15:01:17.267778       1 pre_filtering_processor.go:66] Skipping stage-001-pool1-6bc7696c6d-ntcdm - node group min size reached
```

Anything else we need to know? I am using the Hetzner cloud provider, and I updated the resources to match this PR: https://github.com/kubernetes/autoscaler/pull/4020/files
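For reference, the scale subresource calls visible in the log output can be reproduced manually to confirm that machine-controller serves them; a sketch using the endpoint from the logs above (the MachineDeployment name is an assumption from my cluster):

```bash
# Query the same /scale endpoint the autoscaler hits (see the log output above)
# to confirm the scale subresource is served for the MachineDeployment.
kubectl get --raw \
  "/apis/cluster.k8s.io/v1alpha1/namespaces/kube-system/machinedeployments/stage-001-pool2/scale"
```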

Information about the environment:
KubeOne version (`kubeone version`): master
Operating system:
Provider you're deploying cluster on: hetzner
Operating system you're deploying on: ubuntu-20.04

Tranquility commented 3 years ago

Maybe I should have selected the support issue template. Feel free to change it.

xmudrii commented 3 years ago

@Tranquility I managed to reproduce the issue. I also found out that if `.spec.replicas` is lower than the min-size, the MachineDeployment will never be autoscaled, even if there are Pending pods due to insufficient memory or any other reason.

It seems like setting `.spec.replicas` to the same value as the min-size annotation (e.g. 6 in your case) mitigates the issue. I'm not sure whether this is expected or not, so I have to investigate a bit more. I'll keep this issue updated with my findings.
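As a workaround, `.spec.replicas` can be raised to the min-size value directly; a minimal sketch, assuming the MachineDeployment name from the logs and relying on the scale subresource that the autoscaler already queries:

```bash
# Workaround sketch: raise .spec.replicas to the min-size annotation value (6 here)
# so the autoscaler manages the node group. Adjust the name for your cluster.
kubectl -n kube-system scale machinedeployments.cluster.k8s.io stage-001-pool2 --replicas=6
```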

kron4eg commented 3 years ago

Setting replicas < minimum replicas is kinda useless. Apparently this logic breaks cluster-autoscaler.

kron4eg commented 3 years ago

Closing this as it's not up to KubeOne to manage or validate this.