kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Not scaling up #6316

Closed RazaGR closed 9 months ago

RazaGR commented 9 months ago

For some reason CA is not scaling up. I always see longUnregistered values between 1 and 4. Here are the logs and info:

kubectl get -n kube-system configmap cluster-autoscaler-status -o yaml
apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2023-11-24 10:25:10.316990961 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=6 unready=0 notStarted=0 longNotStarted=0 registered=6 longUnregistered=1)
                   LastProbeTime:      2023-11-24 10:25:10.061622308 +0000 UTC m=+4514.931904993
                   LastTransitionTime: 2023-11-24 09:11:02.53461218 +0000 UTC m=+67.404894764
      ScaleUp:     InProgress (ready=6 registered=6)
                   LastProbeTime:      2023-11-24 10:25:10.061622308 +0000 UTC m=+4514.931904993
                   LastTransitionTime: 2023-11-24 10:04:34.898806071 +0000 UTC m=+3279.769088711
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2023-11-24 10:25:10.061622308 +0000 UTC m=+4514.931904993
                   LastTransitionTime: 2023-11-24 09:11:02.53461218 +0000 UTC m=+67.404894764

    NodeGroups:
      Name:        nodes.production-cluster.k8s.local
      Health:      Healthy (ready=1 unready=0 notStarted=0 longNotStarted=0 registered=1 longUnregistered=1 cloudProviderTarget=5 (minSize=3, maxSize=12))
                   LastProbeTime:      2023-11-24 10:25:10.061622308 +0000 UTC m=+4514.931904993
                   LastTransitionTime: 2023-11-24 09:26:11.644902539 +0000 UTC m=+976.515185174
      ScaleUp:     InProgress (ready=1 cloudProviderTarget=5)
                   LastProbeTime:      2023-11-24 10:25:10.061622308 +0000 UTC m=+4514.931904993
                   LastTransitionTime: 2023-11-24 10:14:09.688949574 +0000 UTC m=+3854.559232154
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2023-11-24 10:25:10.061622308 +0000 UTC m=+4514.931904993
                   LastTransitionTime: 2023-11-24 09:11:02.53461218 +0000 UTC m=+67.404894764

      Name:        monitoring.production-cluster.k8s.local
      Health:      Healthy (ready=2 unready=0 notStarted=0 longNotStarted=0 registered=2 longUnregistered=0 cloudProviderTarget=2 (minSize=1, maxSize=3))
                   LastProbeTime:      2023-11-24 10:25:10.061622308 +0000 UTC m=+4514.931904993
                   LastTransitionTime: 2023-11-24 09:11:02.53461218 +0000 UTC m=+67.404894764
      ScaleUp:     Backoff (ready=2 cloudProviderTarget=2)
                   LastProbeTime:      2023-11-24 10:25:10.061622308 +0000 UTC m=+4514.931904993
                   LastTransitionTime: 2023-11-24 10:19:39.998068853 +0000 UTC m=+4184.868351539
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2023-11-24 10:25:10.061622308 +0000 UTC m=+4514.931904993
                   LastTransitionTime: 2023-11-24 09:11:02.53461218 +0000 UTC m=+67.404894764
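
In the status above, nodes.production-cluster.k8s.local reports ready=1 and longUnregistered=1 while cloudProviderTarget=5, which suggests the desired capacity was raised but the new instances never registered as nodes. A rough cross-check (assuming the ASG carries the same name as the node group, which is how kops normally names them, and the eu-west-1 region from the deployment below) is to compare what the API server sees with what the ASG thinks it launched:

# nodes the API server knows about
kubectl get nodes -o wide

# instances the ASG believes it is running (ASG name assumed to match the node group name)
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names nodes.production-cluster.k8s.local \
  --region eu-west-1 \
  --query 'AutoScalingGroups[0].{Desired:DesiredCapacity,Instances:Instances[].InstanceId}'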

Pod logs for the autoscaler:

E1124 10:28:04.696511       1 utils.go:60] pod.Status.StartTime is nil for pod happy-bunny-toolbox-1700787600-p9gzh. Should not reach here.
I1124 10:28:04.696557       1 filter_out_schedulable.go:125] Pod exiled-octopus-subscription-6dff8bd74-5mdbl marked as unschedulable can be scheduled on upcoming node template-node-for-nodes.production-cluster.k8s.local-5412385053388806738-1. Ignoring in scale up.
I1124 10:28:04.696577       1 filter_out_schedulable.go:125] Pod happy-bunny-toolbox-backup-1700787600-p9gzh marked as unschedulable can be scheduled on upcoming node template-node-for-nodes.production-cluster.k8s.local-5412385053388806738-0. Ignoring in scale up.

and the deployment:

kubectl describe deployment cluster-autoscaler
Name:                   cluster-autoscaler
Namespace:              kube-system
CreationTimestamp:      Fri, 24 Jul 2020 12:00:34 +0200
Labels:                 app=cluster-autoscaler
Annotations:            deployment.kubernetes.io/revision: 4
Selector:               app=cluster-autoscaler
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=cluster-autoscaler
  Annotations:      prometheus.io/port: 8085
                    prometheus.io/scrape: true
  Service Account:  cluster-autoscaler
  Containers:
   cluster-autoscaler:
    Image:      gcr.io/google-containers/cluster-autoscaler:v1.17.1
    Port:       <none>
    Host Port:  <none>
    Command:
      ./cluster-autoscaler
      --v=4
      --expander=least-waste
      --stderrthreshold=info
      --cloud-provider=aws
      --skip-nodes-with-local-storage=false
      --nodes=3:12:nodes.production-cluster.k8s.local
      --nodes=1:3:test.production-cluster.k8s.local
    Limits:
      cpu:     200m
      memory:  900Mi
    Requests:
      cpu:     100m
      memory:  600Mi
    Environment:
      AWS_REGION:  eu-west-1
    Mounts:
      /etc/ssl/certs/ca-certificates.crt from ssl-certs (ro)
  Volumes:
   ssl-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ssl/certs/ca-certificates.crt
    HostPathType:
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   cluster-autoscaler-5c6d95db75 (1/1 replicas created)
Events:          <none>

and some entries for various pods like this:

I1124 10:34:03.491670       1 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"myservice", Name:"myservice-7b694dbd4b-ns4nr", UID:"15a3ca9c-4059-4b91-a92d-607fb31a4fb2", APIVersion:"v1", ResourceVersion:"741944579", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 in backoff after failed scale-up, 1 max node group size reached
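
The "in backoff after failed scale-up" part indicates the autoscaler already tried to grow a group and the attempt did not produce ready nodes in time, so the group is temporarily excluded from scale-up. To see what happened on the AWS side, the ASG's recent scaling activities can be inspected (ASG name and region assumed, as above):

aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name nodes.production-cluster.k8s.local \
  --max-items 10 \
  --region eu-west-1 \
  --query 'Activities[].{Status:StatusCode,Cause:Cause,Description:Description}'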

Another thing I noticed: pods which can't be scheduled have this message:

Pod ingress-nginx-controller-b85b758cc-k87vw marked as unschedulable can be scheduled on upcoming node template-node-for-nodes.production-cluster.k8s.local-6801475017901952134-1. Ignoring in scale up.

and when I check the logs again a minute later, the node name has changed:

Pod ingress-nginx-controller-b85b758cc-k87vw marked as unschedulable can be scheduled on upcoming node template-node-for-nodes.production-cluster.k8s.local-5617487959455756375-0. Ignoring in scale up.
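
The "upcoming node" named template-node-for-... appears to be only an in-memory template that gets a fresh suffix on every loop, so a name that keeps changing usually means the real instances behind it never join the cluster. If the ASG did launch instances that never registered, their console output often shows why bootstrapping failed:

# instance id below is a placeholder for one of the unregistered instances
aws ec2 get-console-output \
  --instance-id i-0123456789abcdef0 \
  --region eu-west-1 \
  --output text | tail -n 50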

Edit:

I am also seeing these in the logs:

Node group nodes.production-cluster.k8s.local is not ready for scaleup - backoff
Kind:"Pod", ... 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 in backoff after failed scale-up
...
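
To see how long the group has been in backoff and what the last failure was, filtering the autoscaler's own logs is usually enough (this reuses the deployment shown above):

kubectl -n kube-system logs deployment/cluster-autoscaler | grep -iE 'backoff|failed scale-up' | tail -n 20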

What can I do?

RazaGR commented 9 months ago

After updating the Kops InstanceGroup image, it is fixed.
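
For reference, a minimal sketch of how that image update is typically rolled out with kops, assuming the cluster name from this issue, an instance group called nodes, and KOPS_STATE_STORE already pointing at the state bucket:

# change the image: field to a current AMI/image name
kops edit ig nodes --name production-cluster.k8s.local

# apply the change and replace the existing instances
kops update cluster --name production-cluster.k8s.local --yes
kops rolling-update cluster --name production-cluster.k8s.local --yes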