kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Scale down works but scale up gives pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector #3503

Closed: nitingadekar closed this issue 3 years ago

nitingadekar commented 4 years ago

I have created custom nodes in a Rancher (AWS provider) cluster following this doc: https://rancher.com/docs/rancher/v2.x/en/cluster-admin/cluster-autoscaler/amazon/

After deploying cluster-autoscaler, both with node auto-discovery and with a single node group, using the tags and configuration below, scale-down works fine but scale-up gives:

I0910 18:42:33.914842       1 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"test-69c4886596-pssfj", UID:"bb1cd249-0fd3-42b9-bc01-f22b0bbafa64", APIVersion:"v1", ResourceVersion:"11897361", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector

I have verified the known causes of the above error:

  1. The ASG max is 10 and the desired capacity is 1, so the max has not been reached yet.
  2. The only pod running in the node group requests the resources below, and the node type is c5.xlarge (8 GB RAM, 4 vCPU), so the pod is far smaller than the node and should fit if a new node is added.
    Limits:
      cpu:     100m
      memory:  500Mi
    Requests:
      cpu:        100m
      memory:     300Mi

Tags on the ASG and its nodes:

aws:autoscaling:groupName:test-asg
k8s.io/cluster-autoscaler/enabled:true
k8s.io/cluster-autoscaler/cluster-acc:true 
kubernetes.io/cluster/c-fdwnz:owned

Cluster-autoscaler command with auto-discovery:

    Command:
      ./cluster-autoscaler
      --cloud-provider=aws
      --namespace=kube-system
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/cluster-acc
      --logtostderr=true
      --stderrthreshold=info
      --v=4
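
For reference, the same flags expressed as a Deployment container spec (a minimal sketch: the image tag and service account name are placeholders, and the RBAC objects are omitted):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cluster-autoscaler
      namespace: kube-system
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: cluster-autoscaler
      template:
        metadata:
          labels:
            app: cluster-autoscaler
        spec:
          serviceAccountName: cluster-autoscaler                      # placeholder
          containers:
          - name: cluster-autoscaler
            image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.17.4  # placeholder tag
            command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --namespace=kube-system
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/cluster-acc
            - --logtostderr=true
            - --stderrthreshold=info
            - --v=4

With asg:tag=<key1>,<key2>, the autoscaler discovers any ASG that carries all of the listed tag keys; the tag values are ignored.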

Cluster-autoscaler command with a single node group:

    Command:
      ./cluster-autoscaler
      --v=4
      --stderrthreshold=info
      --cloud-provider=aws
      --skip-nodes-with-local-storage=false
      --nodes=1:10:test-asg

In both cases the issue is the same: scale-down works, but scale-up is not triggered.

As for scale-down, the autoscaler does scale down nodes that are vacant or underutilized.

I0911 06:15:24.571353       1 scale_down.go:442] Skipping ip-172-39-0-122.us-east-2.compute.internal from delete consideration - the node is currently being deleted
I0911 06:15:24.571589       1 static_autoscaler.go:439] Scale down status: unneededOnly=false lastScaleUpTime=2020-09-10 15:14:40.551745861 +0000 UTC m=+39.696645101 lastScaleDownDeleteTime=2020-09-10 15:14:40.551745941 +0000 UTC m=+39.696645189 lastScaleDownFailTime=2020-09-10 15:14:40.551746029 +0000 UTC m=+39.696645269 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0911 06:15:24.571618       1 static_autoscaler.go:452] Starting scale down
I0911 06:15:24.571664       1 scale_down.go:776] No candidates for scale down
I0911 06:15:33.206749       1 reflector.go:419] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:328: Watch close - *v1.Job total 0 items received
I0911 06:15:34.185430       1 node_tree.go:100] Removed node "ip-172-39-0-122.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2a" from NodeTree

As the error message is not clear enough to pinpoint the root cause, I have no workaround for this issue. Any help or suggestion will be appreciated; thanks in advance.

korjek commented 3 years ago

@nitingadekar What's your pod specification? Could you please provide the output of kubectl describe pod YOUR_POD_HERE? And what other pods are running in the cluster? (kubectl get pods -A)

nitingadekar commented 3 years ago

@korjek The pod description is below; the pod goes into Pending once scaled beyond the node capacity. I have renamed the services to avoid sharing client data.

$ kubectl describe pod foo-54c8d67c97-4t28z -n bar
Name:           foo-54c8d67c97-4t28z
Namespace:      bar
Priority:       0
Node:           <none>
Labels:         app.kubernetes.io/instance=foo
                app.kubernetes.io/name=foo
                pod-template-hash=54c8d67c97
Annotations:    prometheus.io/path: /actuator/prometheus
                prometheus.io/port: 9001
                prometheus.io/scrape: true
Status:         Pending
IP:             
Controlled By:  ReplicaSet/foo-54c8d67c97
Containers:
  iom:
    Image:      registry.gitlab.com/prorepo/foo:v3.2.0-rc1
    Port:       9001/TCP
    Host Port:  0/TCP
    Args:
      --spring.profiles.active=k8-bar
      -Xms150m
      -Xmx285m
    Limits:
      cpu:     300m
      memory:  300Mi
    Requests:
      cpu:        100m
      memory:     150Mi
    Liveness:     exec [echo hello world] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    exec [echo hello world] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from foo-token-b6kgl (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  foo-token-b6kgl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  foo-token-b6kgl
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/hostname=bar-common
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age        From                Message
  ----     ------             ----       ----                -------
  Warning  FailedScheduling   <unknown>  default-scheduler   0/20 nodes are available: 1 Insufficient cpu, 19 node(s) didn't match node selector.
  Warning  FailedScheduling   <unknown>  default-scheduler   0/20 nodes are available: 1 Insufficient cpu, 19 node(s) didn't match node selector.
  Normal   NotTriggerScaleUp  28s        cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 12 node(s) didn't match node selector

Other services running in the cluster.

$ kubectl get pod -A -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
service1-856544f6fb-tf87d   1/1     Running   0          7d19h   10.42.44.24   ip-172-39-0-140.us-east-2.compute.internal   <none>           <none>
service2-56bc85bf4b-hktpw       1/1     Running   0          12d     10.42.44.22   ip-172-39-0-140.us-east-2.compute.internal   <none>           <none>
foo-54c8d67c97-4t28z      0/1     Pending   0          6m31s   <none>        <none>                                       <none>           <none>
foo-54c8d67c97-68rf8      1/1     Running   0          6m31s   10.42.44.25   ip-172-39-0-140.us-east-2.compute.internal   <none>           <none>
foo-54c8d67c97-d74kd      1/1     Running   0          6m31s   10.42.44.27   ip-172-39-0-140.us-east-2.compute.internal   <none>           <none>
foo-54c8d67c97-ftd52      1/1     Running   0          6m31s   10.42.44.26   ip-172-39-0-140.us-east-2.compute.internal   <none>           <none>
foo-54c8d67c97-nl66q      1/1     Running   0          6m31s   10.42.44.29   ip-172-39-0-140.us-east-2.compute.internal   <none>           <none>
foo-54c8d67c97-v7smg      1/1     Running   0          15d     10.42.44.15   ip-172-39-0-140.us-east-2.compute.internal   <none>           <none>
foo-54c8d67c97-xl6kr      1/1     Running   0          6m31s   10.42.44.28   ip-172-39-0-140.us-east-2.compute.internal   <none>           <none>
service3-76fc85f8b9-67s4v      1/1     Running   0          3m53s   10.42.8.3     ip-172-39-0-236.us-east-2.compute.internal   <none>           <none>
service4-79fc9cb4f9-kt97z    1/1     Running   0          16h     10.42.13.3    ip-172-39-0-137.us-east-2.compute.internal   <none>           <none>
service5-66687f8cd4-49qd7     1/1     Running   0          27d     10.42.44.9    ip-172-39-0-140.us-east-2.compute.internal   <none>           <none>
service5-66687f8cd4-cdvmz     1/1     Running   0          27d     10.42.44.8    ip-172-39-0-140.us-east-2.compute.internal   <none>           <none>

For context, Rancher registers the nodes from the ASG but does not assign any node pool to them; all nodes are registered as independent nodes, irrespective of their Auto Scaling group. I suspect the missing node group/node pool is the reason for the scale-up issue. Still, scale-down works as described above.

The ASG hosting the pod uses instance type t3a.medium (4 GB RAM / 2 vCPUs), which should easily accommodate the pod above.

Let me know if you need any further details.

korjek commented 3 years ago

There is a node selector in your pod specification that refers to a node name: Node-Selectors: kubernetes.io/hostname=bar-common. I think this might be the reason: CA can't spin up a node with that specific name. Try to remove this selector, or at least use a label other than kubernetes.io/hostname=bar-common.
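
In pod-spec terms, the selector in question is this fragment of the pod template (a reconstruction from the describe output above, not the original manifest):

    # The selector that the scale-up simulation cannot satisfy for a templated new node:
    nodeSelector:
      kubernetes.io/hostname: bar-common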

nitingadekar commented 3 years ago

@korjek Based on the use case, the applications demand different instance types. In the above example there are three types of nodes, created with three ASGs, and the pods are placed on their respective nodes as follows:

node type 1 (bar-common): service1, service2, foo, service5
node type 2 (memory-asg): service3
node type 3 (cpu-asg): service4

So the foo application should trigger a scale-up in the first ASG, whose nodes carry the label kubernetes.io/hostname=bar-common.

Following your suggestion, I removed the node selector from the foo pods and scaled the replicas up to exhaust the entire cluster capacity.

It did work after scaling to 500 replicas, as shown below.

$ kubectl describe pod foo-5f887c5db-zjwhk -n bar
Name:           foo-5f887c5db-zjwhk
Namespace:      bar
Priority:       0
Node:           ip-172-39-0-24.us-east-2.compute.internal/172.39.0.24
Start Time:     Sun, 20 Sep 2020 01:29:54 +0530
Labels:         app.kubernetes.io/instance=foo
                app.kubernetes.io/name=foo
                pod-template-hash=5f887c5db
Annotations:    cni.projectcalico.org/podIP: 10.42.11.46/32
Status:         Running
IP:             10.42.11.46
Controlled By:  ReplicaSet/foo-5f887c5db
Containers:
  foo:
    Container ID:  docker://9af8a6889f0f1c25e64ffe92921a3d891792d7f19f6809d674f66e726c28582e
    Image:         registry.gitlab.com/prorepo/foo:v1
    Image ID:      docker-pullable://registry.gitlab.com/prorepo/foo@sha256:efb97fd684d4f48da9690a781f6254f78fa16488c05bb46c55657e3fc5e9cba7
    Port:          9001/TCP
    Host Port:     0/TCP
    Args:
      --spring.profiles.active=k8-bar
    State:          Running
      Started:      Sun, 20 Sep 2020 01:29:57 +0530
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     300m
      memory:  300Mi
    Requests:
      cpu:        100m
      memory:     150Mi
    Liveness:     exec [echo hello world] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    exec [echo hello world] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from foo-token-b6kgl (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  foo-token-b6kgl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  foo-token-b6kgl
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age        From                                                Message
  ----     ------            ----       ----                                                -------
  Warning  FailedScheduling  <unknown>  default-scheduler                                   0/19 nodes are available: 11 Insufficient cpu, 4 Insufficient memory, 4 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>  default-scheduler                                   0/24 nodes are available: 16 Insufficient cpu, 4 Insufficient memory, 4 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>  default-scheduler                                   0/19 nodes are available: 11 Insufficient cpu, 4 Insufficient memory, 4 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>  default-scheduler                                   0/21 nodes are available: 11 Insufficient cpu, 4 Insufficient memory, 6 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>  default-scheduler                                   0/24 nodes are available: 11 Insufficient cpu, 4 Insufficient memory, 9 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>  default-scheduler                                   0/24 nodes are available: 11 Insufficient cpu, 4 Insufficient memory, 9 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>  default-scheduler                                   0/24 nodes are available: 16 Insufficient cpu, 4 Insufficient memory, 4 node(s) had taints that the pod didn't tolerate.
  Normal   Scheduled         <unknown>  default-scheduler                                   Successfully assigned bar/foo-5f887c5db-zjwhk to ip-172-39-0-24.us-east-2.compute.internal
  Normal   TriggeredScaleUp  26m        cluster-autoscaler                                  pod triggered scale-up: [{memory-asg 2->6 (max: 10)}]
  Normal   Pulled            21m        kubelet, ip-172-39-0-24.us-east-2.compute.internal  Container image "registry.gitlab.com/prorepo/foo:v1" already present on machine
  Normal   Created           21m        kubelet, ip-172-39-0-24.us-east-2.compute.internal  Created container foo
  Normal   Started           21m        kubelet, ip-172-39-0-24.us-east-2.compute.internal  Started container foo

Now the problem is that the pod was placed in a group it was not supposed to be in. The cluster autoscaler didn't respect the node selector and only worked once the node selector was removed. The memory-asg and cpu-asg groups are reserved for special types of applications, so the foo pods should be placed in any node group other than these two, i.e., in the bar-common group.

Is the problem with the node selector, i.e., did the cluster autoscaler not recognize the bar-common ASG when the node selector was used? Or does the CA not support scaling with a node selector at all? Can you shed more light on this?

korjek commented 3 years ago

Try to use a label other than kubernetes.io/hostname. For example, add a tag to your node group (ASG) like k8s.io/cluster-autoscaler/node-template/label/pool=bar-common. Then use this label as a node selector in your pod configuration: Node-Selectors: pool=bar-common.
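
A sketch of what that would look like (the pool label key here is just an example, not an existing label in this cluster):

    # Tag on the ASG, so the autoscaler's simulated node template carries the label:
    #   Key:   k8s.io/cluster-autoscaler/node-template/label/pool
    #   Value: bar-common
    #
    # Pod template fragment (the real nodes must also carry pool=bar-common,
    # e.g. set via the kubelet's --node-labels flag, for the scheduler to place the pod):
    nodeSelector:
      pool: bar-common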

nitingadekar commented 3 years ago

Hi, thanks @korjek. The issue is with the topology key kubernetes.io/hostname: this key is ignored as a predicate during the scale-up simulation, which is why the pod was not triggering a scale-up. It worked with a custom label.

I have another situation here. When adding custom nodes from an AWS ASG to a Rancher cluster, there is a bug in the Rancher cluster agent: it assigns the custom node labels only to the first node, and if you scale the nodes further, the new nodes do not get the custom labels. However, by setting the hostname, which creates the kubernetes.io/hostname=bar-common label, I do get that key/value pair on the new nodes. So for triggering the scale-up I need the custom label, and for placement of the pod I need the hostname label. I have tried using node affinity to tackle this, but affinity does not support a condition with two keys and one value; I want to ignore the key of the label and only match the value when placing the pod.

Either the cluster autoscaler should support topology keys for the scale-up trigger, or I need a way to place the pod after scaling using an OR condition in the affinity.

Labels available for first node:

kubernetes.io/hostname=bar-common
project-role=bar-common # responsible for triggering scaleup

Label on the scaled nodes:

kubernetes.io/hostname=bar-common

nitingadekar commented 3 years ago

@korjek I found a way out for this situation: using node affinity to express an OR condition across the node selectors worked in my case. Thanks for your help. Solution for my case:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "project-role"
            operator: In
            values:
            - bar-common
        - matchExpressions:
          - key: "kubernetes.io/hostname"
            operator: In
            values:
            - bar-common

The first node selector term is what triggers the scale-up and the second is what places the pod (it seems odd, but yes). Multiple nodeSelectorTerms are ORed, while matchExpressions within a single term are ANDed, so a node matching either label satisfies the affinity.

I still want to note that the cluster autoscaler for AWS does not consider the node-selector or node-affinity predicate if system node labels like "kubernetes.io/hostname" are used for scheduling.

Please confirm whether this should be raised as a bug, or whether I should close the issue.

korjek commented 3 years ago

Still want to make a note here, that Cluster autoscaler for AWS does not consider the node-selector or node-affinity predicate if the system node labels like "kubernetes.io/hostname" are used for scheduling.

Do you have this label specified in your ASG configuration?

nitingadekar commented 3 years ago

I specified the hostname of the node while registering it in the cluster, which creates the system-generated node label kubernetes.io/hostname=bar-common when the hostname is "bar-common". My ASG has the tags below.

aws:autoscaling:groupName:test-asg
k8s.io/cluster-autoscaler/enabled:true
k8s.io/cluster-autoscaler/cluster-acc:true 
kubernetes.io/cluster/c-fdwnz:owned

korjek commented 3 years ago

I mean that if you do not have the label kubernetes.io/hostname=bar-common in your ASG configuration (and honestly I do not think it's a good idea to add the label kubernetes.io/hostname=bar-common to an ASG), then CA doesn't know that bringing up an additional node in this ASG will make the pod schedulable.

nitingadekar commented 3 years ago

The label is present, and the node selector uses that same label for scheduling. The problem is that CA ignores this label and does not consider the pod schedulable on a new node.

korjek commented 3 years ago

This label is present

Do you mean you set kubernetes.io/hostname=bar-common for all nodes (i.e., kubernetes.io/hostname has the value bar-common for ALL nodes)?

nitingadekar commented 3 years ago

Yes, this label is present by default when the node is added to the cluster.

NickBenthem commented 3 years ago

@nitingadekar - not sure if this is your issue, but you may have to add your labels as tags on the ASG in the EC2 console: https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html

Cluster Autoscaler can scale node groups to and from zero, which can yield significant cost savings. It detects the CPU, memory, and GPU resources of an Auto Scaling group by inspecting the InstanceType that is specified in its LaunchConfiguration or LaunchTemplate. Some pods require additional resources such as WindowsENI or PrivateIPv4Address or specific NodeSelectors or Taints, which can't be discovered from the LaunchConfiguration. The Cluster Autoscaler can account for these factors by discovering them from the following tags on the Auto Scaling group.

Key: k8s.io/cluster-autoscaler/node-template/resources/$RESOURCE_NAME
Value: 5
Key: k8s.io/cluster-autoscaler/node-template/label/$LABEL_KEY
Value: $LABEL_VALUE
Key: k8s.io/cluster-autoscaler/node-template/taint/$TAINT_KEY
Value: NoSchedule

You may also need to pass --kubelet-extra-args '--node-labels=key=value' in the user data of your EC2 launch configuration.
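
Applied to this thread's setup, that would look roughly like the sketch below, reusing the project-role label mentioned earlier; the user-data line assumes EKS-style bootstrap and may differ for Rancher custom nodes:

    # Tag on the bar-common ASG, so the autoscaler's simulated node template advertises the label:
    #   Key:   k8s.io/cluster-autoscaler/node-template/label/project-role
    #   Value: bar-common
    #
    # Kubelet flag in the launch template / launch configuration user data,
    # so the real nodes get the same label when they join:
    #   --kubelet-extra-args '--node-labels=project-role=bar-common'
    #
    # Pod template fragment:
    nodeSelector:
      project-role: bar-common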

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/3503#issuecomment-873527294):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

sichiba commented 1 year ago

How did you manage to resolve this issue?