Closed: nitingadekar closed this issue 3 years ago.
@nitingadekar What's your pod specification? Could you please provide the output of kubectl describe pod YOUR_POD_HERE? And also, what other pods are running in the cluster? (kubectl get pods -A)
@korjek The pod description is below; the pod goes into Pending once it is scaled beyond the node's capacity. Service names have been renamed to avoid sharing client data.
└─ $ ▶ k describe pod foo-54c8d67c97-4t28z -n bar
Name: foo-54c8d67c97-4t28z
Namespace: bar
Priority: 0
Node: <none>
Labels: app.kubernetes.io/instance=foo
app.kubernetes.io/name=foo
pod-template-hash=54c8d67c97
Annotations: prometheus.io/path: /actuator/prometheus
prometheus.io/port: 9001
prometheus.io/scrape: true
Status: Pending
IP:
Controlled By: ReplicaSet/foo-54c8d67c97
Containers:
iom:
Image: registry.gitlab.com/prorepo/foo:v3.2.0-rc1
Port: 9001/TCP
Host Port: 0/TCP
Args:
--spring.profiles.active=k8-bar
-Xms150m
-Xmx285m
Limits:
cpu: 300m
memory: 300Mi
Requests:
cpu: 100m
memory: 150Mi
Liveness: exec [echo hello world] delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [echo hello world] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from foo-token-b6kgl (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
foo-token-b6kgl:
Type: Secret (a volume populated by a Secret)
SecretName: foo-token-b6kgl
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/hostname=bar-common
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/20 nodes are available: 1 Insufficient cpu, 19 node(s) didn't match node selector.
Warning FailedScheduling <unknown> default-scheduler 0/20 nodes are available: 1 Insufficient cpu, 19 node(s) didn't match node selector.
Normal NotTriggerScaleUp 28s cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 12 node(s) didn't match node selector
└─ $ ▶ k get pod -A -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
service1-856544f6fb-tf87d 1/1 Running 0 7d19h 10.42.44.24 ip-172-39-0-140.us-east-2.compute.internal <none> <none>
service2-56bc85bf4b-hktpw 1/1 Running 0 12d 10.42.44.22 ip-172-39-0-140.us-east-2.compute.internal <none> <none>
foo-54c8d67c97-4t28z 0/1 Pending 0 6m31s <none> <none> <none> <none>
foo-54c8d67c97-68rf8 1/1 Running 0 6m31s 10.42.44.25 ip-172-39-0-140.us-east-2.compute.internal <none> <none>
foo-54c8d67c97-d74kd 1/1 Running 0 6m31s 10.42.44.27 ip-172-39-0-140.us-east-2.compute.internal <none> <none>
foo-54c8d67c97-ftd52 1/1 Running 0 6m31s 10.42.44.26 ip-172-39-0-140.us-east-2.compute.internal <none> <none>
foo-54c8d67c97-nl66q 1/1 Running 0 6m31s 10.42.44.29 ip-172-39-0-140.us-east-2.compute.internal <none> <none>
foo-54c8d67c97-v7smg 1/1 Running 0 15d 10.42.44.15 ip-172-39-0-140.us-east-2.compute.internal <none> <none>
foo-54c8d67c97-xl6kr 1/1 Running 0 6m31s 10.42.44.28 ip-172-39-0-140.us-east-2.compute.internal <none> <none>
service3-76fc85f8b9-67s4v 1/1 Running 0 3m53s 10.42.8.3 ip-172-39-0-236.us-east-2.compute.internal <none> <none>
service4-79fc9cb4f9-kt97z 1/1 Running 0 16h 10.42.13.3 ip-172-39-0-137.us-east-2.compute.internal <none> <none>
service5-66687f8cd4-49qd7 1/1 Running 0 27d 10.42.44.9 ip-172-39-0-140.us-east-2.compute.internal <none> <none>
service5-66687f8cd4-cdvmz 1/1 Running 0 27d 10.42.44.8 ip-172-39-0-140.us-east-2.compute.internal <none> <none>
To give you context: Rancher registers the nodes from the ASG as nodes but does not assign any node pool to them. All nodes are registered as independent nodes, irrespective of their Auto Scaling group. I suspect the missing node-group/node-pool is the reason for the scale-up issue. Even so, scale-down works as described above.
The ASG hosting the pod uses instance type t3a.medium (4 GB RAM / 2 vCPUs), which should comfortably fit the pod above.
Let me know if you need any further details.
There is a node selector in your pod specification that refers to the node name:
Node-Selectors: kubernetes.io/hostname=bar-common
I think this might be the reason: CA can't spin up a node with that specific name. Try removing this selector, or at least use a label other than kubernetes.io/hostname=bar-common.
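A minimal sketch of what that could look like in the pod template, assuming the nodes in this ASG carry a custom label such as pool=bar-common (the pool label is hypothetical, not something from this cluster):

    spec:
      # Select nodes by a custom group label rather than by hostname,
      # so the selector can match any node in the ASG, including future ones.
      nodeSelector:
        pool: bar-common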
@korjek Based on the use case, the applications demand different instance types. In the above example there are 3 types of nodes, created with 3 ASGs, and the pods are placed on their respective nodes as below:
node type 1 (bar-common): service1, service2, foo, service5
node type 2 (memory-asg): service3
node type 3 (cpu-asg): service4
So basically the foo application should trigger a scale-up in the first ASG, whose nodes carry the label kubernetes.io/hostname=bar-common.
Following your suggestion, I removed the node selector from the foo pods and scaled the pods up to exhaust the entire cluster capacity. Scale-up did work after scaling to 500 replicas, as below:
k describe pod foo-5f887c5db-zjwhk -n bar
Name: foo-5f887c5db-zjwhk
Namespace: bar
Priority: 0
Node: ip-172-39-0-24.us-east-2.compute.internal/172.39.0.24
Start Time: Sun, 20 Sep 2020 01:29:54 +0530
Labels: app.kubernetes.io/instance=foo
app.kubernetes.io/name=foo
pod-template-hash=5f887c5db
Annotations: cni.projectcalico.org/podIP: 10.42.11.46/32
Status: Running
IP: 10.42.11.46
Controlled By: ReplicaSet/foo-5f887c5db
Containers:
foo:
Container ID: docker://9af8a6889f0f1c25e64ffe92921a3d891792d7f19f6809d674f66e726c28582e
Image: registry.gitlab.com/prorepo/foo:v1
Image ID: docker-pullable://registry.gitlab.com/prorepo/foo@sha256:efb97fd684d4f48da9690a781f6254f78fa16488c05bb46c55657e3fc5e9cba7
Port: 9001/TCP
Host Port: 0/TCP
Args:
--spring.profiles.active=k8-bar
State: Running
Started: Sun, 20 Sep 2020 01:29:57 +0530
Ready: True
Restart Count: 0
Limits:
cpu: 300m
memory: 300Mi
Requests:
cpu: 100m
memory: 150Mi
Liveness: exec [echo hello world] delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [echo hello world] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from foo-token-b6kgl (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
foo-token-b6kgl:
Type: Secret (a volume populated by a Secret)
SecretName: foo-token-b6kgl
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/19 nodes are available: 11 Insufficient cpu, 4 Insufficient memory, 4 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/24 nodes are available: 16 Insufficient cpu, 4 Insufficient memory, 4 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/19 nodes are available: 11 Insufficient cpu, 4 Insufficient memory, 4 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/21 nodes are available: 11 Insufficient cpu, 4 Insufficient memory, 6 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/24 nodes are available: 11 Insufficient cpu, 4 Insufficient memory, 9 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/24 nodes are available: 11 Insufficient cpu, 4 Insufficient memory, 9 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/24 nodes are available: 16 Insufficient cpu, 4 Insufficient memory, 4 node(s) had taints that the pod didn't tolerate.
Normal Scheduled <unknown> default-scheduler Successfully assigned bar/foo-5f887c5db-zjwhk to ip-172-39-0-24.us-east-2.compute.internal
Normal TriggeredScaleUp 26m cluster-autoscaler pod triggered scale-up: [{memory-asg 2->6 (max: 10)}]
Normal Pulled 21m kubelet, ip-172-39-0-24.us-east-2.compute.internal Container image "registry.gitlab.com/prorepo/foo:v1" already present on machine
Normal Created 21m kubelet, ip-172-39-0-24.us-east-2.compute.internal Created container foo
Normal Started 21m kubelet, ip-172-39-0-24.us-east-2.compute.internal Started container foo
Is the problem with the node selector that the cluster autoscaler didn't recognize the bar-common ASG when the node selector was used? Or does the CA not support scaling with a node selector at all? Can you shed more light on this?
Try to use a label other than kubernetes.io/hostname.
For example, add a label to your node group like k8s.io/cluster-autoscaler/node-template/label/pool=bar-common.
Then use this label as a node selector in your pod configuration: Node-Selectors: pool=bar-common
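As an ASG tag, this example label would be expressed like this (a sketch following the node-template tag convention from the AWS docs quoted later in this thread; pool is korjek's example label):

    Key: k8s.io/cluster-autoscaler/node-template/label/pool
    Value: bar-common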
Hi, thanks @korjek. The issue is with the topology key kubernetes.io/hostname: this key is ignored as a predicate while scheduling, which is why the pod was not triggering a scale-up. It worked with a custom label.
I have another situation here. When adding custom nodes to a Rancher cluster via an AWS ASG, there is a bug in the Rancher cluster agent: it assigns the custom node labels only to the first node, and if you scale the nodes further, the new nodes do not get the custom labels. However, by setting the hostname (which creates the kubernetes.io/hostname=bar-common label), I do get that key-value pair on new nodes. So for triggering the scale-up I need the custom label, and for placement of the pod I need the hostname label. I tried using node affinity to tackle this, but affinity does not directly support a two-keys, one-value condition; I effectively want to ignore the key of the label and match only the value while placing.
Either the cluster autoscaler should support topology keys for the scaling trigger, or there needs to be logic to place the pod after scaling with an OR condition in affinity, across the two labels:
project-role=bar-common # responsible for triggering the scale-up
kubernetes.io/hostname=bar-common # responsible for pod placement
@korjek Found a way out for this condition: using node affinity to express an OR condition across the two node selectors worked in my case. Thanks for your help. Solution for my case:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      # nodeSelectorTerms are ORed: a node satisfying either term is eligible.
      nodeSelectorTerms:
      - matchExpressions:
        - key: "project-role"             # custom label; lets CA trigger the scale-up
          operator: In
          values:
          - bar-common
      - matchExpressions:
        - key: "kubernetes.io/hostname"   # system label; used for pod placement
          operator: In
          values:
          - bar-common
The first node selector term is considered for triggering the scale-up and the second is used for node scheduling (seems weird, but yes). This works because multiple nodeSelectorTerms are ORed together, while the matchExpressions within a single term are ANDed.
Still, I want to make a note here that the cluster autoscaler for AWS does not consider the node-selector or node-affinity predicate if system node labels like "kubernetes.io/hostname" are used for scheduling.
Please confirm whether this should be raised as a bug, or whether I should close the issue.
Still, I want to make a note here that the cluster autoscaler for AWS does not consider the node-selector or node-affinity predicate if system node labels like "kubernetes.io/hostname" are used for scheduling.
Do you have this label specified in your ASG configuration?
I have specified the hostname of the node while registering it in the cluster, which creates the system-generated node label kubernetes.io/hostname=bar-common when the hostname is "bar-common". In my ASG, the tags are as below:
aws:autoscaling:groupName: test-asg
k8s.io/cluster-autoscaler/enabled: true
k8s.io/cluster-autoscaler/cluster-acc: true
kubernetes.io/cluster/c-fdwnz: owned
I mean: if you do not have the label kubernetes.io/hostname=bar-common in your ASG configuration (and honestly, I don't think it's a good idea to add the label kubernetes.io/hostname=bar-common to an ASG), then CA doesn't know that bringing up an additional node in this ASG will make the pod schedulable.
This label is present, and the node selector does use this same label to schedule. The problem is that CA ignores this label and does not mark the node as schedulable.
This label is present
Do you mean you set kubernetes.io/hostname=bar-common for all nodes (i.e., kubernetes.io/hostname has the value bar-common for ALL nodes)?
Yes, this label is present by default when the node is added to the cluster.
@nitingadekar Not sure if this is your issue, but you may have to add your labels as tags in the EC2 console: https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html
Cluster Autoscaler can scale node groups to and from zero, which can yield significant cost savings. It detects the CPU, memory, and GPU resources of an Auto Scaling group by inspecting the InstanceType that is specified in its LaunchConfiguration or LaunchTemplate. Some pods require additional resources such as WindowsENI or PrivateIPv4Address or specific NodeSelectors or Taints, which can't be discovered from the LaunchConfiguration. The Cluster Autoscaler can account for these factors by discovering them from the following tags on the Auto Scaling group.
Key: k8s.io/cluster-autoscaler/node-template/resources/$RESOURCE_NAME
Value: 5
Key: k8s.io/cluster-autoscaler/node-template/label/$LABEL_KEY
Value: $LABEL_VALUE
Key: k8s.io/cluster-autoscaler/node-template/taint/$TAINT_KEY
Value: NoSchedule
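Applied to the setup in this thread, the label tag could look like the following sketch, using the project-role=bar-common label discussed above:

    Key: k8s.io/cluster-autoscaler/node-template/label/project-role
    Value: bar-common

Note that the node-template tag only tells CA what labels future nodes will carry; the nodes themselves still need to be labeled (e.g., via kubelet --node-labels) for the scheduler to actually place the pod.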
You may also need to pass --kubelet-extra-args '--node-labels=key=value' in the user data of your EC2 launch configuration.
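For example, a launch template's user data could look like the sketch below for an EKS-optimized AMI (the cluster name is a placeholder, in CloudFormation the UserData value must additionally be base64-encoded, and Rancher custom nodes register through the Rancher agent command instead, so adapt accordingly):

    UserData: |
      #!/bin/bash
      # Bootstrap the node and label it so the scheduler can match
      # pods selecting project-role=bar-common.
      /etc/eks/bootstrap.sh my-cluster \
        --kubelet-extra-args '--node-labels=project-role=bar-common'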
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-contributor-experience at kubernetes/community.
/close
@fejta-bot: Closing this issue.
How did you manage to resolve this issue?
I have created custom nodes in a Rancher (AWS provider) cluster following this doc: https://rancher.com/docs/rancher/v2.x/en/cluster-admin/cluster-autoscaler/amazon/
After deploying cluster-autoscaler with node auto-discovery, and also against a single node, with the tags and configuration below, scale-down works fine, but scale-up gives
I have verified the known reasons for the above error:
Tags on the ASG and the respective nodes
Cluster-autoscaler command with auto-discovery:
Cluster single-node command:
In both cases the issue is the same: scale-down works, but scale-up is not getting triggered.
As for scale-down, the autoscaler is scaling down nodes when they are vacant or underutilized.
As the error message is not clear enough to understand the root cause, I am stuck with no workaround for this issue. Any help or suggestions will be appreciated; thanks in advance.