kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Labels match but Cluster Autoscaler says "are not similar, labels do not match" when trying to balance similar node groups. #6954

Open nicksecurity opened 4 months ago

nicksecurity commented 4 months ago

Which component are you using?: cluster-autoscaler

What version of the component are you using?: Component version: 1.28.5

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.28.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.9-eks-036c24b

What environment is this in?: AWS, EKS using EC2

What did you expect to happen?: When new nodes are added, I expect them to be balanced across the 3 similar node groups, which all have the same labels.

What happened instead?: I have 3 node groups, one per AZ, but the new nodes are only added to 1 node group.

The error says the labels are different, but I've checked them all and, apart from a couple I've excluded, they match.

I0621 14:19:48.338960 1 compare_nodegroups.go:157] nodes template-node-for-eks-nodegroup-4-128-1ec80b90 and template-node-for-eks-nodegroup-3-128-f0c80b8d are not similar, labels do not match
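
This is roughly how I compared the labels on a node from each group (node names are placeholders; requires jq):

# Dump the sorted label map for one node from each node group, then diff them.
kubectl get node <node-from-group-a> -o json | jq -S '.metadata.labels' > labels-a.json
kubectl get node <node-from-group-b> -o json | jq -S '.metadata.labels' > labels-b.json
# Only the handful of labels I've excluded should show up as differences.
diff labels-a.json labels-b.json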

How to reproduce it (as minimally and precisely as possible): Scale up the pods so that several new nodes are added, then check which node group they were added to.

Anything else we need to know?: No

adrianmoisey commented 4 months ago

/area cluster-autoscaler

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

jbilliau-rcd commented 2 weeks ago

Having the same issue when testing cluster-autoscaler 1.29.4. I have 3 node groups, one per AZ, and spun up 20 pods... it increased only ONE node group, by 11 nodes:

cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.182407       1 klogx.go:87] Pod beta-whale/beta-whale-5fb675658c-n9kls is unschedulable
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.182410       1 klogx.go:87] Pod beta-whale/beta-whale-5fb675658c-hmfr7 is unschedulable
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.182412       1 klogx.go:87] Pod beta-whale/beta-whale-5fb675658c-zx4gc is unschedulable
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.182414       1 klogx.go:87] Pod beta-whale/beta-whale-5fb675658c-nqr45 is unschedulable
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.182416       1 klogx.go:87] Pod beta-whale/beta-whale-5fb675658c-krgdk is unschedulable
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.182423       1 klogx.go:87] Pod kube-system/overprovisioning-757f8f8fbc-9bftr is unschedulable
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.182425       1 klogx.go:87] 1 other pods are also unschedulable
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.184241       1 orchestrator.go:108] Upcoming 0 nodes
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.184397       1 compare_nodegroups.go:157] nodes template-node-for-eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6-1335741201004455891 and template-node-for-eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302-574218763908410739 are not similar, labels do not match
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.184413       1 compare_nodegroups.go:157] nodes template-node-for-eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6-1335741201004455891 and template-node-for-eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b-1772456504083316692 are not similar, labels do not match
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.185704       1 compare_nodegroups.go:157] nodes template-node-for-eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302-574218763908410739 and template-node-for-eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b-1772456504083316692 are not similar, labels do not match
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.185720       1 compare_nodegroups.go:157] nodes template-node-for-eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302-574218763908410739 and template-node-for-eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6-1335741201004455891 are not similar, labels do not match
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.186994       1 compare_nodegroups.go:157] nodes template-node-for-eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b-1772456504083316692 and template-node-for-eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6-1335741201004455891 are not similar, labels do not match
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.187009       1 compare_nodegroups.go:157] nodes template-node-for-eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b-1772456504083316692 and template-node-for-eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302-574218763908410739 are not similar, labels do not match
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.188273       1 priority.go:114] Successfully loaded priority configuration from configmap.
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.188285       1 priority.go:163] priority expander: eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6 chosen as the highest available
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.188288       1 priority.go:163] priority expander: eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302 chosen as the highest available
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.188291       1 priority.go:163] priority expander: eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b chosen as the highest available
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.188298       1 orchestrator.go:181] Best option to resize: eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.188307       1 orchestrator.go:185] Estimated 11 nodes needed in eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.188323       1 compare_nodegroups.go:157] nodes template-node-for-eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6-1335741201004455891 and template-node-for-eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302-574218763908410739 are not similar, labels do not match
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.188335       1 compare_nodegroups.go:157] nodes template-node-for-eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6-1335741201004455891 and template-node-for-eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b-1772456504083316692 are not similar, labels do not match
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.188342       1 orchestrator.go:249] No similar node groups found
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.188358       1 orchestrator.go:291] Final scale-up plan: [{eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6 2->13 (max: 50)}]
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.188385       1 executor.go:147] Scale-up: setting group eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6 size to 13
cluster-autoscaler-86c94b64cd-mv5mm aws-cluster-autoscaler I1031 14:33:48.188409       1 auto_scaling_groups.go:267] Setting asg eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6 size to 13
jbilliau-rcd commented 2 weeks ago

Hmmm, never mind, just fixed it by switching to --balancing-label. --balancing-ignore-label seems overly complicated and is prone to breaking if the labels on the nodes change. @nicksecurity see this PR here - https://github.com/kubernetes/autoscaler/pull/4174
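
For reference, the relevant container args end up looking roughly like this (a sketch, not my exact deployment; the label name is just an example):

# Cluster Autoscaler flags (illustrative; adjust to however you deploy the autoscaler)
./cluster-autoscaler \
  --balance-similar-node-groups=true \
  --balancing-label=node-pool   # example label; node groups are compared ONLY on the listed label(s)
# The approach I had before: default similarity heuristics plus explicit exclusions,
# where every label NOT excluded still has to match exactly between groups:
#   --balancing-ignore-label=my-custom-label

After the change, the scale-up is split across all three similar node groups: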

cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.230480       1 priority.go:163] priority expander: eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6 chosen as the highest available
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.230484       1 priority.go:163] priority expander: eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302 chosen as the highest available
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.230486       1 priority.go:163] priority expander: eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b chosen as the highest available
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.230493       1 orchestrator.go:181] Best option to resize: eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.230498       1 orchestrator.go:185] Estimated 8 nodes needed in eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.230520       1 orchestrator.go:246] Found 2 similar node groups: [eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6]
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.230548       1 orchestrator.go:281] Splitting scale-up between 3 similar node groups: {eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302, eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b, eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6}
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.230567       1 orchestrator.go:291] Final scale-up plan: [{eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b 1->5 (max: 50)} {eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302 2->5 (max: 50)} {eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6 3->4 (max: 50)}]
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.230583       1 executor.go:147] Scale-up: setting group eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b size to 5
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.230606       1 auto_scaling_groups.go:267] Setting asg eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b size to 5
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.230830       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"1cd02fde-5df3-4755-a3be-880101f4f685", APIVersion:"v1", ResourceVersion:"110291641", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b size to 5 instead of 1 (max: 50)
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.353381       1 executor.go:147] Scale-up: setting group eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302 size to 5
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.353413       1 auto_scaling_groups.go:267] Setting asg eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302 size to 5
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.353488       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"1cd02fde-5df3-4755-a3be-880101f4f685", APIVersion:"v1", ResourceVersion:"110291641", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: group eks-spot-jason20241030133218211400000007-cec96ed0-43eb-c9e9-8f1c-1d9adaa3c01b size set to 5 instead of 1 (max: 50)
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.367471       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"1cd02fde-5df3-4755-a3be-880101f4f685", APIVersion:"v1", ResourceVersion:"110291641", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302 size to 5 instead of 2 (max: 50)
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.434135       1 executor.go:147] Scale-up: setting group eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6 size to 4
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.434161       1 auto_scaling_groups.go:267] Setting asg eks-spot-jason20241030133218199100000003-02c96ed0-43e2-76e6-7f01-f1c22b7851f6 size to 4
cluster-autoscaler-759cb97bf8-nrdjc aws-cluster-autoscaler I1031 14:59:06.434240       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"1cd02fde-5df3-4755-a3be-880101f4f685", APIVersion:"v1", ResourceVersion:"110291641", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: group eks-spot-jason20241030133218207200000005-dac96ed0-43e6-8b44-fa5d-8ffe23060302 size set to 5 instead of 2 (max: 50)