kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Scale up from 0 does not work with existing AWS EBS CSI PersistentVolume #3845

Closed Xyaren closed 1 year ago

Xyaren commented 3 years ago

Which component are you using?:

What version of the component are you using?:

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

What did you expect to happen?: I have an ASG dedicated to a single CronJob that gets triggered 6 times a day. That ASG is pinned to a specific AWS AZ by its assigned subnet. The CronJob is pinned to that specific ASG by affinity + toleration. The job uses a PV that is provisioned (AWS EBS) on the first ever run and then reused on each subsequent run. I expect the ASG to be scaled up to 1 after the Pod is created and scaled back down shortly after the Pod/Job has finished.

What happened instead?:

The ASG will not be scaled up by the cluster-autoscaler.

cluster-autoscaler log output after the Job is created and the Pod is pending
2021-01-25T05:19:22.523Z : Starting main loop           
2021-01-25T05:19:22.524Z : "Found multiple availability zones for ASG "mycompany-test-eks-myapp-elastic-group-1-20210108154118845300000003"  using eu-central-1a"       
2021-01-25T05:19:22.525Z : "Found multiple availability zones for ASG "mycompany-test-eks-myapp-worker-group-2-20201029130225136800000004"   using eu-central-1a"       
2021-01-25T05:19:22.525Z : "Found multiple availability zones for ASG "mycompany-test-eks-worker-group-1-20201029130715836900000005"     using eu-central-1a"       
2021-01-25T05:19:22.526Z : Filtering out schedulables           
2021-01-25T05:19:22.526Z : 0 pods marked as unschedulable can be scheduled.         
2021-01-25T05:19:22.526Z : No schedulable pods          
2021-01-25T05:19:22.526Z : Pod myapp-masterdata/masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw is unschedulable            
2021-01-25T05:19:22.526Z : Upcoming 0 nodes         
2021-01-25T05:19:22.526Z : Skipping node group mycompany-test-eks-myapp-elastic-group-1-20210108154118845300000003 - max size reached           
2021-01-25T05:19:22.526Z : "Pod masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw can't be scheduled on mycompany-test-eks-myapp-elastic-group-2-20201029130715759300000004, predicate checking error: node(s) didn't match node selector  predicateName=NodeAffinity  reasons: node(s) didn't match node selector     debugInfo="
2021-01-25T05:19:22.526Z : No pod can fit to mycompany-test-eks-myapp-elastic-group-2-20201029130715759300000004            
2021-01-25T05:19:22.526Z : "Could not get a CSINode object for the node "template-node-for-mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003-8426967936887117836": csinode.storage.k8s.io "template-node-for-mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003-8426967936887117836" not found"          
2021-01-25T05:19:22.527Z : "PersistentVolume "pvc-ef85dcce-e63e-42da-b869-c3389bbd948d", Node "template-node-for-mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003-8426967936887117836" mismatch for Pod "myapp-masterdata/masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw": No matching NodeSelectorTerms"         
2021-01-25T05:19:22.527Z : "Pod masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw can't be scheduled on mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003, predicate checking error: node(s) had volume node affinity conflict     predicateName=VolumeBinding     reasons: node(s) had volume node affinity conflict  debugInfo="
2021-01-25T05:19:22.527Z : No pod can fit to mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003          
2021-01-25T05:19:22.527Z : "Pod masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw can't be scheduled on mycompany-test-eks-myapp-worker-group-120200916154409048800000006, predicate checking error: node(s) didn't match node selector    predicateName=NodeAffinity  reasons: node(s) didn't match node selector     debugInfo="
2021-01-25T05:19:22.527Z : No pod can fit to mycompany-test-eks-myapp-worker-group-120200916154409048800000006          
2021-01-25T05:19:22.527Z : Skipping node group mycompany-test-eks-myapp-worker-group-2-20201029130225136800000004 - max size reached            
2021-01-25T05:19:22.527Z : Skipping node group mycompany-test-eks-worker-group-1-20201029130715836900000005 - max size reached          
2021-01-25T05:19:22.527Z : "Pod masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw can't be scheduled on mycompany-test-eks-worker-group-220200916162252020100000006, predicate checking error: node(s) didn't match node selector  predicateName=NodeAffinity  reasons: node(s) didn't match node selector     debugInfo="
2021-01-25T05:19:22.527Z : No pod can fit to mycompany-test-eks-worker-group-220200916162252020100000006            
2021-01-25T05:19:22.527Z : No expansion options         
2021-01-25T05:19:22.527Z : Calculating unneeded nodes           
[...]
2021-01-25T05:19:22.528Z : Scale-down calculation: ignoring 2 nodes unremovable in the last 5m0s            
2021-01-25T05:19:22.528Z : Scale down status: unneededOnly=false lastScaleUpTime=2021-01-25 05:00:14.980160831 +0000 UTC m=+6970.760701246 lastScaleDownDeleteTime=2021-01-25 03:04:22.928996296 +0000 UTC m=+18.709536671 lastScaleDownFailTime=2021-01-25 03:04:22.928996376 +0000 UTC m=+18.709536751 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false            
2021-01-25T05:19:22.528Z : Starting scale down          
2021-01-25T05:19:22.528Z : No candidates for scale down         
2021-01-25T05:19:22.528Z : "Event(v1.ObjectReference{Kind:"Pod", Namespace:"myapp-masterdata", Name:"masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw", UID:"97956c38-55f3-4749-ab74-7e7fc674e832", APIVersion:"v1", ResourceVersion:"217276797", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 max node group size reached, 3 node(s) didn't match node selector, 1 node(s) had volume node affinity conflict"         
2021-01-25T05:19:22.946Z : k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:309: Watch close - *v1beta1.PodDisruptionBudget total 0 items received          
2021-01-25T05:19:32.542Z : Starting main loop           

Anything else we need to know?: Basically this works fine without the volume. With the volume, it works when the volume has not been provisioned yet, but fails once it has already been provisioned. The job also gets scheduled right away when I manually scale up the ASG.

I noticed the volume node affinity on the PV:

Node Affinity:
  Required Terms:
    Term 0:        topology.ebs.csi.aws.com/zone in [eu-central-1b]

That label is probably set on the node by the "ebs-csi-node" DaemonSet and is therefore unknown to the cluster-autoscaler.

Am I expected to tag the ASG with k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone? If so, how am I supposed to set it on a multi-AZ ASG?

Possibly related: https://github.com/kubernetes/autoscaler/issues/3230

westernspion commented 3 years ago

Same problem here (edit after realizing there is no relevant difference between my previous post and what you wrote).

After doing some spelunking, I believe you are correct: it has something to do with scaling from 0, the use of the topology.ebs.csi.aws.com/zone label, and the ability of the autoscaler to recognize it. Some experimentation corroborates this.

westernspion commented 3 years ago

k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone is the approach I am taking, and it works like a charm.

I can do some footwork in Terraform to get the tags set up. Not sure what you're using to provision your cluster.

Though, it would be nice to have the labels generated from the list of AZs assigned to an ASG.
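For anyone not using Terraform, here is a minimal sketch of applying the tags with the AWS CLI; the ASG name and zone below are placeholders, not values from this issue:

# Hypothetical ASG pinned to eu-central-1b; adjust the name and zone to your node group.
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-asg-eu-central-1b,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone,Value=eu-central-1b,PropagateAtLaunch=false" \
  "ResourceId=my-asg-eu-central-1b,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone,Value=eu-central-1b,PropagateAtLaunch=false"

PropagateAtLaunch can stay false: the tags only need to be readable on the ASG itself for scale-from-zero, since the kubelet and the CSI driver label the real nodes once they join.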

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

mparikhcloudbeds commented 3 years ago

How do we resolve this issue for StatefulSet deployments with custom storage classes attached on EKS?

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

FarhanSajid1 commented 2 years ago

How do we resolve this issue for StatefulSet deployments with custom storage classes attached on EKS?

So just set

k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone: "us-east-2a"

for example? As the OP mentions, how are we supposed to do this for multiple AZs?

iomarcovalente commented 2 years ago

I have this exact problem too. To add further info, the error I get on the pod that is unable to scale from zero is: pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) had volume node affinity conflict

jbg commented 2 years ago

@FarhanSajid1 you should have one node group (and thus one ASG) for each AZ. The above tag needs to be applied to the ASG.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

decipher27 commented 2 years ago

Hi folks! Facing the same issue. CA version: v1.21.1, aws-ebs-csi-driver version: v1.10.0-eksbuild.1

Cluster-autoscaler logs:

I0920 17:30:00.585954       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-173-251.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-173-251.ap-south-1.compute.internal" not found
I0920 17:30:00.586008       1 scheduler_binder.go:823] PersistentVolume "pvc-50c002d3-a5cc-4143-adf2-1362d18fc40e", Node "ip-10-121-173-251.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586074       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-68-79.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-68-79.ap-south-1.compute.internal" not found
I0920 17:30:00.586107       1 scheduler_binder.go:823] PersistentVolume "pvc-31af46c4-0d27-4eea-8ef6-148bbb2b4f0b", Node "ip-10-121-68-79.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586149       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-162-179.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-162-179.ap-south-1.compute.internal" not found
I0920 17:30:00.586172       1 scheduler_binder.go:823] PersistentVolume "pvc-50c002d3-a5cc-4143-adf2-1362d18fc40e", Node "ip-10-121-162-179.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586247       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-241-242.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-241-242.ap-south-1.compute.internal" not found
I0920 17:30:00.586275       1 scheduler_binder.go:823] PersistentVolume "pvc-50c002d3-a5cc-4143-adf2-1362d18fc40e", Node "ip-10-121-241-242.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586328       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-5-204.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-5-204.ap-south-1.compute.internal" not found
I0920 17:30:00.586350       1 scheduler_binder.go:823] PersistentVolume "pvc-31af46c4-0d27-4eea-8ef6-148bbb2b4f0b", Node "ip-10-121-5-204.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586533       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-173-251.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-173-251.ap-south-1.compute.internal" not found
I0920 17:30:00.586572       1 scheduler_binder.go:823] PersistentVolume "pvc-0c9887c2-eea3-4ef7-baae-c4c0aca78699", Node "ip-10-121-173-251.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586622       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-68-79.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-68-79.ap-south-1.compute.internal" not found
I0920 17:30:00.586663       1 scheduler_binder.go:823] PersistentVolume "pvc-df590cf4-a584-4842-9842-9629312c0e45", Node "ip-10-121-68-79.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586711       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-162-179.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-162-179.ap-south-1.compute.internal" not found
I0920 17:30:00.586737       1 scheduler_binder.go:823] PersistentVolume "pvc-0c9887c2-eea3-4ef7-baae-c4c0aca78699", Node "ip-10-121-162-179.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586802       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-241-242.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-241-242.ap-south-1.compute.internal" not found
I0920 17:30:00.586827       1 scheduler_binder.go:823] PersistentVolume "pvc-0c9887c2-eea3-4ef7-baae-c4c0aca78699", Node "ip-10-121-241-242.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586869       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-5-204.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-5-204.ap-south-1.compute.internal" not found
I0920 17:30:00.586907       1 scheduler_binder.go:823] PersistentVolume "pvc-df590cf4-a584-4842-9842-9629312c0e45", Node "ip-10-121-5-204.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586929       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0920 17:30:00.586938       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0920 17:30:00.586952       1 filter_out_schedulable.go:82] No schedulable pods
I0920 17:30:00.586966       1 klogx.go:86] Pod kafka/kafka-0 is unschedulable
I0920 17:30:00.586972       1 klogx.go:86] Pod kafka/kafka-1 is unschedulable
I0920 17:30:00.587014       1 scale_up.go:376] Upcoming 0 nodes
I0920 17:30:00.587153       1 scheduler_binder.go:803] Could not get a CSINode object for the node "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083": csinode.storage.k8s.io "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083" not found
I0920 17:30:00.587188       1 scheduler_binder.go:823] PersistentVolume "pvc-31af46c4-0d27-4eea-8ef6-148bbb2b4f0b", Node "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.587210       1 scale_up.go:300] Pod kafka-0 can't be scheduled on eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I0920 17:30:00.587316       1 scheduler_binder.go:803] Could not get a CSINode object for the node "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083": csinode.storage.k8s.io "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083" not found
I0920 17:30:00.587361       1 scheduler_binder.go:823] PersistentVolume "pvc-df590cf4-a584-4842-9842-9629312c0e45", Node "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.587386       1 scale_up.go:300] Pod kafka-1 can't be scheduled on eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I0920 17:30:00.587417       1 scale_up.go:449] No pod can fit to eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09

Our pods are stuck in Pending state due to a volume node affinity conflict.

Events from describing the kafka pods:

LAST SEEN   TYPE      REASON              OBJECT        MESSAGE
6m52s       Warning   FailedScheduling    pod/kafka-0   0/5 nodes are available: 5 node(s) had volume node affinity conflict.
6m52s       Warning   FailedScheduling    pod/kafka-1   0/5 nodes are available: 5 node(s) had volume node affinity conflict.
73s         Normal    NotTriggerScaleUp   pod/kafka-0   pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict
73s         Normal    NotTriggerScaleUp   pod/kafka-1   pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict

JBOClara commented 2 years ago

Hi @decipher27 ,

Could you show us the tags on your AWS ASG (aws autoscaling describe-auto-scaling-groups)?

My understanding of this issue is that you need the topology tags:

                {
                    "ResourceId": "eks-spot-2-XXXX",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone",
                    "Value": "us-east-1c",
                    "PropagateAtLaunch": false
                },

I've also added

                {
                    "ResourceId": "eks-spot-2-5xxxx",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone",
                    "Value": "us-east-1c",
                    "PropagateAtLaunch": false
                },

When your ASG is at 0, there is no node to retrieve the topology from. You must have the topology labels on the ASG itself to allow CA and the CSI driver to retrieve the topology.

decipher27 commented 2 years ago

We don't have the tags mentioned above, and it was working earlier. Though, we found the issue was with the scheduler; we are using a custom scheduler. Our vendor made some tweaks and it's fixed. Thank you @JBOClara

decipher27 commented 2 years ago

Also, from your comment, what do you mean by When your ASG is at 0? You mean if I set the desired count to be '0'?

JBOClara commented 2 years ago

Also, from your comment, what do you mean by When your ASG is at 0? You mean if I set the desired count to be '0'? @decipher27

Exactly: when an ASG's desired capacity is set to 0 (for instance, after a downscale of all replicas with kube-downscaler, except those of CA itself), CA will not be able to read node labels, because there is no node.

debu99 commented 2 years ago

Got the same issue. If a PVC & pod are created, and the ASG is then suspended and scaled down to 0 to save cost over the weekend, on Monday this pod is not able to start from 0; other stateless pods are okay.

JBOClara commented 2 years ago

@debu99 Look at:

Hi @decipher27 ,

Could you show us the tags on your AWS ASG (aws autoscaling describe-auto-scaling-groups)?

My understanding of this issue is that you need the topology tags:

                {
                    "ResourceId": "eks-spot-2-XXXX",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone",
                    "Value": "us-east-1c",
                    "PropagateAtLaunch": false
                },

I've also added

                {
                    "ResourceId": "eks-spot-2-5xxxx",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone",
                    "Value": "us-east-1c",
                    "PropagateAtLaunch": false
                },

When your ASG is at 0, there is no node to retrieve the topology from. You must have the topology labels on the ASG itself to allow CA and the CSI driver to retrieve the topology.

debu99 commented 2 years ago

My PV requires:

Node Affinity:
  Required Terms:
    Term 0:        topology.ebs.csi.aws.com/zone in [ap-southeast-1a]

But I believe this label is added automatically to all nodes? I didn't add it to the ASG tags, but all my nodes have it:

ip-10-40-44-63.ap-southeast-1.compute.internal    Ready    <none>   5h3m    v1.21.14-eks-ba74326   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3a.large,beta.kubernetes.io/os=linux,dedicated=redis,failure-domain.beta.kubernetes.io/region=ap-southeast-1,failure-domain.beta.kubernetes.io/zone=ap-southeast-1b,k8s-node-lifecycle=on-demand,k8s-node-role/on-demand-worker=true,k8s-node-role/type=none,k8s-node/instance-level=large,k8s-node/worker-type=t-type,k8s.io/cloud-provider-aws=be298adc77b66eafc3745cf0a9c131e0,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-40-44-63.ap-southeast-1.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3a.large,sb-subnet/type=primary,sb-subnet/zone-id=1,topology.ebs.csi.aws.com/zone=ap-southeast-1b,topology.kubernetes.io/region=ap-southeast-1,topology.kubernetes.io/zone=ap-southeast-1b
ip-10-40-7-219.ap-southeast-1.compute.internal    Ready    <none>   25m     v1.21.14-eks-ba74326   beta.kubernetes.io/arch=arm64,beta.kubernetes.io/instance-type=r6g.large,beta.kubernetes.io/os=linux,dedicated=prometheus-operator,failure-domain.beta.kubernetes.io/region=ap-southeast-1,failure-domain.beta.kubernetes.io/zone=ap-southeast-1a,k8s-node-lifecycle=on-demand,k8s-node-role/on-demand-worker=true,k8s-node-role/type=none,k8s-node/instance-level=large,k8s-node/worker-type=r-type,k8s.io/cloud-provider-aws=be298adc77b66eafc3745cf0a9c131e0,kubernetes.io/arch=arm64,kubernetes.io/hostname=ip-10-40-7-219.ap-southeast-1.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=r6g.large,sb-subnet/type=primary,sb-subnet/zone-id=0,topology.ebs.csi.aws.com/zone=ap-southeast-1a,topology.kubernetes.io/region=ap-southeast-1,topology.kubernetes.io/zone=ap-southeast-1a

jbg commented 2 years ago

Yes, but when the ASG is at 0, there are no nodes. cluster-autoscaler needs the labels tagged on the ASG to know what labels a node would have if it scaled the ASG up from 0.
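A quick way to check what the autoscaler can see for a scaled-to-zero group is to list the node-template tags on the ASG itself; this is a sketch assuming the AWS CLI, and the ASG name is a placeholder:

# Hypothetical ASG name; prints only the cluster-autoscaler node-template tags.
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-asg-eu-west-1b \
  --query "AutoScalingGroups[].Tags[?starts_with(Key, 'k8s.io/cluster-autoscaler/node-template/')]"

If nothing comes back, the autoscaler has no way to infer zone-specific labels such as topology.ebs.csi.aws.com/zone for the node it would create, and the VolumeBinding predicate fails as in the logs above.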

KiranReddy230 commented 1 year ago

We are facing the same issue with the volume node affinity error, and our ASG has nodes spun up across AZs. What is the best way for CA to spin up nodes in the right AZ? We use the priority expander. CA also throws this error:

I0103 17:43:29.663090       1 scale_up.go:449] No pod can fit to eks-atlan-node-spot-c2c299ee-8af5-1b60-2ce3-2e4dc50b5484
I0103 17:43:29.663106       1 scale_up.go:453] No expansion options

The above error appears even though there is enough room for CA to spin up new nodes in the node group, and there is one more node group where CA could launch, but CA is not functioning as expected. CA version: 1.21

jbg commented 1 year ago

@KiranReddy230 if you read the comments above yours, the question has been answered three times already. You need to add the tags mentioned above to your ASG. In order for this to work properly, each node group (and thus each ASG) should have only one zone (this is the recommended architecture anyway).

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

michalschott commented 1 year ago

I have this issue despite (I believe) having everything set up correctly.

EKS - 1.25

CA - 1.25.2:

      - command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --namespace=kube-system
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/XXX
        - --balance-similar-node-groups=true
        - --emit-per-nodegroup-metrics=true
        - --expander=most-pods,least-waste
        - --ignore-taint=node.cilium.io/agent-not-ready
        - --logtostderr=true
        - --namespace=kube-system
        - --regional=true
        - --scan-interval=1m
        - --skip-nodes-with-local-storage=false
        - --skip-nodes-with-system-pods=false
        - --stderrthreshold=error
        - --v=0
        env:
        - name: AWS_REGION
          value: eu-west-1

My 3 ASGs are tagged as follows (each of them covers a single AZ: a/b/c):

k8s.io/cluster-autoscaler/node-template/label/failure-domain.beta.kubernetes.io/zone    eu-west-1a / eu-west-1b / eu-west-1c
k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type  m5.2xlarge
k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone  eu-west-1a / eu-west-1b / eu-west-1c
k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/region  eu-west-1
k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone    eu-west-1a / eu-west-1b / eu-west-1c
k8s.io/cluster-autoscaler/node-template/taint/node.cilium.io/agent-not-ready    true:NO_EXECUTE Yes

I'm running Prometheus as a StatefulSet with PVCs (affinity rules are set to ensure replicas are spread across AZs and hosts):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    polaris.fairwinds.com/automountServiceAccountToken-exempt: "true"
    prometheus-operator-input-hash: "4772490143308579296"
  creationTimestamp: "2023-03-03T20:52:48Z"
  generation: 56
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 47.0.0
    argocd.argoproj.io/instance: xxx-prometheus
    chart: kube-prometheus-stack-47.0.0
    heritage: Helm
    operator.prometheus.io/mode: server
    operator.prometheus.io/name: prometheus-prometheus
    operator.prometheus.io/shard: "0"
    release: prometheus
  name: prometheus-prometheus-prometheus
  namespace: prometheus
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: Prometheus
    name: prometheus-prometheus
    uid: ce818fdf-02b4-4718-a430-f4ff4c5acbc5
  resourceVersion: "342440131"
  uid: 662e082a-af26-40e4-b39e-d354a023fe0a
spec:
  podManagementPolicy: Parallel
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: prometheus-prometheus
      app.kubernetes.io/managed-by: prometheus-operator
      app.kubernetes.io/name: prometheus
      operator.prometheus.io/name: prometheus-prometheus
      operator.prometheus.io/shard: "0"
      prometheus: prometheus-prometheus
  serviceName: prometheus-operated
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        kubectl.kubernetes.io/default-container: prometheus
        linkerd.io/inject: enabled
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: prometheus-prometheus
        app.kubernetes.io/managed-by: prometheus-operator
        app.kubernetes.io/name: prometheus
        app.kubernetes.io/version: 2.44.0
        operator.prometheus.io/name: prometheus-prometheus
        operator.prometheus.io/shard: "0"
        prometheus: prometheus-prometheus
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/instance: prometheus-prometheus
                app.kubernetes.io/name: prometheus
                prometheus: prometheus-prometheus
            topologyKey: topology.kubernetes.io/zone
          - labelSelector:
              matchLabels:
                app.kubernetes.io/instance: prometheus-prometheus
                app.kubernetes.io/name: prometheus
                prometheus: prometheus-prometheus
            topologyKey: kubernetes.io/hostname
      automountServiceAccountToken: true
      containers:
      - args:
        - --web.console.templates=/etc/prometheus/consoles
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --config.file=/etc/prometheus/config_out/prometheus.env.yaml
        - --web.enable-lifecycle
        - --web.external-url=https://prometheus.xxx.xxx
        - --web.route-prefix=/
        - --log.level=error
        - --log.format=json
        - --storage.tsdb.retention.time=3h
        - --storage.tsdb.path=/prometheus
        - --storage.tsdb.wal-compression
        - --web.config.file=/etc/prometheus/web_config/web-config.yaml
        - --storage.tsdb.max-block-duration=2h
        - --storage.tsdb.min-block-duration=2h
        image: XXX.dkr.ecr.eu-west-1.amazonaws.com/quay.io/prometheus/prometheus:v2.44.0
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /-/healthy
            port: http-web
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        name: prometheus
        ports:
        - containerPort: 9090
          name: http-web
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/ready
            port: http-web
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        resources:
          limits:
            memory: 20Gi
          requests:
            cpu: 300m
            memory: 20Gi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
        startupProbe:
          failureThreshold: 60
          httpGet:
            path: /-/ready
            port: http-web
            scheme: HTTP
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 3
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config_out
          name: config-out
          readOnly: true
        - mountPath: /etc/prometheus/certs
          name: tls-assets
          readOnly: true
        - mountPath: /prometheus
          name: prometheus-prometheus-prometheus-db
          subPath: prometheus-db
        - mountPath: /etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
          name: prometheus-prometheus-prometheus-rulefiles-0
        - mountPath: /etc/prometheus/web_config/web-config.yaml
          name: web-config
          readOnly: true
          subPath: web-config.yaml
      - args:
        - --listen-address=:8080
        - --reload-url=http://127.0.0.1:9090/-/reload
        - --config-file=/etc/prometheus/config/prometheus.yaml.gz
        - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
        - --watched-dir=/etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
        - --log-level=error
        - --log-format=json
        command:
        - /bin/prometheus-config-reloader
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SHARD
          value: "0"
        image: XXX.dkr.ecr.eu-west-1.amazonaws.com/quay.io/prometheus-operator/prometheus-config-reloader:v0.66.0
        imagePullPolicy: Always
        name: config-reloader
        ports:
        - containerPort: 8080
          name: reloader-web
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 50Mi
          requests:
            cpu: 50m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config
          name: config
        - mountPath: /etc/prometheus/config_out
          name: config-out
        - mountPath: /etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
          name: prometheus-prometheus-prometheus-rulefiles-0
      - args:
        - sidecar
        - --prometheus.url=http://127.0.0.1:9090/
        - '--prometheus.http-client={"tls_config": {"insecure_skip_verify":true}}'
        - --grpc-address=:10901
        - --http-address=:10902
        - --objstore.config=$(OBJSTORE_CONFIG)
        - --tsdb.path=/prometheus
        - --log.level=error
        - --log.format=json
        env:
        - name: OBJSTORE_CONFIG
          valueFrom:
            secretKeyRef:
              key: config
              name: thanos-config
        image: XXX.dkr.ecr.eu-west-1.amazonaws.com/bitnami/thanos:0.31.0
        imagePullPolicy: Always
        name: thanos-sidecar
        ports:
        - containerPort: 10902
          name: http
          protocol: TCP
        - containerPort: 10901
          name: grpc
          protocol: TCP
        resources:
          limits:
            memory: 256Mi
          requests:
            cpu: 10m
            memory: 256Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /prometheus
          name: prometheus-prometheus-prometheus-db
          subPath: prometheus-db
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - --watch-interval=0
        - --listen-address=:8080
        - --config-file=/etc/prometheus/config/prometheus.yaml.gz
        - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
        - --watched-dir=/etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
        - --log-level=error
        - --log-format=json
        command:
        - /bin/prometheus-config-reloader
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SHARD
          value: "0"
        image: XXX.dkr.ecr.eu-west-1.amazonaws.com/quay.io/prometheus-operator/prometheus-config-reloader:v0.66.0
        imagePullPolicy: Always
        name: init-config-reloader
        ports:
        - containerPort: 8080
          name: reloader-web
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 50Mi
          requests:
            cpu: 50m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config
          name: config
        - mountPath: /etc/prometheus/config_out
          name: config-out
        - mountPath: /etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
          name: prometheus-prometheus-prometheus-rulefiles-0
      nodeSelector:
        node.kubernetes.io/instance-type: m5.2xlarge
      priorityClassName: prometheus
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 2000
        runAsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: prometheus-prometheus
      serviceAccountName: prometheus-prometheus
      terminationGracePeriodSeconds: 600
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/instance: prometheus-prometheus
            app.kubernetes.io/name: prometheus
            prometheus: prometheus-prometheus
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      volumes:
      - name: config
        secret:
          defaultMode: 420
          secretName: prometheus-prometheus-prometheus
      - name: tls-assets
        projected:
          defaultMode: 420
          sources:
          - secret:
              name: prometheus-prometheus-prometheus-tls-assets-0
      - emptyDir:
          medium: Memory
        name: config-out
      - configMap:
          defaultMode: 420
          name: prometheus-prometheus-prometheus-rulefiles-0
        name: prometheus-prometheus-prometheus-rulefiles-0
      - name: web-config
        secret:
          defaultMode: 420
          secretName: prometheus-prometheus-prometheus-web-config
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: prometheus-prometheus-prometheus-db
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: ebs-sc-preserve
      volumeMode: Filesystem
    status:
      phase: Pending
~ k get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                                               STORAGECLASS      REASON   AGE
pvc-e6df1f14-4f62-41ce-8f21-97b73b0c055f   10Gi       RWO            Retain           Bound    prometheus/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-0   ebs-sc-preserve            61d
pvc-f40a6589-6fcf-4419-9486-70e5efa43575   10Gi       RWO            Retain           Bound    prometheus/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-1   ebs-sc-preserve            9d

~ k describe pv pvc-e6df1f14-4f62-41ce-8f21-97b73b0c055f pvc-f40a6589-6fcf-4419-9486-70e5efa43575
Name:              pvc-e6df1f14-4f62-41ce-8f21-97b73b0c055f
Labels:            <none>
Annotations:       pv.kubernetes.io/provisioned-by: ebs.csi.aws.com
                   volume.kubernetes.io/provisioner-deletion-secret-name:
                   volume.kubernetes.io/provisioner-deletion-secret-namespace:
Finalizers:        [kubernetes.io/pv-protection external-attacher/ebs-csi-aws-com]
StorageClass:      ebs-sc-preserve
Status:            Bound
Claim:             prometheus/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-0
Reclaim Policy:    Retain
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          10Gi
Node Affinity:
  Required Terms:
    Term 0:        topology.ebs.csi.aws.com/zone in [eu-west-1c]
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            ebs.csi.aws.com
    FSType:            ext4
    VolumeHandle:      vol-08b0f4a31f192dad7
    ReadOnly:          false
    VolumeAttributes:      storage.kubernetes.io/csiProvisionerIdentity=1683859406228-8081-ebs.csi.aws.com
Events:                <none>

Name:              pvc-f40a6589-6fcf-4419-9486-70e5efa43575
Labels:            <none>
Annotations:       pv.kubernetes.io/provisioned-by: ebs.csi.aws.com
                   volume.kubernetes.io/provisioner-deletion-secret-name:
                   volume.kubernetes.io/provisioner-deletion-secret-namespace:
Finalizers:        [kubernetes.io/pv-protection external-attacher/ebs-csi-aws-com]
StorageClass:      ebs-sc-preserve
Status:            Bound
Claim:             prometheus/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-1
Reclaim Policy:    Retain
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          10Gi
Node Affinity:
  Required Terms:
    Term 0:        topology.ebs.csi.aws.com/zone in [eu-west-1b]
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            ebs.csi.aws.com
    FSType:            ext4
    VolumeHandle:      vol-07d31d533b2e01a4b
    ReadOnly:          false
    VolumeAttributes:      storage.kubernetes.io/csiProvisionerIdentity=1687797020030-8081-ebs.csi.aws.com
Events:                <none>

Every night between 00:00 and 06:00 (I believe this is when AWS rebalancing happens), at least one of the Prometheus replicas gets stuck in Pending state. Once cluster-autoscaler is restarted (k -n kube-system rollout restart deploy cluster-autoscaler), the ASG is scaled up properly.

For now I had to set minCapacity = 1 for these ASGs to prevent such situations.

mmerrill3 commented 1 year ago

This is closely related to issue #4739, which was fixed in cluster autoscaler version 1.22 onward. If you look at the function that generates a hypothetical new node to satisfy the pending pod, the new label that is needed to satisfy volumes created by the EBS CSI driver is not part of that function. It will not scale up unless you add the tag to the ASG manually.

Current function: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L409

The next function is why adding the labels to the ASG makes this work:

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L423

Since the label is widely used now, maybe we should update the buildGenericLabels function to also set topology.ebs.csi.aws.com/zone on the new node when it is hypothetically being built.

msvticket commented 1 year ago

I can take a stab at providing a PR with a fix.