kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Multiple scale-ups for pods with volumes #4712

Closed georgebuckerfield closed 2 years ago

georgebuckerfield commented 2 years ago

Which component are you using?:

cluster-autoscaler (with the priority expander)

What version of the component are you using?:

Component version: 1.21.0 (but I'm seeing the same behaviour with 1.23.0)

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:17:57Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS EKS

What did you expect to happen?:

When an unschedulable pod with a persistent volume triggers a scale-up, there should only be one scale-up for that pod.

What happened instead?:

There are two scale-ups for the pod.

How to reproduce it (as minimally and precisely as possible):
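(The pod and PVC names in the logs below — default/web-1 backed by PVC www-web-1 — suggest a StatefulSet with a volumeClaimTemplate. A minimal setup along those lines is sketched here as an assumption: the StorageClass name and image are invented, but any EBS-CSI-backed, zonal class should behave the same. The scale-up below is triggered once web-1's bound volume pins it to eu-west-1a and no existing node there has room.)

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  serviceName: web
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx              # illustrative workload
          volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: www                   # yields PVCs named www-web-0, www-web-1, ...
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-sc    # assumption: a StorageClass backed by the EBS CSI driver
        resources:
          requests:
            storage: 1Gi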

The behaviour we're seeing is this:

When the pod first becomes unschedulable, everything works as expected:

I0223 20:14:32.373716       1 scheduler_binder.go:737] PVC "default/www-web-1" is fully bound to PV "pvc-db8dc3b5-1b77-4337-946a-af294969552c"
I0223 20:14:32.373738       1 csi.go:85] Could not get a CSINode object for the node: csinode.storage.k8s.io "template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-5660362920598436033" not found
I0223 20:14:32.373752       1 scheduler_binder.go:266] FindPodVolumes for pod "default/web-1", node "template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-5660362920598436033"
I0223 20:14:32.373763       1 scheduler_binder.go:803] Could not get a CSINode object for the node "template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-5660362920598436033": csinode.storage.k8s.io "template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-5660362920598436033" not found
I0223 20:14:32.373788       1 scheduler_binder.go:826] PersistentVolume "pvc-db8dc3b5-1b77-4337-946a-af294969552c", Node "template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-5660362920598436033" matches for Pod "default/web-1"
I0223 20:14:32.373799       1 scheduler_binder.go:829] All bound volumes for Pod "default/web-1" match with Node "template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-5660362920598436033"

And eventually:

I0223 20:14:32.375255       1 priority.go:118] Successfully loaded priority configuration from configmap.
I0223 20:14:32.375271       1 priority.go:167] priority expander: eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff chosen as the highest available
I0223 20:14:32.375280       1 scale_up.go:468] Best option to resize: eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff
I0223 20:14:32.375287       1 scale_up.go:472] Estimated 1 nodes needed in eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff
I0223 20:14:32.375428       1 scale_up.go:655] No info about pods passing predicates found for group eks-test-ebs-scaling-test-t-medium-spot-eu-west-1b-90bf93a6-e976-6a79-388a-e0a5209606cc, skipping it from scale-up consideration
I0223 20:14:32.375432       1 scale_up.go:655] No info about pods passing predicates found for group eks-test-ebs-scaling-test-t-medium-spot-eu-west-1c-86bf93a6-e189-b338-c9bd-96780d1bf225, skipping it from scale-up consideration
I0223 20:14:32.375441       1 scale_up.go:586] Final scale-up plan: [{eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff 2->3 (max: 10)}]
I0223 20:14:32.375454       1 scale_up.go:675] Scale-up: setting group eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff size to 3
I0223 20:14:32.375483       1 auto_scaling_groups.go:219] Setting asg eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff size to 3

While the node is starting, again everything looks fine:

I0223 20:15:03.466627       1 scheduler_binder.go:266] FindPodVolumes for pod "default/web-1", node "template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-6325490479461909585-0"
I0223 20:15:03.466643       1 scheduler_binder.go:803] Could not get a CSINode object for the node "template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-6325490479461909585-0": csinode.storage.k8s.io "template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-6325490479461909585-0" not found
I0223 20:15:03.466669       1 scheduler_binder.go:826] PersistentVolume "pvc-db8dc3b5-1b77-4337-946a-af294969552c", Node "template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-6325490479461909585-0" matches for Pod "default/web-1"
I0223 20:15:03.466682       1 scheduler_binder.go:829] All bound volumes for Pod "default/web-1" match with Node "template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-6325490479461909585-0"
I0223 20:15:03.466705       1 filter_out_schedulable.go:157] Pod default.web-1 marked as unschedulable can be scheduled on node template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-6325490479461909585-0. Ignoring in scale up.
I0223 20:15:03.466725       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0223 20:15:03.466733       1 filter_out_schedulable.go:171] 1 pods marked as unschedulable can be scheduled.
I0223 20:15:03.466745       1 filter_out_schedulable.go:79] Schedulable pods present
I0223 20:15:03.466774       1 static_autoscaler.go:401] No unschedulable pods

But once the new node (ip-172-22-160-90.eu-west-1.compute.internal) has joined the cluster and is Ready, there is an issue:

I0223 20:16:03.880930       1 clusterstate.go:248] Scale up in group eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff finished successfully in 1m30.8249116s
I0223 20:16:03.880987       1 filter_out_schedulable.go:65] Filtering out schedulables
I0223 20:16:03.880997       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0223 20:16:03.881083       1 scheduler_binder.go:737] PVC "default/www-web-1" is fully bound to PV "pvc-db8dc3b5-1b77-4337-946a-af294969552c"
I0223 20:16:03.881123       1 scheduler_binder.go:266] FindPodVolumes for pod "default/web-1", node "ip-172-22-160-90.eu-west-1.compute.internal"
I0223 20:16:03.881140       1 scheduler_binder.go:823] PersistentVolume "pvc-db8dc3b5-1b77-4337-946a-af294969552c", Node "ip-172-22-160-90.eu-west-1.compute.internal" mismatch for Pod "default/web-1": no matching NodeSelectorTerms
I0223 20:16:03.881155       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0223 20:16:03.881159       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0223 20:16:03.881176       1 filter_out_schedulable.go:82] No schedulable pods
I0223 20:16:03.881184       1 klogx.go:86] Pod default/web-1 is unschedulable
I0223 20:16:03.881225       1 scale_up.go:376] Upcoming 0 nodes

And we start the scale up loop again:

I0223 20:16:03.881777       1 scheduler_binder.go:826] PersistentVolume "pvc-db8dc3b5-1b77-4337-946a-af294969552c", Node "template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-6179999080186222552" matches for Pod "default/web-1"
I0223 20:16:03.881783       1 scheduler_binder.go:829] All bound volumes for Pod "default/web-1" match with Node "template-node-for-eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff-6179999080186222552"

And the same node group is scaled up again:

I0223 20:16:03.882671       1 scale_up.go:468] Best option to resize: eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff
I0223 20:16:03.882678       1 scale_up.go:472] Estimated 1 nodes needed in eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff
I0223 20:16:03.882905       1 scale_up.go:655] No info about pods passing predicates found for group eks-test-ebs-scaling-test-t-medium-spot-eu-west-1b-90bf93a6-e976-6a79-388a-e0a5209606cc, skipping it from scale-up consideration
I0223 20:16:03.882911       1 scale_up.go:655] No info about pods passing predicates found for group eks-test-ebs-scaling-test-t-medium-spot-eu-west-1c-86bf93a6-e189-b338-c9bd-96780d1bf225, skipping it from scale-up consideration
I0223 20:16:03.882920       1 scale_up.go:586] Final scale-up plan: [{eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff 3->4 (max: 10)}]
I0223 20:16:03.882931       1 scale_up.go:675] Scale-up: setting group eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff size to 4
I0223 20:16:03.882959       1 auto_scaling_groups.go:219] Setting asg eks-test-ebs-scaling-test-t-medium-spot-eu-west-1a-aabf93a6-e1fa-9e53-0e8b-eb83bb976dff size to 4
I0223 20:16:04.103868       1 eventing_scale_up_processor.go:47] Skipping event processing for unschedulable pods since there is a ScaleUp attempt this loop

~10 seconds later, the autoscaler now sees that the pod can be scheduled on the node:

I0223 20:16:14.129175       1 scheduler_binder.go:266] FindPodVolumes for pod "default/web-1", node "ip-172-22-160-90.eu-west-1.compute.internal"
I0223 20:16:14.129202       1 scheduler_binder.go:826] PersistentVolume "pvc-db8dc3b5-1b77-4337-946a-af294969552c", Node "ip-172-22-160-90.eu-west-1.compute.internal" matches for Pod "default/web-1"
I0223 20:16:14.129218       1 scheduler_binder.go:829] All bound volumes for Pod "default/web-1" match with Node "ip-172-22-160-90.eu-west-1.compute.internal"
I0223 20:16:14.129240       1 filter_out_schedulable.go:157] Pod default.web-1 marked as unschedulable can be scheduled on node ip-172-22-160-90.eu-west-1.compute.internal. Ignoring in scale up.
I0223 20:16:14.129265       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0223 20:16:14.129272       1 filter_out_schedulable.go:171] 1 pods marked as unschedulable can be scheduled.
I0223 20:16:14.129285       1 filter_out_schedulable.go:79] Schedulable pods present
I0223 20:16:14.129313       1 static_autoscaler.go:401] No unschedulable pods

But by this point the additional scale-up is already happening.

My assumption is that this is a race condition between the autoscaler evaluating the new node and the EBS CSI driver adding the topology.ebs.csi.aws.com/zone label. If I add the topology.ebs.csi.aws.com/zone label statically to the node group, so that instances have it as soon as they start, the problem goes away. But that feels like the incorrect way of doing things.
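(For context: a PV dynamically provisioned by the EBS CSI driver carries required node affinity on that zone label, and the "no matching NodeSelectorTerms" message above is that selector failing against the freshly joined node. A sketch of the relevant part of such a PV — the zone value is inferred from the eu-west-1a node group in the logs, and everything except the affinity section is omitted:)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-db8dc3b5-1b77-4337-946a-af294969552c
spec:
  # capacity, csi volume source, etc. omitted for brevity
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone   # written by the EBS CSI provisioner
              operator: In
              values:
                - eu-west-1a                       # illustrative; inferred from the node group name

(Until the CSI node plugin registers on the new node and the zone label appears, this selector cannot match, so the pod is counted as unschedulable again.)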

Am I missing something obvious? Or perhaps we're configuring the autoscaler slightly incorrectly? Any suggestions of things to try would be really appreciated.
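(For reference, the static-label workaround mentioned above can be applied at the node-group level. The sketch below assumes eksctl-managed node groups — the issue doesn't say how the groups are actually managed, and all names are invented; with self-managed ASGs the same label can be passed to the kubelet via --node-labels in the bootstrap user data. As noted above, this masks the race rather than fixing it.)

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-test                          # hypothetical cluster name
  region: eu-west-1
managedNodeGroups:
  - name: ebs-scaling-test-eu-west-1a     # hypothetical node group name
    availabilityZones: ["eu-west-1a"]
    labels:
      # Applied at kubelet registration, so the label is present before the
      # EBS CSI driver would otherwise add it after node startup.
      topology.ebs.csi.aws.com/zone: eu-west-1a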

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/4712#issuecomment-1194063408):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
bhperry commented 9 months ago

Having this exact same issue! Very frustrating, because as far as I can tell the nodes get labeled right away, but the cluster-autoscaler still complains about it.

bhperry commented 9 months ago

/reopen

k8s-ci-robot commented 9 months ago

@bhperry: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes/autoscaler/issues/4712#issuecomment-1922550986):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.