kubernetes / autoscaler

Autoscaling components for Kubernetes

[AWS] failed to scale up ASG from 0 when security group for pods is enabled #6021

Open · cloudcarver opened this issue 1 year ago

cloudcarver commented 1 year ago

Which component are you using?:

cluster-autoscaler cloud provider: AWS


What version of the component are you using?:

AWS VPC CNI: v1.12.6-eksbuild.2
Cluster Autoscaler: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.2


What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.1-eks-2f008fe", GitCommit:"abfec7d7e55d56346a5259c9379dea9f56ba2926", GitTreeState:"clean", BuildDate:"2023-04-14T20:43:13Z", GoVersion:"go1.20.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.6-eks-a5565ad", GitCommit:"895ed80e0cdcca657e88e56c6ad64d4998118590", GitTreeState:"clean", BuildDate:"2023-06-16T17:34:03Z", GoVersion:"go1.19.10", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS EKS ap-southeast-1


What did you expect to happen?:

Before hitting this bug, I could create pods with security groups without any problems. Everything worked well, and the feature had already been shipped to the production environment.

Then I created a new node group and used node affinity to schedule some workloads onto it. The pods of that workload stay Pending forever.

What happened instead?:

I got events like the following:

Normal   NotTriggerScaleUp  67s (x17 over 4m20s)  cluster-autoscaler  (combined from similar events): 
pod didn't trigger scale-up: 36 Insufficient vpc.amazonaws.com/pod-eni, 4 node(s) had untolerated taint {node_group: spot}, 1 node(s) had untolerated taint {node_type: monitoring}, 36 node(s) didn't match Pod's node affinity/selector

How to reproduce it (as minimally and precisely as possible):

Create a dedicated node group and use node affinity so that a particular workload can only be scheduled onto that node group. When pods of this workload are created, the autoscaler tries to scale the corresponding node group up from 0. The predicate check for that node group fails in the simulator-based scheduler, because the NodeInfo built from the corresponding ASG does not contain the vpc.amazonaws.com/pod-eni resource in its capacity.
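
For illustration, here is a minimal sketch of the kind of manifest that triggers it. The label key/value and pod name are placeholders rather than my actual setup, and the pod is assumed to match an existing SecurityGroupPolicy so that the VPC resource controller webhook injects a vpc.amazonaws.com/pod-eni request into it:

```sh
# Hypothetical reproduction sketch. Assumes:
#   - the new node group's nodes carry the label node_group=sgp-workloads (placeholder),
#   - the pod matches a SecurityGroupPolicy, so the VPC resource controller webhook
#     injects a "vpc.amazonaws.com/pod-eni: 1" resource request into it.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sgp-scale-from-zero-test
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node_group
                operator: In
                values: ["sgp-workloads"]
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
EOF
# Expected: the pod stays Pending, and cluster-autoscaler reports
# "Insufficient vpc.amazonaws.com/pod-eni" for the scaled-to-zero ASG.
```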

Anything else we need to know?:

This is FIXED by adding a tag to all ASGs:

k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/pod-eni: 6
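
For anyone else hitting this, one way to apply that tag is with the AWS CLI. This is only a sketch: my-sgp-asg and the value 6 are placeholders for your ASG name and your instance type's branch-ENI limit.

```sh
# Sketch: tag the Auto Scaling group so cluster-autoscaler's node template
# advertises the extended resource when scaling up from zero.
# "my-sgp-asg" and the value "6" are placeholders.
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-sgp-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/pod-eni,Value=6,PropagateAtLaunch=false"
```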

But clearly there should be a better way 🤔

Shubham82 commented 1 year ago

/area provider/aws

k8s-triage-robot commented 10 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Shubham82 commented 9 months ago

/remove-lifecycle stale

QustodioPablo commented 8 months ago

What's the value that should be used for the k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/pod-eni tag? The number of network interfaces the instance can have, or that number minus one, since one of the ENIs is used as the trunk ENI?

cloudcarver commented 7 months ago

> What's the value that should be used for the k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/pod-eni tag? The number of network interfaces the instance can have, or that number minus one, since one of the ENIs is used as the trunk ENI?

Currently, you can put any value in it just to get a node started from 0. The autoscaler will then pick up the correct value, since it is a dynamic value that reflects the actual capacity of a running node.
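
As a sanity check (sketch; <node-name> is a placeholder), you can compare the tag against what a real node actually reports once it has joined:

```sh
# Sketch: read the extended-resource capacity that the node itself advertises.
# This is the value that matters once an instance exists; the ASG tag only has
# to get the first scale-up past the scheduler simulation.
kubectl get node <node-name> \
  -o jsonpath='{.status.capacity.vpc\.amazonaws\.com/pod-eni}'
```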

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Shubham82 commented 3 months ago

/remove-lifecycle rotten

k8s-triage-robot commented 4 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale