kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

cluster-autoscaler gets stuck with "Some nodes that failed to create were removed" #6601

Open daimaxiaxie opened 8 months ago

daimaxiaxie commented 8 months ago

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: v1.28.0

What k8s version are you using (kubectl version)?:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.16", GitCommit:"e37e4ab4cc8dcda84f1344dda47a97bb1927d074", GitTreeState:"clean", BuildDate:"2021-10-27T16:25:59Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.3-5", GitCommit:"dcc97265743078854c5328e30727147bdc5d1c37", GitTreeState:"clean", BuildDate:"2020-12-04T03:52:29Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS, via the AWS cloud provider.

What did you expect to happen?:

I expect cluster-autoscaler to be able to scale ASGs up/down without error.

What happened instead?:

cluster-autoscaler is stuck in its main loop, logging the following on every iteration:

I0304 08:20:40.253243       1 static_autoscaler.go:287] Starting main loop
I0304 08:20:40.539180       1 mixed_nodeinfos_processor.go:205] GetNodeInfosForGroups: setup node infos for group size: 41
I0304 08:20:40.549177       1 clusterstate.go:1081] Found 5 instances with errorCode OutOfResource.placeholder-cannot-be-fulfilled in nodeGroup spot-32c128g-b
I0304 08:20:40.549203       1 clusterstate.go:1099] Failed adding 5 nodes (0 unseen previously) to group spot-32c128g-b due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
I0304 08:20:40.549223       1 clusterstate.go:1081] Found 3 instances with errorCode OutOfResource.placeholder-cannot-be-fulfilled in nodeGroup spot-32c128g-c
I0304 08:20:40.549242       1 clusterstate.go:1099] Failed adding 3 nodes (0 unseen previously) to group spot-32c128g-c due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
I0304 08:20:40.549392       1 static_autoscaler.go:859] Deleting 5 from spot-32c128g-b node group because of create errors
I0304 08:20:40.549406       1 static_autoscaler.go:859] Deleting 3 from spot-32c128g-c node group because of create errors
I0304 08:20:40.549417       1 static_autoscaler.go:438] Some nodes that failed to create were removed, skipping iteration
I0304 08:20:55.569411       1 static_autoscaler.go:287] Starting main loop
I0304 08:20:55.863555       1 mixed_nodeinfos_processor.go:205] GetNodeInfosForGroups: setup node infos for group size: 41
I0304 08:20:55.873638       1 clusterstate.go:1081] Found 5 instances with errorCode OutOfResource.placeholder-cannot-be-fulfilled in nodeGroup spot-32c128g-b
I0304 08:20:55.873666       1 clusterstate.go:1099] Failed adding 5 nodes (0 unseen previously) to group spot-32c128g-b due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
I0304 08:20:55.873687       1 clusterstate.go:1081] Found 3 instances with errorCode OutOfResource.placeholder-cannot-be-fulfilled in nodeGroup spot-32c128g-c
I0304 08:20:55.873697       1 clusterstate.go:1099] Failed adding 3 nodes (0 unseen previously) to group spot-32c128g-c due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
I0304 08:20:55.873842       1 static_autoscaler.go:859] Deleting 5 from spot-32c128g-b node group because of create errors
I0304 08:20:55.873855       1 static_autoscaler.go:859] Deleting 3 from spot-32c128g-c node group because of create errors
I0304 08:20:55.873865       1 static_autoscaler.go:438] Some nodes that failed to create were removed, skipping iteration

How to reproduce it (as minimally and precisely as possible):

Spot instances in an ASG are reclaimed and the ASG cannot launch replacements, so the number of remaining instances falls below the ASG's MinSize.

Anything else we need to know?:

When spot instances in an ASG are reclaimed and the number of remaining instances drops below MinSize, deleteCreatedNodesWithErrors reports removed nodes on every loop, so every iteration is skipped. As a result, the other, healthy ASGs can never scale up.
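The loop described above can be illustrated with a small, self-contained Go model. This is only a sketch of the reported behavior, not the cluster-autoscaler source: the `nodeGroup` type, `deleteNodesWithCreateErrors`, and `runIterations` are hypothetical names, and the rule that errored placeholders cannot be cleared once the group is pinned at its minimum size is an assumption drawn from this report.

```go
package main

import "fmt"

// nodeGroup is a hypothetical, simplified stand-in for an ASG-backed node group.
type nodeGroup struct {
	name         string
	minSize      int
	targetSize   int
	createErrors int // placeholder instances that report a create error
}

// deleteNodesWithCreateErrors models the cleanup step from the report.
// It returns true whenever any group has errored instances. ASSUMPTION:
// the errors are only cleared if shrinking the target keeps it at or
// above minSize; otherwise they are seen again on the next iteration.
func deleteNodesWithCreateErrors(groups []*nodeGroup) bool {
	removed := false
	for _, g := range groups {
		if g.createErrors == 0 {
			continue
		}
		removed = true
		if g.targetSize-g.createErrors >= g.minSize {
			g.targetSize -= g.createErrors
			g.createErrors = 0
		}
	}
	return removed
}

// runIterations runs n main-loop iterations and returns how many of them
// got past the cleanup step and reached the scale-up logic.
func runIterations(groups []*nodeGroup, n int) int {
	reachedScaleUp := 0
	for i := 0; i < n; i++ {
		if deleteNodesWithCreateErrors(groups) {
			continue // "Some nodes that failed to create were removed, skipping iteration"
		}
		reachedScaleUp++ // scale-up for all groups would be evaluated here
	}
	return reachedScaleUp
}

func main() {
	groups := []*nodeGroup{
		{name: "spot-32c128g-b", minSize: 8, targetSize: 8, createErrors: 5},
		{name: "healthy-group", minSize: 0, targetSize: 2},
	}
	fmt.Printf("iterations that reached scale-up: %d of 10\n", runIterations(groups, 10))
}
```

In this model a single group stuck at its minimum size is enough to keep every iteration from reaching scale-up, which matches the symptom that unrelated ASGs stop scaling as well.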

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 months ago

/lifecycle rotten

Shubham82 commented 3 months ago

/remove-lifecycle rotten

k8s-triage-robot commented 3 weeks ago

/lifecycle stale

Shubham82 commented 3 weeks ago

/remove-lifecycle stale