kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Pods are pending for such a long time when HPA scales #4038

Closed kaviarasan-ex2 closed 2 years ago

kaviarasan-ex2 commented 3 years ago

Hi there, seeking expert advice here; the details are below.

Configuration:

  1. CA with the priority expander strategy (see the config sketch after this list)
  2. Max node provision time set to 5m0s
  3. AWS cloud & kops cluster
  4. 2 node groups configured in the CA: node group A (min 2 & max 7 instances) and node group B (min 0 & max 7 instances)
  5. When node group A can't provision an instance within the given period of time, the CA falls back to node group B
  6. HPA configured with min 5 to max 50 replicas as per the requirement
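
For reference, a minimal sketch of what the priority-expander part of this setup might look like, assuming the CA is started with `--expander=priority` and `--max-node-provision-time=5m0s`, and that the two ASGs have names containing `node-group-a` / `node-group-b` (hypothetical; match your real ASG names):

```yaml
# The priority expander reads this ConfigMap (fixed name,
# in the namespace where the Cluster Autoscaler runs).
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    # Higher number = higher priority. Values are regexes
    # matched against node group (ASG) names.
    20:
      - .*node-group-a.*   # preferred group (min 2, max 7)
    10:
      - .*node-group-b.*   # fallback group (min 0, max 7)
```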

Observation:

  1. Whenever application/pod metrics trigger the HPA, it scales out and the CA adds nodes in node group A, or node group B as appropriate, without any issues
  2. HPA scaling is expected to happen only about 20% of the day, but whenever it scales, it scales up to the maximum number of replicas/nodes; it is deliberately configured that way based on our requirements and analysis
  3. Provisioning an instance at the cloud provider takes around 3 to 4 minutes, and on top of that our pod/application takes another 2 to 3 minutes to reach the Running state

What we tested:

  1. We tested the overprovisioning approach with paused placeholder containers, defining a low PriorityClass for them (a sketch of this setup follows below). It solves our issue to a certain extent, but not completely. When the HPA scales, it can need up to 7 instances at a time, while the paused placeholder runs only 1 or 2 replicas on 1 or 2 instances in the cluster. The placeholders accommodate some of the actual application pods immediately: the paused containers are preempted, and those services are running within about 3 minutes, with no node-provisioning delay. The remaining pods, however, take up to 6 minutes to reach Running, since additional instances have to be provisioned and join the cluster. This is the delay we have not been able to eliminate.
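
For context, this is roughly the overprovisioning pattern being described: a negative-priority PriorityClass plus a placeholder Deployment running pause containers, which the scheduler preempts as soon as real pods need the room. The replica count, image tag, and resource requests below are illustrative assumptions:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                  # below every real workload, so these pods are evicted first
globalDefault: false
description: "Placeholder pods that reserve headroom for real workloads"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: default
spec:
  replicas: 2               # headroom: how many "spare pods" worth of capacity to hold
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:          # size each placeholder to roughly match one application pod
            cpu: "1"
            memory: 1Gi
```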

What we are looking for: any suggestion or recommendation on how we can scale proactively and avoid the roughly 3 minutes of delay while additional nodes are provisioned when scaling happens. We would appreciate a response as early as possible; please let us know if any additional information is required.

kaviarasan-ex2 commented 3 years ago

Any update on this, please?

MaciekPytel commented 3 years ago

Cluster Autoscaler does reactive scaling; it doesn't have any predictive capability. If the time it takes to boot up instances is too long for your case, you need to have those instances ready before the scale-up is needed.

If you can predict the spike before it happens, you can build a bunch of different solutions. The common patterns I've seen boil down to pre-scaling capacity ahead of the predicted spike, for example by resizing an overprovisioning placeholder deployment on a schedule or by raising a node group's minimum size before the spike (see the sketch below).

All of this is largely DIY and relies on you providing the logic that predicts the spike. There is no support for predictive autoscaling in Kubernetes. There may be some projects on GitHub you could leverage, but they're not developed by Kubernetes sig-autoscaling and I have no experience with them.
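
Since the reply above leaves the implementation to the user, here is one hedged sketch of the schedule-based pattern: a CronJob that scales the overprovisioning placeholder Deployment up shortly before an anticipated spike. All names, the schedule, and the replica count are assumptions for illustration:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prescaler
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prescaler
  namespace: default
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "deployments/scale"]
  verbs: ["get", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prescaler
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prescaler
subjects:
- kind: ServiceAccount
  name: prescaler
  namespace: default
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: prescale-before-spike
  namespace: default
spec:
  schedule: "45 8 * * 1-5"          # ~15 min before an assumed 09:00 weekday spike
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: prescaler
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl:1.21   # any image with kubectl works
            command:
            - kubectl
            - scale
            - deployment/overprovisioning
            - --replicas=10               # enough headroom for the spike (assumption)
```

A matching CronJob can scale the placeholder back down after the spike window, so the Cluster Autoscaler's scale-down releases the extra nodes.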

kaviarasan-ex2 commented 3 years ago

Thanks for your response!

k8s-triage-robot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue or PR with `/reopen`
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/4038#issuecomment-944221824):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.