Pod throttling - Githubissues

buckleyGI commented 1 year ago

I came across Trimaran when looking for a solution to our pressing problem (still not solved) when we bring back a great number of pods after a blackout period.

Let me try to explain our situation. At night we scale all deployments to 0 replicas (and store the original value in an annotation) to reduce costs. In the morning the replicas are restored to their original values. This is where the problem starts: the pods overburden the node, and its CPU plateaus at 100%. The warmup probes start to fail and, for the pods that do make it through, the liveness probes then start restarting them. Too many pods are trying to come up at the same time... After about 2 hours the system finally stabilizes. If we start up the pods manually, say 5 per minute, the system comes up in less than 10 minutes (!). In other domains this is called the thundering herd problem or stampede problem, and we haven't found a fix for it in Kubernetes yet.

Is this where Trimaran can help us? I understand that Trimaran takes CPU utilization into account, but its goal is to schedule more efficiently, not to throttle pod creation per se, right? Would we be abusing Trimaran for our use case, or do you think it is a good fit?

If not, it is time for me to learn Go and the Scheduling Framework, as it can't be too hard to install a throttle for pod creation on a node, where at most x pods can be created in a sliding window of y minutes. Am I on the right track? Thank you.

Huang-Wei commented 1 year ago

cc @wangchen615 @zorro786 @atantawi

atantawi commented 1 year ago

@buckleyGI I need some clarification about the configuration. Are the CPU requests and limits the same, or different (and what are the values)? Are there init containers in the pods? If so, what are their requests and limits?

everpeace commented 1 year ago

FYI: I received an almost identical issue in https://github.com/everpeace/kube-throttler/issues/57, where I proposed two solutions for throttling the number of pods starting up in the cluster: kube-throttler or Pod Scheduling Readiness. See https://github.com/everpeace/kube-throttler/issues/57#issuecomment-1354719952

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/scheduler-plugins/issues/453#issuecomment-1555655294):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

kubernetes-sigs / scheduler-plugins

Pod throttling #453