kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Allow scaling up to meet PDB constraints #93476

Open howardjohn opened 4 years ago

howardjohn commented 4 years ago

What would you like to be added:

Currently, if you have a PDB spec like minAvailable: 1 and an HPA defining minReplicas=1 and maxReplicas=N, you may end up in a scenario where disruptions get "stuck" when the HPA has scaled down to 1. At any point an increase in load could cause the HPA to scale up to 2+, allowing the PDB to be satisfied, which leads to a weird situation where a disruption can only occur if there is high load.
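
For concreteness, a rough sketch of that combination (names, image, and API versions are illustrative, not taken from a real deployment):

```yaml
# Illustrative only: a PDB that requires 1 available pod, paired with an
# HPA that is allowed to scale the same Deployment down to 1 replica.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
# Under low load the HPA settles at 1 replica, so evicting that single pod
# would drop below minAvailable: 1 and the eviction is refused until load
# happens to push the HPA to 2+.
```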

Ideally, the PDB would be able to trigger the HPA to scale up to meet constraints. For example, if I have 1 replica and a disruption is triggered, scale up a new replica first, then terminate the old one. Given an HPA is in place, I have already clearly specified I am OK with my pod being scaled up, so this shouldn't have negative impact on stateful workloads.

Some more context: https://github.com/istio/istio/issues/12602

howardjohn commented 4 years ago

/sig autoscaling

howardjohn commented 4 years ago

It also seems that Deployment.strategy.maxSurge does not affect this. It feels like it should: if I allow a surge of 5 pods, why not surge up a replacement first and then evict the pod from the draining node?
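
For reference, this is the kind of setting I mean; as far as I can tell it only applies to rollouts driven by the Deployment controller itself (e.g. a template change), not to drains/evictions (sketch, names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 5          # extra pods allowed during a Deployment-driven rollout
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:latest  # placeholder image
```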

michaelgugino commented 4 years ago

If you are deploying a single replica of any particular pod, you're saying "it's okay if this pod is offline for some short time", because an involuntary disruption (e.g. node failure) could make that happen at any time.

So, you have two choices: Acknowledge it's okay to have 0 pods deployed for a short amount of time by removing the PDB, or adjust your replica count to ensure that you're always at N+1, and set PDBs appropriately.

howardjohn commented 4 years ago

Acknowledge it's okay to have 0 pods deployed for a short amount of time by removing the PDB

Wouldn't removing the PDB mean we can have voluntary downtime, whereas with the PDB we will only have involuntary?

michaelgugino commented 4 years ago

Wouldn't removing the PDB mean we can have voluntary downtime, whereas with the PDB we will only have involuntary?

This is true. But from my POV, by only having one replica, you're implicitly stating that losing that application for some length of time is a non-critical event for your cluster. Set an alert if the replica is down for a protracted period of time.

One approach to minimizing downtime for voluntary disruptions is to set an appropriately sized grace period. The pod will be marked deleted, and the ReplicaSet will spin up a new one. This wouldn't be ideal for StatefulSets, but again, for those you should really have N+1.
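
Roughly this shape, inside the Deployment's pod template (the value and image are illustrative):

```yaml
# Fragment of a Deployment's spec.template.spec -- illustrative values only.
spec:
  terminationGracePeriodSeconds: 60   # time between SIGTERM and SIGKILL on deletion/eviction
  containers:
  - name: my-app
    image: registry.example.com/my-app:latest  # placeholder image
```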

Crayeth commented 4 years ago

Ideally, the PDB would be able to trigger the HPA to scale up to meet constraints. For example, if I have 1 replica and a disruption is triggered, scale up a new replica first, then terminate the old one. Given an HPA is in place, I have already clearly specified I am OK with my pod being scaled up, so this shouldn't have negative impact on stateful workloads.

This. I don't want to have to run two copies of everything just so KureD can do its thing; that is mighty expensive. I understand that in case of an unplanned disruption my single instance will still go down, but that is within my acceptable risk and a tradeoff I'm willing to make. Planned disruptions, however, should not take it down, nor prevent KureD from doing node reboots (because that would violate the disruption budget).

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

howardjohn commented 3 years ago

/remove-lifecycle stale

michaelgugino commented 3 years ago

Ideally, the PDB would be able to trigger the HPA to scale up to meet constraints. For example, if I have 1 replica and a disruption is triggered, scale up a new replica first, then terminate the old one. Given an HPA is in place, I have already clearly specified I am OK with my pod being scaled up, so this shouldn't have negative impact on stateful workloads.

This. I don't want to have to run two copies of everything just so KureD can do its thing; that is mighty expensive. I understand that in case of an unplanned disruption my single instance will still go down, but that is within my acceptable risk and a tradeoff I'm willing to make. Planned disruptions, however, should not take it down, nor prevent KureD from doing node reboots (because that would violate the disruption budget).

Assuming your component has a suitable grace period timeout, it's not like eviction kill -9's the pod. For planned maintenance, the pod will be marked deleted and come up elsewhere.

With proper priority and grace period, I'm 100% confident something like KureD can run with a single replica without a PDB and the cluster administrator will never notice the difference.

simonwgill commented 3 years ago

@howardjohn if you're still facing this issue: I did have to think about this a bit. A potential solution is to use an extra metric.

If multiple metrics are specified in a HorizontalPodAutoscaler, this calculation is done for each metric, and then the largest of the desired replica counts is chosen. If any of these metrics cannot be converted into a desired replica count (e.g. due to an error fetching the metrics from the metrics APIs) and a scale down is suggested by the metrics which can be fetched, scaling is skipped. This means that the HPA is still capable of scaling up if one or more metrics give a desiredReplicas greater than the current value.

Use a cluster-wide metric that can be set before the job runs, and a target value in the autoscaler such that cluster-metric/target = minimum number of replicas to allow for 1 disruption.

In this case, where there's a need to deploy a second replica before the old one can get removed, the cluster-wide metric is binary 0 or 1, and the target metric is 0.5.

At any point you want to move pods around in this fashion, you just need to set that cluster-wide metric to 1; everything should scale up, then scale back down appropriately when the operation is finished.
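
A rough sketch of what the HPA side might look like with autoscaling/v2, assuming an external metric named planned_disruption is exposed through an external metrics adapter (all names here are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
  - type: External
    external:
      metric:
        name: planned_disruption    # hypothetical metric: set to 1 before maintenance, 0 otherwise
      target:
        type: AverageValue
        averageValue: 500m          # metric 1 / target 0.5 -> ceil = 2 replicas during maintenance
# With the metric at 0 this entry asks for 0 replicas, so the other metrics
# (and minReplicas) decide the count; with the metric at 1 it forces 2.
```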

Not tested. If you haven't found another solution, it might be worth considering. Thoughts?

howardjohn commented 3 years ago

@simonwgill unfortunately that would not help us, as we are shipping software to many users rather than deploying it ourselves, so we cannot control things like this. In general, cluster scaling operations tend to be fairly hands-off and automatic for some users as well.

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

olivierboudet commented 3 years ago

I also encounter this kind of issue, but without using an HPA. I have some pods with replicaCount=1 because I am OK with downtime in case of temporary, unplanned failures, but I want to avoid downtime during a planned node pool upgrade. If I do a node upgrade, I expect k8s to start a new pod first so that the old node can be drained while the PDB stays satisfied.

Basically, I have to manually do a 'kubectl rollout restart deploy' to drain a node and allow the PDB to be satisfied.

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

howardjohn commented 3 years ago

/remove-lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

Tristan971 commented 2 years ago

/reopen

k8s-ci-robot commented 2 years ago

@Tristan971: You can't reopen an issue/PR unless you authored it or you are a collaborator.

Tristan971 commented 2 years ago

Weird call to disregard this issue like this... it seems extremely counter-intuitive, at the very least.

ldemailly commented 2 years ago

Can we please reopen this? It's a very valid use case and seems like a bug in k8s.

I should be able to declare that I want 1 pod most of the time, yet have the replacement pod brought up first during a kubectl drain, instead of getting stuck in a loop with

error when evicting pods/... (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

I can't think of a workaround, sadly, to achieve this rather simple mode (I would indeed have expected that either maxSurge, the HPA, or the PDB would ensure this).

Or is a workaround to not have a PDB, and put something in the preStop hook that artificially bumps the minimum pod count to 2?
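
Something like this, maybe (completely untested; it assumes the image ships kubectl, the pod's service account is allowed to patch the HPA, and "my-app" is a placeholder name):

```yaml
# Fragment of a container spec -- a preStop hook that bumps the HPA floor
# so a replacement pod is created while this one shuts down.
lifecycle:
  preStop:
    exec:
      command:
      - sh
      - -c
      - kubectl patch hpa my-app -p '{"spec":{"minReplicas":2}}' && sleep 15
```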

t-l-k commented 2 years ago

I have a healthcheck on a service for a pod in a deployment with replicas=1. For unplanned outages I want the healthcheck to alert, as it currently does. For planned node pool upgrades I'd like the deployment equivalent of a rolling update's maxSurge=1, to satisfy the disruption budget's minAvailable=1 and keep the healthcheck from alerting.

I would also prefer that node upgrades not fail due to violating PDBs, given the above criteria. I can tolerate a short-term absence of the pod, but running more than 1 pod for the workload is a waste of compute, writing mutex code inside the container process to ensure it doesn't duplicate the work of the other replica would be a waste of effort, and so would having to temporarily scale any deployments with e.g. --replicas=2.

Does feel like a missing feature.

howardjohn commented 2 years ago

/reopen

k8s-ci-robot commented 2 years ago

@howardjohn: Reopened this issue.

k8s-ci-robot commented 2 years ago

@howardjohn: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

howardjohn commented 1 year ago

/reopen

k8s-ci-robot commented 1 year ago

@howardjohn: Reopened this issue.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

howardjohn commented 1 year ago

/reopen

k8s-ci-robot commented 1 year ago

@howardjohn: Reopened this issue.

h0jeZvgoxFepBQ2C commented 1 year ago

Bump, this would be really useful and also save us a LOT of money, since we have to keep 2 instances running whereas we only need one!

WJay-tec commented 1 year ago

Bump. From a cost-savings perspective, being able to run a single replica for our staging and development workloads while maintaining availability during a node drain would be extremely useful. It should temporarily increase the replica count when it encounters a PDB violation.

Pairing this with AWS Spot Instances, which face occasional eviction, would be an amazing cost-saving initiative while not sacrificing availability.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

h0jeZvgoxFepBQ2C commented 1 year ago

/reopen

k8s-ci-robot commented 1 year ago

@h0jeZvgoxFepBQ2C: You can't reopen an issue/PR unless you authored it or you are a collaborator.

nikopavlica commented 1 year ago

Our use case for this feature would be dev/review environments. I don't want the waste of having multiple pods lying around just so we can survive autoscalers like karpenter constantly moving pods around during their consolidation phases.

I think karpenter is especially problematic with this, since its consolidation algorithm is quite aggressive and will move pods around a lot.

We could easily tolerate an outage on a hardware error, but an outage caused by the autoscaler doing its thing is just a big fail.

h0jeZvgoxFepBQ2C commented 1 year ago

@howardjohn could you reopen this issue?

howardjohn commented 1 year ago

I can, but without an owner it's likely to get closed again.

/reopen

k8s-ci-robot commented 1 year ago

@howardjohn: Reopened this issue.

cen1 commented 1 year ago

Keep the fire alive 🙏

pbetkier commented 1 year ago

I think I understand the problem, but I'm skeptical that HPA is the right level to solve this. As I understand it, the HPA's job is simple: decide how many replicas should run to handle the load. Mitigating planned node disruptions seems like a different responsibility to me. Does the problem of blocked planned node disruptions only happen with HPA? Isn't it a general issue whenever running a 1-replica deployment?

Adding SIG Node to weigh in.

/sig node

Tristan971 commented 1 year ago

I'm skeptical if HPA is the right level to solve this

This is indeed not entirely HPA-specific, but the HPA forcing any temporary scale-up to 2 (whether manual or driven by another loop, e.g. one trying to satisfy a PDB during a perturbation like a node eviction) back down to 1 is the HPA-specific portion of it.

Isn't it a general issue whenever running a 1 replica deployment?

Yes and no; in this case what's "surprising" is that the HPA is, as mentioned in the op:

defining minReplicas=1 and maxReplicas=N

with N > 1. It merely becomes problematic because an HPA's scaling decision is enforced nearly like a static value of replicas on the replicaset (which makes sense in general).

If one sets replicas: 1 manually on a deployment, the argument of "well, you implicitly accept downtime here" is quite a bit stronger, imo.

To be honest it's not so easy to fault anyone for it, and it's quite tricky to decide on what the right behaviour is. But it is a common-enough problem in cases like dev environments that it's a pain point.

sftim commented 1 year ago

Maybe the first step is to add a way to label ReplicaSets (or maybe Pods?) so that scale-in is temporarily suspended.

If we have that label, we can teach HPA and friends to honor it.
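
Purely as an illustration of the shape that could take (no such label exists today; the name is made up):

```yaml
# Hypothetical marker an HPA or other controller could be taught to honor.
metadata:
  labels:
    autoscaling.kubernetes.io/scale-in-suspended: "true"
```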

froblesmartin commented 1 year ago

I am not sure if it is the only or most general use case, but for me this is a limitation when draining a node.

Could the draining process be the one in charge of orchestrating this? Temporarily increasing the replicas to 2 so that the deletion of the original pod succeeds, and then going back to 1 replica, or just letting the HPA do its job?