Open howardjohn opened 4 years ago
/sig autoscaling
It also seems that Deployment.strategy.maxSurge will not impact this either. This seems like it should have an impact: if I allow a surge of 5 pods, why not scale up and then evict the pod on the draining node?
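For reference, the knob being discussed is the Deployment's rolling-update surge setting; a minimal sketch, with placeholder name and image, might look like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example                    # placeholder name
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 5                  # extra pods allowed above replicas during a rollout
      maxUnavailable: 0
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: app
        image: registry.example.com/app:latest   # placeholder image
```

As far as I know, maxSurge only applies to rollouts driven by the Deployment controller; a kubectl drain goes through the eviction API rather than a rollout, which is why the surge allowance never kicks in here.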
If you are deploying a single replica of any particular pod, you're saying "it's okay if this pod is offline for some short time," since an involuntary disruption (e.g., a node failure) could result in exactly that.
So, you have two choices: Acknowledge it's okay to have 0 pods deployed for a short amount of time by removing the PDB, or adjust your replica count to ensure that you're always at N+1, and set PDBs appropriately.
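A minimal sketch of the second option, assuming a Deployment labeled app: example that has been scaled to 2 replicas (names are placeholders):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
spec:
  minAvailable: 1            # with 2 replicas, one pod can always be evicted
  selector:
    matchLabels:
      app: example           # must match the Deployment's pod labels
```

With replicas: 2 and minAvailable: 1, a drain can evict one pod while the other keeps serving traffic.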
Acknowledge it's okay to have 0 pods deployed for a short amount of time by removing the PDB
Wouldn't removing the PDB mean we can have voluntary downtime, whereas with the PDB we will only have involuntary?
Wouldn't removing the PDB mean we can have voluntary downtime, whereas with the PDB we will only have involuntary?
This is true. But from my POV, by only having one replica, you're implicitly stating that losing that application for some length of time is a non-critical event for your cluster. Set an alert if the replica is down for a protracted period of time.
One approach to minimizing downtime for voluntary disruptions is to set an appropriately timed grace period. The pod will be marked deleted, and the ReplicaSet will spin up a new one. This wouldn't be ideal for StatefulSets, but again, for those you should really have N+1.
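A minimal sketch of that approach, assuming a plain single-replica Deployment (the name, image, and 60s value are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      terminationGracePeriodSeconds: 60   # time between SIGTERM on eviction and SIGKILL
      containers:
      - name: app
        image: registry.example.com/app:latest   # placeholder image
```

The eviction marks the pod for deletion, the kubelet sends SIGTERM and waits up to the grace period before SIGKILL, and the ReplicaSet creates a replacement (typically on another node) in the meantime.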
Ideally, the PDB would be able to trigger the HPA to scale up to meet constraints. For example, if I have 1 replica and a disruption is triggered, scale up a new replica first, then terminate the old one. Given an HPA is in place, I have already clearly specified I am OK with my pod being scaled up, so this shouldn't have negative impact on stateful workloads.
This. I don't want to have to run two copies of everything just so Kured can do its thing; that is mighty expensive. I understand that in the case of an unplanned disruption my single instance will still go down, but that is within my acceptable risk and a tradeoff I'm willing to make. But planned disruptions should not make it go down, or prevent Kured from doing node reboots (because that would violate the disruption budget).
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Ideally, the PDB would be able to trigger the HPA to scale up to meet constraints. For example, if I have 1 replica and a disruption is triggered, scale up a new replica first, then terminate the old one. Given an HPA is in place, I have already clearly specified I am OK with my pod being scaled up, so this shouldn't have negative impact on stateful workloads.
This. I don't want to have to run two copies of everything just so Kured can do its thing; that is mighty expensive. I understand that in the case of an unplanned disruption my single instance will still go down, but that is within my acceptable risk and a tradeoff I'm willing to make. But planned disruptions should not make it go down, or prevent Kured from doing node reboots (because that would violate the disruption budget).
Assuming your component has a suitable grace period timeout, it's not like eviction kill -9's the pod. For planned maintenance, the pod will be marked deleted and come up elsewhere.
With proper priority and grace period, I'm 100% confident something like Kured can run with a single replica without a PDB, and the cluster administrator will never notice the difference.
@howardjohn if you're still facing this issue, I did have to think about this a bit. A potential solution is to use an extra metric.
If multiple metrics are specified in a HorizontalPodAutoscaler, this calculation is done for each metric, and then the largest of the desired replica counts is chosen. If any of these metrics cannot be converted into a desired replica count (e.g. due to an error fetching the metrics from the metrics APIs) and a scale down is suggested by the metrics which can be fetched, scaling is skipped. This means that the HPA is still capable of scaling up if one or more metrics give a desiredReplicas greater than the current value.
Use a cluster-wide metric that can be set before the job runs, and a target value in the autoscaler such that cluster-metric/target = minimum number of replicas to allow for 1 disruption.
In this case, where there's a need to deploy a second replica before the old one can be removed, the cluster-wide metric is binary (0 or 1) and the target value is 0.5, so a metric of 1 yields a desired replica count of 2.
Any time you want to move pods around in this fashion, you just need to set that cluster-wide metric to 1; everything should scale up, and then scale back down appropriately when the operation is finished.
Not tested. If you haven't found another solution, it might be worth considering. Thoughts?
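To make the idea concrete, an HPA combining a normal CPU metric with such a cluster-wide external metric might look roughly like the sketch below; the metric name is hypothetical and would have to be served by an external metrics adapter (e.g. prometheus-adapter):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example                       # placeholder target
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80          # the normal scaling signal
  - type: External
    external:
      metric:
        name: maintenance_in_progress   # hypothetical cluster-wide 0/1 metric
      target:
        type: AverageValue
        averageValue: 500m              # metric of 1 / target 0.5 => at least 2 replicas
```

While the metric is 0, the external term contributes nothing and the CPU metric (bounded by minReplicas) governs; flipping it to 1 pushes the desired count to at least 2 so the eviction can proceed, and flipping it back lets the HPA scale down again.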
@simonwgill unfortunately that would not help us, as we are shipping software to many users rather than deploying it ourselves, so we cannot control things like this. In general, cluster scaling operations tend to be fairly hands-off and automatic for some users as well.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
I also encounter this kind of issue, but without the use of an HPA. I have some pods with replicaCount=1 because I am OK with downtime in the case of temporary, unplanned failures, but I want to avoid downtime during a planned node pool upgrade. If I do a node upgrade, I expect k8s to start a new pod so that the old node can be drained while satisfying the PDB.
Basically, I have to manually do a 'kubectl rollout restart deploy' to drain a node and allow the PDB to be satisfied.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
/reopen
@Tristan971: You can't reopen an issue/PR unless you authored it or you are a collaborator.
Weird call to disregard this issue like this... Seems extremely counter-intuitive, at the very least
Can we please reopen this? It's a very valid use case and seems like a bug in k8s.
I should be able to specify that I want 1 pod most of the time, yet bring up a replacement pod first in the case of a kubectl drain,
instead of getting stuck in a loop with
error when evicting pods/... (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I can't think of a workaround, sadly, to achieve this rather simple mode of operation (I would indeed have expected that either maxSurge, the HPA, or the PDB would ensure this).
Or is a workaround to not have a PDB, and to put something in the preStop hook that artificially bumps the minimum pod count to 2?
I have a healthcheck on a service for a pod in a deployment with replicas=1. For unplanned outages I want the healthcheck to alert, as it currently does. For planned node pool upgrades I'd like the deployment equivalent of a rolling upgrade's maxSurge=1, to satisfy a disruption budget of minAvailable=1, with the healthcheck not alerting.
I also would prefer that node upgrades not fail due to violating PDBs, given the above criteria. I can tolerate a short-term absence of the pod, but running more than 1 pod for the workload is a waste of compute, and writing mutex code inside the container process to ensure it doesn't duplicate the work of the other replica would be a waste of effort, as would having to temporarily scale any deployments with e.g. --replicas=2.
Does feel like a missing feature.
/reopen
@howardjohn: Reopened this issue.
@howardjohn: This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
@howardjohn: Reopened this issue.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
@howardjohn: Reopened this issue.
Bump, this would be really useful and would also save us a LOT of money, since we have to keep 2 instances running whereas we only need one!
Bump, from the perspective of cost savings, being able to run a single replica for our staging and development workloads while maintaining availability during a node drain is extremely useful. It should temporarily increase the replica count when it encounters a PDB violation.
Pairing this with AWS Spot Instances, which face occasional eviction, would be an amazing cost-saving initiative while not sacrificing availability.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
@h0jeZvgoxFepBQ2C: You can't reopen an issue/PR unless you authored it or you are a collaborator.
Our use case for this feature would be dev/review environments. I don't want the waste of having multiple pods lying around just so we can survive autoscalers like Karpenter constantly moving pods around during their consolidation phases.
I think Karpenter is especially problematic here, since its consolidation algorithm is quite aggressive and will move pods around a lot.
We could easily tolerate an outage on a hardware error, but an outage due to the autoscaler doing its thing is just a big fail.
@howardjohn could you reopen this issue?
I can, but without an owner it's likely to get closed again.
/reopen
@howardjohn: Reopened this issue.
Keep the fire alive 🙏
I think I understand the problem, but I'm skeptical that the HPA is the right level to solve this. As I understand it, the HPA's job is simple: decide how many replicas should run to handle the load. Mitigating planned node disruptions seems like a different responsibility to me. Does the problem with blocked planned node disruptions only happen with the HPA? Isn't it a general issue whenever running a 1-replica deployment?
Adding SIG Node to weigh in.
/sig node
I'm skeptical if HPA is the right level to solve this
This is indeed not entirely HPA-specific, but the HPA forcing any temporary scale-up to 2 (whether manual or driven by another loop, like trying to satisfy a PDB during some perturbation such as a node eviction) back to 1 is the HPA-specific portion of it.
Isn't it a general issue whenever running a 1 replica deployment?
Yes and no; in this case what's "surprising" is that the HPA is, as mentioned in the OP:
defining minReplicas=1 and maxReplicas=N
with N > 1. It only becomes problematic because an HPA's scaling decision is enforced much like a static replicas value on the ReplicaSet (which makes sense in general).
If one sets replicas: 1 manually on a deployment, the argument of "well, you implicitly accept downtime here" is quite a bit stronger, IMO.
To be honest it's not so easy to fault anyone for it, and it's quite tricky to decide on what the right behaviour is. But it is a common-enough problem in cases like dev environments that it's a pain point.
Maybe the first step is to add a way to label ReplicaSets (or maybe Pods?) so that scale-in is temporarily suspended.
If we have that label, we can teach HPA and friends to honor it.
I am not sure if it is the only or most generic use case, but for me this is a limitation when draining a node.
Could the draining process be the one in charge of orchestrating this: temporarily increasing the replicas to 2 so that the deletion of the original pod succeeds, and then going back to 1 replica, or just letting the HPA do its job?
What would you like to be added:
Currently if you have a PDB spec like minAvailable: 1, and an HPA defining minReplicas=1 and maxReplicas=N, you may end up in a scenario where disruptions get "stuck" if the HPA has scaled to 1. At any point an increase in load could cause the HPA to scale up to 2+, allowing the PDB to be satisfied, which leads to a weird scenario where a disruption can only occur if there is high load.
Ideally, the PDB would be able to trigger the HPA to scale up to meet constraints. For example, if I have 1 replica and a disruption is triggered, scale up a new replica first, then terminate the old one. Given an HPA is in place, I have already clearly specified I am OK with my pod being scaled up, so this shouldn't have negative impact on stateful workloads.
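For concreteness, a minimal sketch of the configuration described above, with placeholder names and N shown as 5:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example
  minReplicas: 1
  maxReplicas: 5                  # "N" from the description above
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```

Under low load the HPA holds the Deployment at 1 replica, so evicting that single pod would drop availability below minAvailable: 1 and kubectl drain keeps retrying the eviction.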
Some more context: https://github.com/istio/istio/issues/12602