We've talked about this a fair bit -- I think it should be combined/collapsed w/ ttlSecondsAfterEmpty.
The challenge with this issue is more technical than anything. Computing ttlSecondsAfterEmpty is cheap, since we can cheaply compute empty nodes. Computing a consolidatable node requires a scheduling simulation across the rest of the cluster. Computing this for all nodes is really computationally expensive. We could potentially compute this once on the initial scan, and again once the TTL is about to expire. However, this can lead to weird scenarios like:
The only way to get the semantic to be technically correct is to recompute the consolidatability for the entire cluster on every single pod creation/deletion. The algorithm described above is a computationally feasible way (equivalent to current calculations), but has weird edge cases. Would you be willing to accept those tradeoffs?
The only way to get the semantic to be technically correct is to recompute the consolidatability for the entire cluster on every single pod creation/deletion. The algorithm described above is a computationally feasible way (equivalent to current calculations), but has weird edge cases. Would you be willing to accept those tradeoffs?
I'm a little unclear on this, and I think it's in how I'm reading it, not in what you've said. What I think I'm reading is that running the consolidatability computation on every single pod creation/deletion is too expensive. As an alternative, the algorithm above is acceptable, but in some cases it could result in node consolidation in less than TTLSecondsAfterConsolidatable due to fluctuation in cluster capacity between the initial check (t0) and the confirmation check (t0+30s in the example).
Have I understood correctly?
Yeah exactly. Essentially, the TTL wouldn't flip flop perfectly. We'd be taking a rough sample (rather than a perfect sample) of the data.
Thanks for the clarity. For my usage I'd not be concerned about the roughness of the sample. As long as there was a configurable time frame and the confirmation check needed to pass both times I'd be satisfied.
What I thought I wanted before being directed to this issue was to be able to specify how the consolidator was configured a bit like the descheduler project because I'm not really sure if the 'if it fits it sits' approach to scheduling is what I need in all cases.
Specifically, what behavior of descheduler did you want?
Generally I was looking for something like the deschedulerPolicy.strategies config block, which I generally interact with through the Helm values file. More specifically, I was looking for deschedulerPolicy.strategies.LowNodeUtilization.params.nodeResourceUtilizationThresholds, with its targetThresholds and thresholds.
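For reference, a minimal sketch of the descheduler policy block being referred to (the format is the descheduler v1alpha1 DeschedulerPolicy; the threshold numbers are illustrative, not a recommendation):

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # nodes with usage below all of these are considered underutilized
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        # pods may be evicted from nodes with usage above any of these
        targetThresholds:
          "cpu": 50
          "memory": 50
          "pods": 50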
To give another example of this need, I have a cluster that runs around 1500 pods - there are lots of pods coming in and out at any given moment. It would be great to be able to specify a consolidation cooldown period so that we are not constantly adding/removing nodes. Cluster Autoscaler has the flag --scale-down-unneeded-time that helps with this scenario.
Is this feature available yet?
We are facing the same issue with high node rotation due to overly aggressive consolidation. It would be nice to tune and control the behaviour: a minimum node liveness TTL, a TTL threshold for how long a node has been empty or underutilised, merging nodes, and so on.
cluster-autoscaler has other options too, like the --scale-down-delay-after-add, --scale-down-delay-after-delete, and --scale-down-delay-after-failure flags. E.g. --scale-down-delay-after-add=5m to decrease the scale-down delay to 5 minutes after a node has been added.
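For reference, this is roughly how those knobs are passed to cluster-autoscaler (the flag names come from the CAS documentation; the container-spec fragment and the values are illustrative):

containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.2   # illustrative tag
    command:
      - ./cluster-autoscaler
      - --scale-down-unneeded-time=10m       # how long a node must be unneeded before removal
      - --scale-down-delay-after-add=5m      # wait after a scale-up before considering scale-down
      - --scale-down-delay-after-delete=1m   # wait after a node deletion
      - --scale-down-delay-after-failure=3m  # wait after a failed scale-down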
I'm looking forward to something like scale-down-delay-after-add to pair with consolidation. Our hourly cronjobs are also causing node thrashing.
Another couple of situations that currently lead to high node churn are:
In both situations above, some workloads end up being restarted multiple times within a short time frame due to node churn, and if not enough replicas are configured with sufficient anti-affinity/skew, there is a chance of downtime while pods become ready once again on new nodes.
It would be nice to be able to control the consolidation period, say every 24 hours or every week as described by the OP so it's less disruptive. Karpenter is doing the right thing though!
I suspect some workarounds could be:
Any other ideas or suggestions appreciated.
Adding here as another use case where we need better controls over consolidation, especially around utilization. For us, there's a trade-off between utilization efficiency and disruptions caused by pod evictions. For instance, let's say I have 3 nodes, each utilized at 60%; the current behavior is that Karpenter will consolidate down to 2 nodes at 90% capacity. But in some cases, evicting the pods on the node to be removed is more harmful than achieving optimal utilization. It's not that these pods can't be evicted (for that we have the do-not-drain annotation), it's just that it's not ideal ... a good example would be Spark executor pods: while they can recover from a restart, it's better if they are allowed to finish their work at the expense of some temporary inefficiency in node utilization.
CAS has the --scale-down-utilization-threshold flag (along with the other flags mentioned), and it seems like Karpenter needs a similar tunable. Unfortunately, we're seeing so much disruption in running pods because of consolidation that we can't use Karpenter in any of our active clusters.
@thelabdude can't your pods set terminationGracePeriodSeconds (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)?
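For context, terminationGracePeriodSeconds is a pod-spec field; a minimal sketch (the pod name, image, and value are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: spark-executor-example          # hypothetical pod name
spec:
  # give the workload up to 30 minutes to finish in-flight work after receiving SIGTERM
  terminationGracePeriodSeconds: 1800
  containers:
    - name: executor
      image: example.com/spark-executor:latest   # placeholder image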
I'll have to think about whether the termination grace period could help us, but I wouldn't know what value to set, and it would probably vary by workload ...
My point was more, I'd like better control over the consolidate decision with Karpenter. If I have a node hosting expensive pods (in terms of restart cost), then a node running at 55% utilization (either memory / cpu) may be acceptable in the short term even if the ideal case is to drain off the pods on that node to reschedule on other nodes. Cluster Auto-scaler provides this threshold setting and doesn't require special termination settings on the pods.
I'm not saying a utilization threshold is the right answer for Karpenter but the current situation makes it hard to use in practice because we get too much pod churn due to consolidation and our nodes are never empty, so turning consolidation off isn't a solution either.
Hey @thelabdude, this is a good callout of core differences between CA's deprovisioning and Karpenter's deprovisioning. Karpenter has intentionally chosen not to use a threshold, since for any threshold you create, the heterogeneous nature of pod resource requests can produce unwanted edge cases that constantly need to be fine-tuned.
For more info, ConsolidationTTL here would simply act as a waiting mechanism between consolidation actions, which you can read more about here. Since this would essentially just be a wait, this will simply slow down the time Karpenter takes to get to the end state as you've described. One idea that may help is if Karpenter allows some configuration of the cost-benefit analysis that Consolidation does. This would need to be framed as either cost or utilization, both tough to get right.
If you're able to in the meantime, you can set the do-not-evict annotation on the pods you don't want consolidated, and you can also use the do-not-consolidate node annotation. More here.
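For anyone looking for the exact annotations mentioned above, a minimal sketch (keys as referenced in this comment; check the docs for your Karpenter version, since these keys have changed across releases):

# Pod-level: ask Karpenter not to voluntarily evict this pod
apiVersion: v1
kind: Pod
metadata:
  name: expensive-pod                     # illustrative name
  annotations:
    karpenter.sh/do-not-evict: "true"
spec:
  containers:
    - name: app
      image: example.com/app:latest       # placeholder image
---
# Node-level: exclude this node from consolidation, e.g. via
#   kubectl annotate node <node-name> karpenter.sh/do-not-consolidate=true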
Are there any plans to implement or accept such a feature that adds some sort of time delay between node provisioning and consolidation? Perhaps based on the age of a node? The main advantage would be to increase stability during situations where there are surges in workload (scaling, scheduling, or roll outs).
Hey, can you just add a delay before starting consolidation after pod changes?
You could add several delays:
This would help run consolidation during periods of low activity on the cluster.
Also see issue https://github.com/aws/karpenter-core/issues/696: Exponential decay for cluster desired size
This comment suggests another approach we might consider.
My point was more, I'd like better control over the consolidate decision with Karpenter. If I have a node hosting expensive pods (in terms of restart cost), then a node running at 55% utilization (either memory / cpu) may be acceptable in the short term even if the ideal case is to drain off the pods on that node to reschedule on other nodes. Cluster Auto-scaler provides this threshold setting and doesn't require special termination settings on the pods.
(from https://github.com/aws/karpenter-core/issues/735)
Elsewhere in Kubernetes, ReplicaSets can pay attention to a Pod deletion cost.
For Karpenter, we could have a Machine or Node level deletion cost, and possibly a contrib controller that raises that cost based on what is running there.
Imagine that you have a controller that detects when Pods are bound to a Node and updates the node deletion cost based on some quality of the Pod. For example: if you have a Pod annotated as starts-up-slowly, you set the node deletion cost for that node to 7 instead of the base value of 0. You'd also reset the value once the node didn't have any slow-starting Pods.
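For comparison, the existing ReplicaSet mechanism, plus a sketch of the hypothetical node-level analog described above (the node annotation does not exist in Karpenter today; that key is invented purely for illustration):

# Existing: ReplicaSets prefer to scale down pods with the lowest deletion cost
apiVersion: v1
kind: Pod
metadata:
  name: slow-starting-pod                             # illustrative name
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "100"
spec:
  containers:
    - name: app
      image: example.com/app:latest                   # placeholder image
---
# Hypothetical: a node-level deletion cost a contrib controller could maintain,
# raised while the node hosts any pod annotated as starts-up-slowly, e.g.
#   kubectl annotate node <node-name> node.example/deletion-cost=7   # made-up key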
We are in need of something like this, as well. Consolidation is too aggressive for our application rollouts, and is causing more issues and failures than make it worth the cost of running extra nodes.
Ideally, we'd like for Karpenter to have the capability to recognize it just added nodes, it shouldn't immediately be throwing more churn into the mix to deprovision nodes, especially before all pods that triggered the initial node creation are ready and available. Some options that would help:
A way to change whatever the 10s poll interval is, so that we can make the consolidation check more like once an hour - we're not in so much of a crunch that an hour of extra compute would bankrupt us.

A scale-down-delay-after-add as discussed above, used in cluster-autoscaler, to force Karpenter to allow some time for everything to become healthy before removing nodes.

A ttlSecondsAfterUnderutilized setting within the consolidation configuration block, which would require Karpenter to make the first assessment that a node could be consolidated, and if after this TTL it still finds the same recommendation, then and only then would it work to consolidate that node. This means that if other activity occurs during that wait time (e.g. pods added, removed, instance prices change, etc.) the evaluation may come to a different conclusion and the timer restarts. Yes, this means that a really high TTL or a high-churn cluster would struggle to actually have a consolidation take place, but if a user wants to configure this, then that is what they want -- they want consolidation to be less aggressive and occur less often -- so let them.

As we're thinking about how to introduce better controls for consolidation, one of the questions we've come up against is whether or not Karpenter users care about having different TTLs or "knobs" for terminating under-utilized nodes compared to empty nodes.
React to this message with a 👍 if you'd prefer multiple, separate "knobs" for emptiness and under-utilization or use a 👎 if you'd like a single control for both. If you could also share a bit of detail about your use case below, that'd be even better!
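To make the poll question concrete, here is a purely hypothetical sketch of what separate knobs could look like in the disruption block; neither field below exists today, and the names and values are invented for illustration only:

disruption:
  consolidationPolicy: WhenUnderutilized
  # hypothetical: reclaim completely empty nodes quickly...
  consolidateAfterEmpty: 1m
  # ...but wait much longer before disrupting nodes that still run pods
  consolidateAfterUnderutilized: 1h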
Not a use case exactly but if Karpenter doesn't duplicate kube-scheduler, and that's by design, then I think I also wouldn't duplicate descheduler and alternatives. That was why I picked :-1:.
If we one day want to enable complex behavior for selecting when and how to delete only some Pods from a partially empty node, I'd ideally want to coordinate with the Kubernetes project. That coordination is to find and agree on a way to mark (annotate or label):
BTW Karpenter doesn't need a co-ordination point for it, but tainting a node that is due for removal means Pods shouldn't schedule there (if a whole load of unschedulable Pods turn up, Karpenter can always remove the taint and cancel an intended consolidation).
In that case of cancelled consolidation, I'm imagining that Karpenter also identifies the Pods labelled as pending eviction (for low node utilization) - and annotates the node to tell descheduler “wait, no, not the Pods on this node”.
Setting up those expectations will let cluster operators implement consolidation that fits their use case, by combining custom scheduling, custom descheduling, and custom [Karpenter] autoscaling. There are other designs such as a node deletion cost. Overall, I hope that we - Karpenter - find a way to play well with others for complex needs, and still meet the simple needs for cluster operators who are happy with the basic implementation.
Within Karpenter's domain - nodes and machines - it's fine to have customization because managing Nodes is what Karpenter is for. So, for example, Karpenter could wait some defined number of seconds after one consolidation operation before planning another. No objection to that delay, and it'd help manage hysteresis. Similarly, an are-you-sure period: “require Karpenter to make the first assessment that a node could be consolidated, and if after this ttl it still finds the same recommendation, then and only then would it work to consolidate that node“ sounds fine.
It's only when any of these knobs cross into the domain of scheduling and descheduling that I have concerns.
We have been really satisfied so far with how Karpenter behaves in our smaller clusters and staging. A couple of days ago I updated the first production cluster - one that really has customer traffic and scales. Karpenter works well enough, but I found we have a number of deployments that tend to scale aggressively up and down, even within one hour. I am attaching a graph with the number of replicas. This is not the only deployment that behaves like this, and together they cause Karpenter to add ~5 nodes several times per hour, just to remove them a couple of minutes later and wait for another round. I asked the team if this scaling really works for them and the answer was yes, this is fine for them. They could smooth it with some HPA/v2 features, but it's not really needed - it would only be for Karpenter's sake.
I think in this case, if the nodes waited for a while (a configurable number of seconds or minutes) in the cluster, it would lower the total % of allocatable CPU utilised, but it would also lower the total churn. Because we run an overprovisioning deployment to give us some free capacity buffer, this is not disrupting cluster workloads that much, but we want to get rid of overprovisioning to lower cluster costs - one of the reasons is that with Karpenter the scaling is even faster, so we don't need the extra buffer.
We are, however, pausing the Karpenter rollout to the remaining production clusters to see how this big node churn affects the cost of the cluster (is it even better than with CAS?) and the already mentioned traffic cost.
I know there was a similar use case in this issue already, but I thought I could support it with ours. The picture shows the number of replicas over time. These are the requests/limits for this deployment:
Limits:
memory: 12Gi
Requests:
cpu: 600m
memory: 2Gi
As we're thinking about how to introduce better controls for consolidation, one of the questions we've come up against is whether or not Karpenter users care about having different TTLs or "knobs" for terminating under-utilized nodes compared to empty nodes.
Our cluster that sees the most variation in size is primarily used for CI jobs. With this type of workload:
We burst up from effectively zero to $LARGE_NUMBER_FOR_US nodes, depending on the work days
(We also have a bunch of ML batch-y workloads that we want to use Karpenter more for that follow similar patterns.)
I think what would be most useful is something like aws/karpenter-core#696. That balances retaining capacity with having bursts of activity look like peaks instead of (expensive) plateaus. I don't particularly care whether capacity is reduced from empty nodes or ones with low utilization (Karpenter optimizing that based on disruption budgets and cost sounds fine). So my answer to the "number of knobs" question is "as many as are needed to have something like exponentially decaying capacity", but not more.
Just to add another issue that is caused by the fast scale-down: in our particular use case there are a lot of organization-wide AWS Config rules that get evaluated every time a node comes up.
So on days with a lot of bursts of CI jobs, we end up paying as much for AWS Config as we do for EC2.
We've reached a point where we're considering whether keeping Karpenter is still viable :disappointed:
We're primarily seeing this when rolling out a replacement of a large deployment, 200+ pods. Karpenter goes absolutely crazy during this scale out/in to the point where the AWS load balancer controller we use starts to run into reconciliation throttling due to massive amounts of movement during the deployment. New containers will end up on a new node that only lives for 5 minutes. We see many nodes come up for 5 or 10 minutes during this one rolling deploy, before things settle down. Sometimes it gets into a cycle where the rollout takes 30 minutes, where without consolidation on it would take 3.
We're considering doing a patch of the provisioner to temporarily disable consolidation right before the rollout starts, waiting for the deployment to normalize, and then turning consolidation back on. This feels really ugly, but I think it would work in our specific case.
I wonder if some kind of pattern for this would be useful. I can't think of a great interface off the top of my head, but some way to signal to "pause" consolidation for a period of time in a way that doesn't mess with the provisioner such that our helm charts containing the provisioner template could always cleanly apply. Maybe?
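For reference, a sketch of the consolidation toggle that such a patch would flip, assuming the v1alpha5-era Provisioner API in use at the time (name and values are illustrative):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: false   # patched to false just before the big rollout, back to true once it settles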
If we were able to make a consolidation happen when we decided, that would likely help us. We could run it outside working hours when we're shipping a lot.
The documentation in 0.32.1 for consolidateAfter is very confusing. Specifically, it states:
ConsolidateAfter is the duration the controller will wait before attempting to terminate nodes that are underutilized.
But it's not compatible with WhenUnderutilized! From the description, it sounds like it applies to "nodes that are underutilized", but that is not the case.
Hello everyone! I do agree with @thelabdude. I was expecting to use consolidateAfter along with WhenUnderutilized.
Our use case is similar to the ones mentioned above. I would really like Karpenter to optimize underutilized workloads.
The problem is that this can't happen during our application deployments, otherwise Spinnaker gets lost while enabling/disabling traffic for the blue/green deployment: Spinnaker will try to add a label to a pod that doesn't exist anymore because Karpenter deleted the node.
consolidateAfter would work perfectly for this, but it should work along with WhenUnderutilized.
I hope that makes sense. Do we have any plans to support this?
If you want to protect Pods during deployment, you could temporarily annotate them as exempt from consolidation @mullermateus. In other words, existing features let you consolidate underused nodes, and to protect Pods from eviction linked to that consolidation.
https://github.com/kubernetes-sigs/karpenter/issues/696 also proposes a mechanism to delay the scale-in.
Definitely a feature that would be good to have. Is there any further update on this / whether it's being worked on?
Hello, is there any news? I really need this.
I expected consolidateAfter to work with WhenUnderutilized, as described here: https://github.com/kubernetes-sigs/karpenter/issues/735#issuecomment-1864822421
The only way to get the semantic to be technically correct is to recompute the consolidatability for the entire cluster on every single pod creation/deletion. The algorithm described above is a computationally feasible way (equivalent to current calculations), but has weird edge cases. Would you be willing to accept those tradeoffs?
You are overthinking this!
Just help us reduce node thrashing FFS.
The only way to get the semantic to be technically correct is to recompute the consolidatability for the entire cluster on every single pod creation/deletion.
@ellistarn I disagree. The following algorithm might be "correct" and effective:
1) Record the current state of the cluster.
2) Run a simulation of the cluster and record its proposed consolidated state. (expensive)
3) On every pod creation, try to schedule the pod on both the current and the consolidated cluster. (cheap)
4) If the pod schedules on the current but not the consolidated cluster, that would indicate thrashing.
4b) Re-run the simulation at this "high-watermark" to create a new, less-consolidated proposed state.
4c) Reset the timer.
4d) Go to step 3.
5) If consolidateAfter elapses without running into situation 4, proceed with the consolidation.
I've advocated for some more nuanced mechanisms (in other words, rejecting a plain TTL approach). If we only allow the simple thing, we might make it harder to support more complex mechanisms. I'm keen that if we do paint ourselves into an API design corner, we do it with our eyes open.
If the simple TTL approach is what folks are keen to have, I wonder if we can allow either:
.spec.consolidationMinimumDelay (day 1)
.spec.consolidationPolicy (day n; rules specified in CEL based on values exposed by Karpenter, potentially reevaluated whenever one input value changes)
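Purely to illustrate the shape of that idea, a hypothetical sketch (neither field exists in Karpenter today; the CEL expression and the values it references are invented):

spec:
  # day 1: a simple minimum delay between consolidation actions
  consolidationMinimumDelay: 15m
  # day n: a rule in CEL over values Karpenter would expose, reevaluated when an input changes
  consolidationPolicy: "node.ageSeconds > 600 && cluster.pendingPods == 0"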
@almson, @sftim: hey, please, to begin with just add a delay before nodes get consolidated, as described in the comment https://github.com/kubernetes-sigs/karpenter/issues/735#issuecomment-1864822421
You can think about other non-trivial mechanisms later on; for now that will be enough. It's also okay if you change the API later on - you already did that with previous Karpenter releases, and I believe nobody will blame you.
You've built an awesome product; please add this, and I think it will go from awesome to perfect =)
We have the exact same issue. Can't we learn from Cluster Autoscaler and implement the scale-down-delay feature (which would look like .spec.consolidationMinimumDelay)? It sounds like we lost a capability we relied on in the past by moving to Karpenter. We have hundreds of developers complaining about pods being disrupted due to scaling up/down events; some pods are disrupted more than 20 times in an hour. Having a scale-down delay would mitigate this.
I am considering reverting the move and switching back to cluster autoscaler just for that missing feature.
Let me know if we can help on this & how (design document, implementation, etc.).
Another complication that Karpenter makes worse is that certain metrics vendors may bill based on unique metrics series produced over a month. Karpenter's consolidation behavior will actively make this worse as it replaces/reschedules workloads and it'd be useful if we could constrain how often this might happen.
I agree the default scale-down / consolidation behaviour is too aggressive in some use cases.
Most Karpenter users have previously used Cluster Auto Scaler which is much much slower in scale downs. From CAS docs at https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-scale-down-work:
If a node is unneeded for more than 10 minutes, it will be terminated [by Cluster Auto Scaler]
I understand Karpenter's key purpose is to avoid unnecessary cloud cost, but in some cases Karpenter operators disagree with what Karpenter deems unnecessary. For CI/CD and other Jobs heavy use cases, having young nodes stick around while underutilized, because new work is expected, is great for devex and justifies cost. IMHO.
Something like the suggested .spec.consolidationMinimumDelay would meet the basic need of delaying consolidation. Other advanced methods would probably be used less?
For CI/CD and other Jobs heavy use cases, having young nodes stick around while underutilized, because new work is expected, is great for devex and justifies cost.
I (still) think we're missing an opportunity to use placeholder Pods to provide overhead capacity, leaving the scheduler to preempt those placeholders as appropriate and for Karpenter to stick to its knitting around automatic scaling. A controller to manage those placeholder Pods could make a great addition to the Kubernetes project. However, we're missing a really good set of user stories that we can use to try to fit solutions against.
(Would anyone like to work on collating that set of user stories?)
One thing to remember: if we add .spec.consolidationMinimumDelay, we probably have to live with that decision for a decade or more. Putting things into an API that will graduate to stable has costs as well as benefits.
Hi @sftim
I (still) think we're missing an opportunity to use placeholder Pods to provide overhead capacity, leaving the scheduler to preempt those placeholders as appropriate and for Karpenter to stick to its knitting around automatic scaling. A controller to manage those placeholder Pods could make a great addition to the Kubernetes project.
This feature already exists and does not even need a (new) controller. A Deployment managing N right-sized, low-priority pause 'placeholder' pods should keep the nodes busy. However, you probably want to add scheduled scale up/downs to avoid having these placeholder pods run 24/7 (and rack up cost). That's DIY. For the basics, have a look at https://github.com/codecentric/cluster-overprovisioner/ (not my project).
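A minimal sketch of that pattern (sizes, names, and the priority value are illustrative; see the cluster-overprovisioner project above for a packaged version):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                      # lower than any real workload, so these pods are preempted first
globalDefault: false
description: "Placeholder pods providing spare capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioner
spec:
  replicas: 3                   # size of the capacity buffer
  selector:
    matchLabels:
      app: overprovisioner
  template:
    metadata:
      labels:
        app: overprovisioner
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"          # right-size to the headroom you want per placeholder
              memory: 2Gi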
Putting things into an API that will graduate to stable has costs as well as benefits.
I totally agree. However, I believe the Karpenter people have ample experience adding features and later on extending/replacing them, so let's see what the future will bring.
Putting things into an API that will graduate to stable has costs as well as benefits.
Why not use the v0 version to test things out before being blocked by v1? If it works and solves everyone's pains then we keep it; if we need some improvements we can review it before v1.
Why not use the v0 version
Ah. We're at beta (API version v1beta1) and are planning to stabilize the API (v1).
Kubernetes has a particular approach to API version round trips, that we intend to follow. So adding things in even at beta has constraints.
@sftim there's no need to change the API; here is a direct and clear proposal:
Use the property Karpenter already has, consolidateAfter, and make it work when consolidationPolicy is set to WhenUnderutilized. It fits fine in the API from an architectural perspective.
Implement logic that enables users to use consolidateAfter when consolidationPolicy is set to WhenUnderutilized. The behavior of this logic would be the same as when consolidationPolicy is set to WhenEmpty: when consolidationPolicy is set to WhenEmpty and, for example, we set consolidateAfter to 30s, consolidation is delayed for 30s.
Adjust the comment here in the documentation. Change from:
...
disruption:
# Describes which types of Nodes Karpenter should consider for consolidation
# If using 'WhenUnderutilized', Karpenter will consider all nodes for consolidation and attempt to remove or replace Nodes when it discovers that the Node is underutilized and could be changed to reduce cost
# If using `WhenEmpty`, Karpenter will only consider nodes for consolidation that contain no workload pods
consolidationPolicy: WhenUnderutilized | WhenEmpty
# The amount of time Karpenter should wait after discovering a consolidation decision
# This value can currently only be set when the consolidationPolicy is 'WhenEmpty'
# You can choose to disable consolidation entirely by setting the string value 'Never' here
consolidateAfter: 30s
...
to
...
disruption:
# Describes which types of Nodes Karpenter should consider for consolidation
# If using 'WhenUnderutilized', Karpenter will consider all nodes for consolidation and attempt to remove or replace Nodes when it discovers that the Node is underutilized and could be changed to reduce cost
# If using `WhenEmpty`, Karpenter will only consider nodes for consolidation that contain no workload pods
consolidationPolicy: WhenUnderutilized | WhenEmpty
# The amount of time Karpenter should wait after discovering a consolidation decision
# You can choose to disable consolidation entirely by setting the string value 'Never' here
consolidateAfter: 30s
...
After upgrading to v0.32.8 from v0.31.0, I see more aggressive node launching and termination with the same request pattern. I followed https://karpenter.sh/v0.32/upgrading/v1beta1-migration/#ttlsecondsafterempty to switch the v1alpha ttlSecondsAfterEmpty to the new v1beta1 settings:
consolidationPolicy: WhenEmpty
consolidateAfter: 2m
However, no matter how I set the consolidateAfter value, even up to 60m, the node gets terminated quickly, which makes reusing a running node impossible.
I created an issue here - https://github.com/aws/karpenter-provider-aws/issues/5938. Starting March 26th, there are an increasing number of pending pods. This upgrade makes the scheduler less responsive due to unnecessary nodes registering with and leaving the cluster.
Does anyone have a suggestion regarding this problem? The configuration and logs are in the issue linked above.
Thanks
+1 to make consolidateAfter work with WhenUnderutilized!
Thanks @Luke-Smartnews for your #992, I hope it's merged soon!
Tell us about your request
We have a cluster where there are a lot of cron jobs which run every 5 minutes...
This means we have 5 nodes for our base workloads and every 5 minutes we get additional nodes for 2-3 minutes which are scaled down or consolidated with existing nodes.
This leads to a constant flow of nodes joining and leaving the cluster. It looks like the Docker image pulls and node initialization create more network traffic fees than the cost reduction from not having the instances running all the time.
It would be great if we could configure some kind of consolidation period, maybe together with ttlSecondsAfterEmpty, which would only clean up or consolidate nodes if the capacity had been idling for x amount of time.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Creating a special provisioner is quite time consuming because all app deployments have to be changed to leverage it...
Are you currently working around this issue?
We are thinking about putting cron jobs onto a special provisioner which would not use consolidation but rather the ttlSecondsAfterEmpty feature.
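A sketch of that workaround, assuming the v1alpha5 Provisioner API this issue was opened against (the name, label, and values are illustrative):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: cronjobs
spec:
  # no consolidation block; rely on the empty-node TTL instead
  ttlSecondsAfterEmpty: 600     # only reclaim nodes that have been empty for 10 minutes
  labels:
    workload-type: cronjobs     # cron job pods would target this via nodeSelector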