kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.

Exponential / logarithmic decay for cluster desired size #696

Open sftim opened 1 year ago

sftim commented 1 year ago

Tell us about your request

When Karpenter is running more node capacity than the cluster requires, use an exponential decay (i.e., something with a half-life) rather than dropping the desired capacity instantly.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

As a cluster operator, when my workloads scale in on my cluster, I want to preserve capacity, so that short-term drops in workload scale don't disrupt service.

I'm suggesting exponential decay because it's easy to implement with two fields (e.g. in the .status of each Provisioner):

  1. the most recent, post-decay, value
  2. a timestamp for that value, either with subsecond precision, or with the value scaled to match the timestamp at the beginning of a second

With some fairly simple math, you can then evaluate the decayed value for any subsequent instant. You can write it back into the status (e.g. using a JSON Patch), and you can act on it as well.
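
A minimal sketch of that evaluation, assuming a configurable half-life; the `DecayStatus` type and its field names here are hypothetical, not an existing Karpenter API:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// DecayStatus holds the two proposed fields. The names are illustrative only.
type DecayStatus struct {
	LastValue     float64   // most recent, post-decay, desired size
	LastTimestamp time.Time // instant at which LastValue was recorded
}

// ValueAt evaluates the decayed desired size at time t:
// value(t) = LastValue * 0.5^(elapsed / halfLife).
func (s DecayStatus) ValueAt(t time.Time, halfLife time.Duration) float64 {
	elapsed := t.Sub(s.LastTimestamp)
	if elapsed <= 0 {
		return s.LastValue
	}
	return s.LastValue * math.Pow(0.5, elapsed.Seconds()/halfLife.Seconds())
}

func main() {
	s := DecayStatus{LastValue: 100, LastTimestamp: time.Now()}
	halfLife := 10 * time.Minute
	// One half-life later the decayed desired size is 50. A controller could
	// write that back into .status (e.g. with a JSON Patch) and act on it.
	fmt.Printf("%.1f\n", s.ValueAt(s.LastTimestamp.Add(halfLife), halfLife))
}
```

Because only the post-decay value and its timestamp are stored, any observer can recompute the current value without replaying history.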

This might better support:

Alternative

Rather than exponential decay, use another function such as logarithmic decay. That would hold the instance count for a duration and then let it drop off, which might better fit cases where cluster operators want to minimize instance terminations.
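
One illustrative curve with that hold-then-drop shape (the piecewise form and the symbols are mine, not part of the proposal): keep the value flat for a hold period, then decay it with a half-life:

$$
v(t) = \begin{cases} v_0 & t < T_{\text{hold}} \\ v_0 \cdot 2^{-(t - T_{\text{hold}})/T_{1/2}} & t \ge T_{\text{hold}} \end{cases}
$$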

Are you currently working around this issue?

(e.g.) scaleDown policies on HorizontalPodAutoscaler. However, these affect single workloads. A correlated scale-in could still take away node capacity that I, as a cluster operator, know will take time to reprovision if needed.
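
For reference, that workaround is the `behavior.scaleDown` stanza of the autoscaling/v2 HorizontalPodAutoscaler; all names and values below are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example
  minReplicas: 1
  maxReplicas: 100
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # act on the highest recommendation seen in the last 5m
      policies:
      - type: Percent
        value: 10         # remove at most 10% of current replicas
        periodSeconds: 60 # per minute
```

As noted, this shapes scale-in per workload, not for the node fleet as a whole.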

Additional Context

Also see https://kubernetes.slack.com/archives/C02SFFZSA2K/p1685980025031979?thread_ts=1685960637.488689&cid=C02SFFZSA2K

sftim commented 12 months ago

https://github.com/aws/karpenter-core/issues/735 adds a user story relevant to this: minimizing the AWS Config costs from frequent provisioning / termination cycles for EC2 instances.

njtran commented 8 months ago

Thinking about this from the perspective of disruption budgets: could this be implemented by a budget with a percentage?

Let's say I had 1000 nodes in my cluster, and let's say they're all empty, meaning that the desired state would be to scale to 0. With a disruption budget of 10%, you could achieve the same exponential decay by effectively scaling down the cluster in progressively smaller batches, eventually scaling down to 0.

1000 (-100) -> 900 (-90) -> 810 (-81) -> 729 (-73) -> 656 (-66) -> 590 -> ... -> 0

This effectively solves the problem of exponential decay, in my eyes. @sftim thoughts?

One consideration is that this drifts from perfectly exponential the more heterogeneous the instance sizes are. Yet, the super nice part is that this effectively gets solved for free with an already existing design/implementation in progress.
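
A quick sketch of the arithmetic behind that sequence, assuming the 10% budget is re-evaluated each round against the remaining node count (rounding to nearest, as the numbers above do):

```go
package main

import (
	"fmt"
	"math"
)

// scaleDownSchedule simulates repeated disruption rounds in which each round
// removes at most `percent` of the remaining nodes.
func scaleDownSchedule(nodes int, percent float64) []int {
	steps := []int{nodes}
	for nodes > 0 {
		batch := int(math.Round(float64(nodes) * percent / 100))
		if batch < 1 {
			batch = 1 // keep making progress once the percentage rounds to zero
		}
		nodes -= batch
		steps = append(steps, nodes)
	}
	return steps
}

func main() {
	// Emits 1000, 900, 810, 729, 656, 590, ..., 0 — matching the sequence above.
	fmt.Println(scaleDownSchedule(1000, 10))
}
```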

sftim commented 8 months ago

There are two shapes for decay. For scale-in, these are:

  1. big steps first, then smaller and smaller steps (exponential)
  2. small reductions at first, then bigger and bigger steps (logarithmic)

I actually think the second case is more relevant. People want to keep nodes around in case the load comes back, but eventually they still want their monthly bill to go down.

On the node size thing, we could implement this so that you specify the dimension you care about. For example, decay the total vCPU count for a NodePool. Or the node count, or the memory total. Maybe even the Pod capacity?
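
If a dimension like that were surfaced in the API, one hypothetical shape is below; the `DecaySpec` type and every field in it are invented for illustration and do not exist in Karpenter:

```go
package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// DecaySpec is a hypothetical sketch: it selects which quantity of a NodePool
// decays, and how quickly.
type DecaySpec struct {
	// Dimension is the quantity to decay: Nodes, CPU, Memory, or Pods.
	Dimension string `json:"dimension"`
	// HalfLife is the time for the preserved capacity headroom to halve.
	HalfLife metav1.Duration `json:"halfLife"`
}
```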

sftim commented 8 months ago

/retitle Exponential / logarithmic decay for cluster desired size

If we plan to implement just one of these, that could turn into a separate more specific issue.

njtran commented 8 months ago

On the node size thing, we could implement this so that you specify the dimension you care about. For example, decay the total vCPU count for a NodePool. Or the node count, or the memory total. Maybe even the Pod capacity?

This totally makes sense. There was some feedback that DisruptionBudgets should refer to more than just nodes, which seems super similar to this request.

big steps first, then smaller and smaller steps (exponential)
small reductions at first, then bigger and bigger steps (logarithmic)

I understand the use case in doing big steps first with progressively smaller steps, and that's naturally implemented with budgets.

What's the use case for doing smaller steps with progressively larger steps? That sounds like it would be something like 1000 -> 999 -> 997 -> 993 -> 985 -> 969 -> 937 -> 873 -> 745 -> 489 -> 0. While not impossible, I think this would be harder to model, since you have to be aware of previous steps to know the next step.
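
For concreteness, the deltas in that sequence double each round (1, 2, 4, ..., 256, then the 489 remainder), which is exactly why each step depends on the previous one. A sketch:

```go
package main

import "fmt"

// logSchedule reproduces the doubling-step shape above: each round removes
// twice as many nodes as the last, so the next step can't be computed
// without remembering the previous one.
func logSchedule(nodes int) []int {
	steps := []int{nodes}
	for batch := 1; nodes > 0; batch *= 2 {
		if batch > nodes {
			batch = nodes // final round removes whatever remains
		}
		nodes -= batch
		steps = append(steps, nodes)
	}
	return steps
}

func main() {
	// Emits 1000, 999, 997, 993, 985, 969, 937, 873, 745, 489, 0.
	fmt.Println(logSchedule(1000))
}
```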

sftim commented 8 months ago

Let's do the simpler thing then, with exponential decay.

sftim commented 8 months ago

scaling down the cluster in progressively smaller batches, eventually scaling down to 0

I do think it's nicer to scale in without the jaggedness this implies. Each time the desired size drops below the actual integer count of nodes, I think a cluster operator would hope to see a drain happening, and eventually an instance termination.

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Bryce-Soghigian commented 4 months ago

/remove-lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten