kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Karpenter should consolidate while provisioning new instances #1466

Open cnmcavoy opened 1 month ago

cnmcavoy commented 1 month ago

Description

What problem are you trying to solve? Karpenter's provisioner only considers the currently pending pods when deciding which new instances to launch. When workloads arrive in a continuous trickle instead of all at once, Karpenter frequently prefers smaller 2xlarge and 4xlarge instances. Eventually, Karpenter's multi-node consolidation begins replacing these instances with 8xlarge, 12xlarge, 18xlarge, etc., but this process of "settling" takes days or weeks depending on the cluster's depth, disruption budgets, etc. In the end, all the pods are disrupted to move onto a larger instance size, possibly multiple times if Karpenter has to make repeated passes of multi-node consolidation.
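The trickle effect described above can be reproduced with a toy model. This is a minimal sketch, not Karpenter's actual provisioning code: the catalog, its vCPU counts and prices, and the `provision` and `run` helpers are all hypothetical, and pods are reduced to a single CPU request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu: int      # vCPUs available for pods
    price: float  # hypothetical hourly price


# Hypothetical catalog; real instance families and prices differ.
CATALOG = [
    InstanceType("2xlarge", 8, 0.40),
    InstanceType("4xlarge", 16, 0.80),
    InstanceType("8xlarge", 32, 1.60),
]


def provision(pending):
    """Launch nodes for the pending pods only, ignoring the rest of the cluster."""
    nodes = []
    pending = sorted(pending, reverse=True)
    largest = max(CATALOG, key=lambda t: t.cpu)
    while pending:
        if pending[0] > largest.cpu:
            raise ValueError("pod does not fit on any instance type")
        # Cheapest type that holds everything still pending, else the largest type.
        fits = [t for t in CATALOG if t.cpu >= sum(pending)]
        chosen = min(fits, key=lambda t: t.price) if fits else largest
        free, left = chosen.cpu, []
        for cpu in pending:
            if cpu <= free:
                free -= cpu
            else:
                left.append(cpu)
        nodes.append([chosen.name, free])
        pending = left
    return nodes


def run(arrivals):
    """arrivals: list of batches; each batch lists the CPU requests that go pending together."""
    nodes = []  # each entry is [type_name, free_cpu]
    for batch in arrivals:
        pending = []
        for cpu in batch:
            for node in nodes:  # reuse free capacity on existing nodes first
                if node[1] >= cpu:
                    node[1] -= cpu
                    break
            else:
                pending.append(cpu)
        if pending:
            nodes.extend(provision(pending))
    return [name for name, _ in nodes]
```

With five 6-vCPU pods, `run([[6]] * 5)` (one pod at a time) yields five `2xlarge` nodes, while `run([[6] * 5])` (all at once) yields a single `8xlarge`.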

Karpenter should examine the cluster state before provisioning, then consolidate and provision a larger node from the outset. This would avoid the later disruptions caused by multi-node consolidation.

How important is this feature to you? It would reduce the number of unnecessary disruptions for clusters.

njtran commented 1 month ago

Karpenter should examine the cluster state before provisioning, then consolidate and provision a larger node from the outset. This would avoid the later disruptions caused by multi-node consolidation.

This is definitely a reasonable ask. Our current algorithms do:

  1. Provisioning - Greedy, best effort, fast as possible
  2. Consolidation - comprehensive, cautious, maximize availability

This lets us minimize pod startup latency and maximize availability, prioritizing both over cost. One thing you've highlighted here and before is that there's no coordination between the two to make fewer total decisions and minimize pod churn.
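To make the cost of the missing coordination concrete, here is a rough sketch of a multi-node consolidation check that counts how many pods a replacement would disrupt. It is an illustration under stated assumptions, not Karpenter's implementation: the `PRICES` table and the `consolidate` function are hypothetical, and only CPU is modelled.

```python
# Hypothetical catalog: type name -> (vCPUs, hourly price).
PRICES = {
    "2xlarge": (8, 0.40),
    "4xlarge": (16, 0.80),
    "8xlarge": (32, 1.60),
}


def consolidate(nodes):
    """Try to replace all given nodes with one cheaper node.

    nodes: list of (type_name, [pod_cpu, ...]).
    Returns (replacement_type or None, number_of_pods_disrupted).
    """
    pods = [cpu for _, node_pods in nodes for cpu in node_pods]
    current_cost = sum(PRICES[name][1] for name, _ in nodes)
    # A candidate must hold every pod and cost less than the nodes it replaces.
    candidates = [
        (price, name)
        for name, (cpu, price) in PRICES.items()
        if cpu >= sum(pods) and price < current_cost
    ]
    if not candidates:
        return None, 0
    _, best = min(candidates)
    # Every pod on the replaced nodes has to move.
    return best, len(pods)
```

For five `2xlarge` nodes each running one 6-vCPU pod, `consolidate([("2xlarge", [6])] * 5)` returns `("8xlarge", 5)`: the cluster ends up cheaper, but all five pods are disrupted to get there.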

I'd be interested to ideate on what you think are some good ways we could solve this!

njtran commented 1 month ago

/triage accepted

sftim commented 1 month ago

Idea: ant colony optimisation

Essentially, we produce a plan that Karpenter thinks allows the scheduler to find a solution; we don't check or care that the scheduler finds the same solution, so long as it costs what we thought it would cost. If the scheduler does a worse job than our ant agents, Karpenter can react to that by provisioning more nodes once the situation is clear.
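The comment does not spell out the algorithm, so the following is only one possible reading of it: a small ant colony optimisation loop that searches for a cheap set of node types for a list of pod CPU requests. Everything here is an assumption made for illustration, including the `CATALOG` prices, the `cpu / price` heuristic, and the parameter defaults.

```python
import random

# Hypothetical catalog: type name -> (vCPUs, hourly price).
CATALOG = {"2xlarge": (8, 0.40), "4xlarge": (16, 0.78), "8xlarge": (32, 1.50)}


def plan_cost(plan):
    return sum(CATALOG[name][1] for name in plan)


def build_plan(pods, pheromone, rng):
    """One ant: add node types, biased by pheromone, until every pod is placed."""
    remaining = sorted(pods, reverse=True)
    plan = []
    while remaining:
        # Only types that can hold the largest remaining pod are eligible.
        names = [n for n, (cpu, _) in CATALOG.items() if cpu >= remaining[0]]
        if not names:
            raise ValueError("pod does not fit on any instance type")
        # Weight = pheromone * heuristic (vCPUs per unit of price).
        weights = [pheromone[n] * CATALOG[n][0] / CATALOG[n][1] for n in names]
        name = rng.choices(names, weights=weights)[0]
        free, unplaced = CATALOG[name][0], []
        for cpu in remaining:  # first-fit decreasing into the new node
            if cpu <= free:
                free -= cpu
            else:
                unplaced.append(cpu)
        remaining = unplaced
        plan.append(name)
    return plan


def ant_colony(pods, ants=20, rounds=30, evaporation=0.1, seed=0):
    """Return (best_plan, best_cost) found by the colony."""
    rng = random.Random(seed)
    pheromone = {name: 1.0 for name in CATALOG}
    best = None
    for _ in range(rounds):
        plans = [build_plan(pods, pheromone, rng) for _ in range(ants)]
        round_best = min(plans, key=plan_cost)
        if best is None or plan_cost(round_best) < plan_cost(best):
            best = round_best
        for name in pheromone:  # evaporation
            pheromone[name] *= 1.0 - evaporation
        for name in best:       # reinforce the best plan seen so far
            pheromone[name] += 1.0 / plan_cost(best)
    return best, plan_cost(best)
```

The returned plan only promises enough capacity at a known price. As in the comment above, whether the real scheduler packs pods the same way is not checked.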

sftim commented 1 month ago

Idea: imaginary Pods

When Karpenter is idle (e.g., no unschedulable Pods, no recent consolidations, no identified grounds to consolidate), search for Pods' owners (e.g., a ReplicaSet). Find the top n owning objects. For each owner that survives the filter, simulate creating a surge Pod with a plausible node selector (pick an existing Pod and duplicate it). Place that new Pod in a simulation of an enlarged cluster, then consolidate the result. Cache the outcome as a hint for a NodeClaim solution. Abort early if Karpenter finds it has actual work to do.
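The steps above could be sketched roughly as follows. This is a heavily simplified illustration, not a proposed implementation: pods are plain dicts, "top n" is assumed to mean the owners with the most pods, the simulated consolidation is reduced to "cheapest single node that holds everything", and the names `top_owners`, `surge_pods`, `compute_hint`, and `CATALOG` are all hypothetical.

```python
from collections import Counter

# Hypothetical catalog: type name -> (vCPUs, hourly price).
CATALOG = {"2xlarge": (8, 0.40), "4xlarge": (16, 0.80), "8xlarge": (32, 1.60)}


def top_owners(pods, n):
    """Owners with the most pods (assumed meaning of "top n")."""
    counts = Counter(pod["owner"] for pod in pods)
    return [owner for owner, _ in counts.most_common(n)]


def surge_pods(pods, n):
    """For each top owner, duplicate one of its existing pods as an imaginary pod."""
    surge = []
    for owner in top_owners(pods, n):
        template = next(pod for pod in pods if pod["owner"] == owner)
        surge.append({**template, "name": template["name"] + "-imaginary"})
    return surge


def compute_hint(pods, n, is_idle=lambda: True, cache=None):
    """Return a node-type hint for the enlarged cluster, or None if aborted."""
    if not is_idle():  # abort early when there is real work to do
        return None
    imagined = pods + surge_pods(pods, n)
    total_cpu = sum(pod["cpu"] for pod in imagined)
    # Stand-in for "simulate, then consolidate": cheapest single node that fits.
    fits = [(price, name) for name, (cpu, price) in CATALOG.items() if cpu >= total_cpu]
    hint = min(fits)[1] if fits else None
    if cache is not None:
        cache["hint"] = hint
    return hint
```

With two 6-vCPU pods owned by one ReplicaSet and one 4-vCPU pod owned by another, the real pods fit a `4xlarge` exactly, but `compute_hint(pods, 1)` returns `"8xlarge"` because the imaginary surge pod pushes the total to 22 vCPUs.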