Karpenter should consolidate while provisioning new instances

cnmcavoy commented 1 month ago

Description

What problem are you trying to solve? Karpenter's provisioner considers only the currently pending pods when deciding which new instances to launch. When workloads arrive in a continuous trickle instead of all at once, Karpenter frequently ends up preferring smaller 2xlarge and 4xlarge instances. Eventually, Karpenter's multi-node consolidation begins replacing these instances with 8xlarge, 12xlarge, 18xlarge, etc., but this "settling" process takes days or weeks depending on the cluster's depth, disruption budgets, etc. In the end, every pod is disrupted to move onto a larger instance size, possibly multiple times if multi-node consolidation has to make repeated passes.

Karpenter should examine the cluster state before provisioning, so that it can consolidate and provision a larger node from the outset. This would avoid later disruptions from multi-node consolidation.
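
To make the trickle effect concrete, here is a minimal Python sketch. It is not Karpenter's actual scheduling code: the instance catalog, the prices, and the `provision` function are all hypothetical, and it assumes CPU is the only resource and that each batch of pending pods fits on one instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu: int
    hourly_cost: float


# Hypothetical catalog; real instance sizes and prices differ.
CATALOG = [
    InstanceType("2xlarge", 8, 0.40),
    InstanceType("4xlarge", 16, 0.80),
    InstanceType("8xlarge", 32, 1.60),
]


def cheapest_fit(cpu_needed):
    """Return the cheapest hypothetical instance type with enough CPU."""
    fits = [t for t in CATALOG if t.cpu >= cpu_needed]
    if not fits:
        raise ValueError("batch does not fit on any single instance type")
    return min(fits, key=lambda t: t.hourly_cost)


def provision(batches):
    """Simulate provisioning that only looks at currently pending pods.

    Each batch is a list of pod CPU requests that become pending together.
    Pods are first placed on existing nodes with free CPU; whatever is left
    over gets one new node sized for just those leftover pods.
    """
    nodes = []  # each entry is [InstanceType, free_cpu]
    for batch in batches:
        pending = []
        for cpu in batch:
            for node in nodes:
                if node[1] >= cpu:
                    node[1] -= cpu
                    break
            else:
                pending.append(cpu)
        if pending:
            itype = cheapest_fit(sum(pending))
            nodes.append([itype, itype.cpu - sum(pending)])
    return [node[0].name for node in nodes]


# Three 5-CPU pods arriving one at a time vs. all at once.
print(provision([[5], [5], [5]]))  # ['2xlarge', '2xlarge', '2xlarge']
print(provision([[5, 5, 5]]))      # ['4xlarge']
```

With these made-up prices the trickle case costs 1.20 per hour against 0.80 for the single 4xlarge, which is the gap that consolidation later has to close by disrupting pods.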

How important is this feature to you? It would reduce the number of unnecessary disruptions for clusters.

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

njtran commented 1 month ago

Karpenter should examine the cluster state before provisioning, so that it can consolidate and provision a larger node from the outset. This would avoid later disruptions from multi-node consolidation.

This is definitely a reasonable ask. Our current algorithms do:

Provisioning - Greedy, best effort, fast as possible
Consolidation - comprehensive, cautious, maximize availability

This lets us minimize pod startup latency and maximize availability, prioritizing both over cost. One thing you've highlighted here and before is that there's no coordination between the two to make fewer total decisions and minimize pod churn.
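
For readers following along, the consolidation side can be sketched roughly as below. This is only an illustration, not the real multi-node consolidation logic: the `CATALOG` prices and the `consolidation_candidate` helper are hypothetical, and it assumes CPU is the only constraint.

```python
# Hypothetical catalog: instance type name -> (cpu, hourly cost).
CATALOG = {
    "2xlarge": (8, 0.40),
    "4xlarge": (16, 0.80),
    "8xlarge": (32, 1.60),
}


def consolidation_candidate(nodes):
    """Return a single cheaper instance type that could replace `nodes`.

    `nodes` is a list of (instance type name, [pod CPU requests]) tuples.
    Returns None when no single type both fits every pod and costs less
    than the nodes it would replace.
    """
    total_cpu = sum(sum(pods) for _, pods in nodes)
    current_cost = sum(CATALOG[name][1] for name, _ in nodes)
    fits = [
        (cost, name)
        for name, (cpu, cost) in CATALOG.items()
        if cpu >= total_cpu
    ]
    if not fits:
        return None
    cost, name = min(fits)
    return name if cost < current_cost else None


# Three small nodes left behind by trickled provisioning.
print(consolidation_candidate([("2xlarge", [5])] * 3))   # 4xlarge
# Already on the cheapest fitting type, so nothing to do.
print(consolidation_candidate([("4xlarge", [5, 5, 5])]))  # None
```

Every non-None result here implies evicting all pods on the replaced nodes, which is the churn the issue is asking to avoid.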

I'd be interested to ideate on what you think are some good ways we could solve this!

njtran commented 1 month ago

/triage accepted

sftim commented 1 month ago

Idea: ant colony optimisation

Model Pods as agents that are placed onto Nodes but that can move to another node[claim] if they want to. Also model “hormones” such as: node cost effectiveness vs. the average, node memory capacity, node CPU capacity. Also have a signal for static and bare Pods so that the agent representing them resists migrating.
Model nodes and nodeclaims as, well, nodes in the ant colony. Use a hunger hormone to trigger node provisioning, and a spare capacity hormone to trigger (simulated) NodeClaim removal. Potentially, use a hormone for NodeClaims to encourage migration away from these nodes (accounting for the fact that there may be nodes Karpenter can't control, and we'd like to ensure we bin-pack to those other nodes). Iterate the model until most agents have found a placement they are happy with.
I think I would even model required constraints as a preference; an agent could consider migrating to a node that doesn't pass required scheduling constraints, but would find itself with a low preference score.
Then run a pass to check that our models of kube-scheduler and kubelet wouldn't preempt or evict the Pods as placed.
- If the simulation doesn't pass, use the existing node provisioning code or something like it. Then maybe rerun.
- Maybe also simulate whether consolidation would trigger before accepting the solution.

Essentially, we produce a plan that Karpenter thinks allows the scheduler to find a solution; we don't check or care that the scheduler finds the same solution, so long as it costs what we thought it would cost. If the scheduler does a worse job than our ant agents, Karpenter can react to that by provisioning more nodes once the situation is clear.
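
A very rough Python sketch of the "agents migrate until they are happy" loop follows. It is a deterministic greedy simplification, not a real ant colony optimisation: there is a single hypothetical signal (how full a node would be), CPU capacity is treated as a hard limit rather than a preference, and `Pod`, `settle`, and `empty_nodes` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    cpu: int
    pinned: bool = False  # static/bare Pods resist migrating


def settle(pods, capacity, placement, max_rounds=100):
    """Let each unpinned pod move to the node it finds most attractive.

    `capacity` maps node name -> CPU; `placement` maps pod name -> node name.
    The only signal modelled is utilisation: a pod prefers the node that
    would be fullest after it arrives. Stops when no pod wants to move.
    """
    placement = dict(placement)
    for _ in range(max_rounds):
        moved = False
        for pod in pods:
            if pod.pinned:
                continue
            here = placement[pod.name]
            used = {node: 0 for node in capacity}
            for p in pods:
                used[placement[p.name]] += p.cpu
            best, best_score = here, used[here] / capacity[here]
            for node in capacity:
                if node == here or used[node] + pod.cpu > capacity[node]:
                    continue
                score = (used[node] + pod.cpu) / capacity[node]
                if score > best_score:
                    best, best_score = node, score
            if best != here:
                placement[pod.name] = best
                moved = True
        if not moved:
            break
    return placement


def empty_nodes(capacity, placement):
    """Nodes with no pods left: candidates for (simulated) removal."""
    return sorted(set(capacity) - set(placement.values()))


pods = [Pod("a", 3), Pod("b", 2), Pod("c", 3), Pod("d", 2, pinned=True)]
capacity = {"n1": 8, "n2": 8, "n3": 8}
start = {"a": "n1", "b": "n1", "c": "n2", "d": "n3"}

plan = settle(pods, capacity, start)
print(plan)                         # {'a': 'n2', 'b': 'n2', 'c': 'n2', 'd': 'n3'}
print(empty_nodes(capacity, plan))  # ['n1']
```

The scheduler and kubelet validation pass and the extra signals (cost effectiveness, memory, hunger) are left out; they would plug in as additional terms in the score.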

sftim commented 1 month ago

Idea: imaginary Pods

When Karpenter is idle (e.g. no unschedulable Pods, no recent consolidations, no identified grounds to consolidate), search for Pods' owners (e.g. ReplicaSet). Find the top n owning objects. For each owner that survives the filter, simulate creating a surge Pod with a plausible node selector (pick an existing Pod and duplicate it). Place that new Pod in a simulation of an enlarged cluster, then consolidate the result. Cache the outcome as a hint for a NodeClaim solution. Abort early if Karpenter finds it has actual work to do.
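
The first half of that idea (picking owners and fabricating surge Pods) might look something like this Python sketch. Everything in it is an assumption for illustration: the `Pod` shape, the `surge_pods` function, the `has_real_work` callback, and ranking owners by Pod count are not taken from Karpenter. The simulation, consolidation, and hint caching steps are not shown.

```python
from collections import Counter
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pod:
    name: str
    owner: str  # e.g. "ReplicaSet/web"
    cpu: int
    node_selector: tuple = ()


def surge_pods(pods, top_n, has_real_work=lambda: False):
    """Create one imaginary surge Pod for each of the top_n owners.

    Owners are ranked by how many Pods they currently own (an assumption).
    Each surge Pod duplicates an existing Pod of that owner, so it inherits
    a plausible node selector. Returns [] if real work shows up midway.
    """
    counts = Counter(p.owner for p in pods)
    surges = []
    for owner, _count in counts.most_common(top_n):
        if has_real_work():
            return []  # abort early: Karpenter has actual work to do
        template = next(p for p in pods if p.owner == owner)
        surges.append(replace(template, name=f"{template.name}-surge"))
    return surges


pods = [
    Pod("web-1", "ReplicaSet/web", 2, (("zone", "a"),)),
    Pod("web-2", "ReplicaSet/web", 2, (("zone", "a"),)),
    Pod("web-3", "ReplicaSet/web", 2, (("zone", "a"),)),
    Pod("api-1", "ReplicaSet/api", 4),
    Pod("api-2", "ReplicaSet/api", 4),
    Pod("job-1", "Job/batch", 1),
]

for pod in surge_pods(pods, top_n=2):
    print(pod.name, pod.owner, pod.cpu)
```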

kubernetes-sigs / karpenter