kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Price Improvement Threshold #1440

Open wmgroot opened 1 month ago

wmgroot commented 1 month ago

Description

One piece of implementation for RFC #1433 (Consolidation Policies)

What problem are you trying to solve? Make it possible to balance the cost of disrupting pods against the cost of using a more expensive instance type.

Currently, single-node consolidation will replace any instance with a cheaper one that can house the same pods, resulting in very frequent pod disruption. Disrupting pods has its own cost, which for certain workloads can negate the savings or result in a net loss of money.

This concept has already been documented as an improvement for spot consolidation, but I see no reason why it wouldn't be useful for generalized consolidation. https://github.com/kubernetes-sigs/karpenter/blob/main/designs/spot-consolidation.md#2-price-improvement-factor

How important is this feature to you? Very important. We are currently running a fork of Karpenter with single-node consolidation completely disabled due to the frequent node disruption we are seeing for small price improvements (individual pods are frequently being disrupted multiple times an hour).
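The price-improvement-factor idea referenced above can be sketched as a simple guard on the consolidation decision. This is a hypothetical illustration, not Karpenter's actual consolidation code; the function name and signature are made up for this example.

```go
package main

import "fmt"

// shouldConsolidate sketches the proposed threshold: only replace a node
// when the candidate's price beats the current price by at least the
// configured improvement factor (e.g. 1.1 means "at least 10% cheaper").
// With a factor of 1.0 this degenerates to today's behavior, where any
// cheaper instance triggers a replacement.
func shouldConsolidate(currentPrice, replacementPrice, priceImprovementFactor float64) bool {
	return replacementPrice*priceImprovementFactor < currentPrice
}

func main() {
	// A 5% saving does not clear a 10% improvement threshold...
	fmt.Println(shouldConsolidate(1.00, 0.95, 1.1)) // false
	// ...but a 15% saving does.
	fmt.Println(shouldConsolidate(1.00, 0.85, 1.1)) // true
}
```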

ellistarn commented 1 month ago

Just brainstorming:

I'm curious if we might want to reason about this as spec.disruption.cost. We factor in a percentage of the cost of the node when making consolidation decisions. We might use this to sort nodes, and favor smaller nodes for consolidation. It also happens to directly map to priceImprovementFactor: 1.1 == disruption.cost: .1. Another thing to think about is that Kubernetes APIs don't allow floats, so we may need to do something like priceImprovementPercent: 110 or disruption.costPercent: 10.
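The integer-percent framing above can be sketched as follows. This is only a rough illustration of the brainstormed `disruption.costPercent` semantics; the field and function names are hypothetical, and the mapping to `priceImprovementFactor` is approximate rather than exact.

```go
package main

import "fmt"

// worthReplacing sketches the hypothetical disruption.costPercent idea:
// with costPercent = 10, the replacement must save more than 10% of the
// current node's price before consolidation fires. Using an integer
// percent sidesteps the Kubernetes API's lack of float fields.
func worthReplacing(currentPrice, replacementPrice float64, costPercent int64) bool {
	savings := currentPrice - replacementPrice
	// Model disruption as costing a fixed fraction of the node's price.
	disruptionCost := currentPrice * float64(costPercent) / 100.0
	return savings > disruptionCost
}

func main() {
	// disruption.costPercent: 10 roughly maps to priceImprovementFactor 1.1:
	fmt.Println(worthReplacing(1.00, 0.89, 10)) // 0.11 saved > 0.10 cost -> true
	fmt.Println(worthReplacing(1.00, 0.95, 10)) // 0.05 saved < 0.10 cost -> false
}
```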

wmgroot commented 1 month ago

To check my understanding of disruption.costPercent: 10, you're effectively giving users a way to say, "I'm losing 10% of what I'm paying for this node, so make sure the savings for the replacement node exceed 10% before replacing this node"?

If I'm understanding that correctly, it sounds to me like it would be better placed on an individual NodeClaim spec rather than in NodePool.spec.disruption.

ellistarn commented 1 month ago

If I'm understanding that correctly, it sounds to me like it would be better placed on an individual NodeClaim spec rather than in NodePool.spec.disruption.

This is a good point. Similar line of reasoning to TGP and expireAfter.

If we model it as priceImprovementFactor, then it might be considered a property of the NodePool. I think it makes more sense at the NodeClaim level, though. Given this, how would you expect it to interact with drift?

wmgroot commented 1 month ago

Additionally, it could support either fixed or percent values via IntOrString, such as:

NodeClaim.spec.disruptionCost: 10%
NodeClaim.spec.disruptionCost: 20 (effectively saying a one-time disruption of this node would cost $20)

The problem with fixed-value disruption costs is that Karpenter has no way to predict how long the node would be around in the future, so it can't gauge whether a one-time $20 cost is worth it. Maybe it could use historical data on node age to make a prediction.
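The IntOrString idea above might look something like the following sketch. The field name (`spec.disruptionCost`), its string encoding, and the amortization over an expected-lifetime estimate are all hypothetical, per the open questions in this thread.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// disruptionThreshold converts a hypothetical NodeClaim.spec.disruptionCost
// value into an hourly cost that consolidation savings must exceed.
// A value like "10%" is read as a fraction of the node's hourly price;
// a bare number like "20" is read as a fixed one-time dollar cost.
func disruptionThreshold(disruptionCost string, nodeHourlyPrice, expectedLifetimeHours float64) (float64, error) {
	if strings.HasSuffix(disruptionCost, "%") {
		pct, err := strconv.ParseFloat(strings.TrimSuffix(disruptionCost, "%"), 64)
		if err != nil {
			return 0, err
		}
		// Percent values scale with the node's own price.
		return nodeHourlyPrice * pct / 100.0, nil
	}
	fixed, err := strconv.ParseFloat(disruptionCost, 64)
	if err != nil {
		return 0, err
	}
	// A fixed one-time cost must be amortized over however long the
	// replacement is expected to live -- which, as noted above, Karpenter
	// cannot know without some historical estimate of node age.
	return fixed / expectedLifetimeHours, nil
}

func main() {
	t, _ := disruptionThreshold("10%", 2.00, 0)
	fmt.Println(t) // 10% of $2.00/hr
	t, _ = disruptionThreshold("20", 0, 100)
	fmt.Println(t) // $20 amortized over 100 hours
}
```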

wmgroot commented 1 month ago

Given this, how would you expect it to interact with drift?

I expect drift not to take this feature into account at all. If the node needs to be replaced, it needs to be replaced regardless of cost. When a new NodeClaim is provisioned to replace each node, the usual provisioning logic should still choose the cheapest available, which could result in minor cost improvements that are below the configured threshold.

njtran commented 4 weeks ago

/triage accepted

njtran commented 4 weeks ago

I've marked this as accepted, but I think we definitely need to talk this out. Excited to discuss more about this