kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Spot -> Spot Consolidation #763

Closed: matti closed this issue 7 months ago

matti commented 1 year ago

Tell us about your request

Originally discussed in aws/karpenter#1091, but when the conversation started to be fruitful, "aws locked as resolved and limited conversation to collaborators".

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Using Karpenter costs a lot of money after a scale-up, when underutilized nodes are left behind running.

Are you currently working around this issue?

Various approaches discussed in #

Additional Context

No response

Attachments

No response

Community Note

Please don't close this; keep the conversation going.

tzneal commented 1 year ago

@matti From my comment on the other issue,

I'll be locking this issue as it was for the original consolidation feature and it's difficult to track new feature requests on closed issues. Feel free to open a new one specifically for spot replacement for that discussion.

I happened to see a notification for discussion on a closed issue, but we don't monitor them in any way and it's very likely that discussion on closed issues will get missed.

SohumB commented 1 year ago

To summarise the existing discussion, then: some users (including me!) see a need for spot workload consolidation. Not everyone would enable this if it existed, but some users are running workloads that are spiky enough to provision large machines, and then watch them be underutilised after the spike subsides.

This comment from @tzneal seems to capture the karpenter team's thought process on spot consolidation at the moment:

The original problem that forced us not to replace spot nodes with smaller spot nodes is that the straightforward approach eliminates the utility of the capacity-optimized-prioritized strategy we use for launching spot nodes. That strategy considers pool depth and leads to launching nodes that may be slightly bigger than necessary, but are less likely to be reclaimed.

If consolidation did replace spot nodes, we would often launch a spot node, get one that's slightly too big due to pool depth and interruption rate, and then immediately replace it with a cheaper instance type that has a higher interruption rate. This turns our capacity-optimized-prioritized strategy into a Karpenter-operated lowest-price strategy, which isn't recommended for use with spot.

This makes sense to me as a limitation of the simplest possible strategy, so let me outline what I would've naïvely expected Karpenter to do when asked for both spot instances and consolidation:

As I understand it, this would ensure that price alone wouldn't be the only signal causing a shift of node allocations, which would solve the concern of the interaction between the two systems not quite matching desired behaviour. Please let me know if I've missed something!

tzneal commented 1 year ago

compute a totally fresh capacity-optimized-prioritized bucketing with those pods and current spot prices

There is no API to do this without also actually launching the replacement instances, even if they're worse. We use CreateFleet and give it the flexibility to choose the best instance types; it does that and also launches the instances in parallel.

ellistarn commented 1 year ago

@tzneal I wonder if there's a heuristic to avoid race-to-the-bottom behavior, e.g. what if we only took this consolidation action if the price improved by x% (e.g. 25%) in this special case?

bwagner5 commented 1 year ago

It may be possible to know whether we'd get a "good enough" spot instance AND it's cheaper by using the Spot Placement Score. It's not a straightforward thing to do, though, and the call pattern we would need may not work. This is something we have been exploring a little bit.

tzneal commented 1 year ago

I've briefly investigated both. The problem with a heuristic is that I think you'd also have to add some throttle (e.g. don't replace a spot node unless it's an hour old). It's possible to request an instance and get one that's N% too big; the bigger N is, the smaller the chance, but I don't think the chance ever goes to zero. Without a throttle we would start creating an instance and then replacing it with the same instance type as quickly as possible. With a one-hour throttle you reduce that to needlessly cycling a node once an hour. It's a similar problem to using a tool that drains nodes when they reach X% utilization: without more information you don't know whether X% utilized was actually as optimal as possible given current conditions.

I looked at the placement score and can't find the rate limit information right now, but I remember it being pretty low. You could quickly use up your quota without successfully consolidating and then be in the same place, where we can't successfully check the spot placement score.

liorfranko commented 1 year ago

Any update regarding this?

ellistarn commented 1 year ago

Had a chat with @tzneal on our options here.

The problem

  1. We want to avoid "race to the bottom" Spot->Spot consolidation behavior, which causes unacceptable interruption rates
  2. We want to avoid launching spot nodes that are more expensive than their OD counterparts
  3. We need some sort of bounding mechanism to control our movement up and down the ladder

Relative OD/Spot Bounding

The core idea is to stop binpacking for both provisioning and consolidation once we find an OD instance that's more expensive than existing Spot options. Adding a Spot instance that's more expensive than an OD alternative is worse on both price and availability. From our discussion, this one wasn't controversial, and we should do this regardless of our choices below. It also completely solves https://github.com/aws/karpenter/issues/3044
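
To make this bound concrete, here is a minimal Go sketch using hypothetical types and made-up prices (not Karpenter's actual data structures or code): spot offerings priced above the cheapest on-demand candidate are dropped before binpacking, since they lose on both price and availability.

package main

import (
	"fmt"
	"math"
)

// Offering is a hypothetical, simplified view of one instance-type option.
type Offering struct {
	Name  string
	Spot  bool
	Price float64 // illustrative $/hour, not real prices
}

// boundSpotByOnDemand drops spot offerings that cost more than the cheapest
// on-demand offering in the candidate set. If there is no on-demand
// candidate, nothing is dropped.
func boundSpotByOnDemand(offerings []Offering) []Offering {
	cheapestOD := math.MaxFloat64
	for _, o := range offerings {
		if !o.Spot && o.Price < cheapestOD {
			cheapestOD = o.Price
		}
	}
	kept := make([]Offering, 0, len(offerings))
	for _, o := range offerings {
		if o.Spot && o.Price > cheapestOD {
			continue // pricier than on-demand and less available: never worth launching
		}
		kept = append(kept, o)
	}
	return kept
}

func main() {
	candidates := []Offering{
		{Name: "2xlarge-spot", Spot: true, Price: 0.17},
		{Name: "2xlarge-od", Spot: false, Price: 0.38},
		{Name: "odd-pool-spot", Spot: true, Price: 0.50}, // above the OD price, gets dropped
	}
	fmt.Println(boundSpotByOnDemand(candidates))
}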

Price vs Time Bounding

On the way down (spot->spot), we need something to stop us from replacing a PCO choice with a cheaper PCO choice, until we've converged on LowestPrice.

  1. We could only consider a spot node for consolidation after a certain amount of time (e.g. 2 hours)
  2. We could only consider a spot node for consolidation if the price is sufficiently better (e.g. by a factor of 2)

@tzneal prefers 1, while I prefer 2, though I think we're both open-minded on the details here. We should assume that these won't be configurable to start with, but customers may eventually require them to be. Conceptually, they're different ways to define the tradeoff between price and availability.

If we go with the price factor, it would interact with OD/Spot bounding. We would take the minimum of the two bounds.
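
As a rough sketch of how either bound might gate a spot-to-spot replacement (the threshold values are illustrative assumptions, not anything Karpenter actually ships):

package main

import (
	"fmt"
	"time"
)

const (
	minNodeAge     = 2 * time.Hour // option 1: leave spot nodes alone until they're this old
	minPriceFactor = 2.0           // option 2: replacement must be at least this many times cheaper
)

// shouldConsolidateSpot gates a spot->spot replacement with either bound.
// Prices are $/hour; which bound (or combination) to use is exactly what's
// being debated above, so accepting either one here is just one possible policy.
func shouldConsolidateSpot(nodeAge time.Duration, currentPrice, replacementPrice float64) bool {
	timeBoundOK := nodeAge >= minNodeAge
	priceBoundOK := replacementPrice > 0 && currentPrice/replacementPrice >= minPriceFactor
	return timeBoundOK || priceBoundOK
}

func main() {
	fmt.Println(shouldConsolidateSpot(30*time.Minute, 0.20, 0.18)) // false: node too young, savings too small
	fmt.Println(shouldConsolidateSpot(3*time.Hour, 0.20, 0.18))    // true: old enough to reconsider
	fmt.Println(shouldConsolidateSpot(10*time.Minute, 0.40, 0.15)) // true: well past the price factor
}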

tzneal commented 1 year ago

There's one more complication on spot-to-spot that I think affects both methods. Suppose your provisioner only supports 2xl, 4xl, and 8xl as instance types, and all are cheaper than the cheapest OD 2xl or within the price factor:

You currently have an 8xl spot node when a 2xl would suffice. When we do the replacement spot-to-spot consolidation launch, do we:

  a) Send all three types (2xl, 4xl, and 8xl), possibly getting an 8xl in return if it's still the most available, which we then immediately terminate.
  b) Send only the 2xl and 4xl types, and get whichever is better out of that limited set.

Option A tries to minimize the disruption and simulates what would happen if those pods just arrived: we would get the 8xl node.

Option B tries to minimize price, regardless of disruption, and diverges from what would happen if those pods were newly scheduled in that we artificially limit the instance types that PCO chooses from.

I think option A seems odd at first, but it makes the most sense given it's what would occur if those same pods were newly created and pending.
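
A small sketch of the difference between the two options, again with hypothetical types and made-up prices rather than Karpenter's real instance-type data: option A keeps the current type in the replacement request, while option B restricts it to strictly cheaper types.

package main

import "fmt"

// InstanceType is a hypothetical, simplified instance-type candidate.
type InstanceType struct {
	Name  string
	Price float64 // illustrative spot $/hour
}

// replacementCandidates builds the instance-type list for the replacement
// launch of a spot node. With includeCurrent=true (option A) the current
// type stays in the list, so the request mirrors what freshly scheduling the
// same pods would see; with includeCurrent=false (option B) the set is
// artificially narrowed to strictly cheaper types.
func replacementCandidates(all []InstanceType, current InstanceType, includeCurrent bool) []InstanceType {
	var out []InstanceType
	for _, it := range all {
		if it.Price < current.Price || (includeCurrent && it.Name == current.Name) {
			out = append(out, it)
		}
	}
	return out
}

func main() {
	types := []InstanceType{
		{Name: "2xlarge", Price: 0.20},
		{Name: "4xlarge", Price: 0.40},
		{Name: "8xlarge", Price: 0.80},
	}
	current := types[2] // the existing 8xl spot node
	fmt.Println("option A:", replacementCandidates(types, current, true))
	fmt.Println("option B:", replacementCandidates(types, current, false))
}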

stevehipwell commented 1 year ago

@ellistarn I'd vote for time bounding (option 1) as it fixes the problem domain in a deterministic way; spot instances are created based on a cost request and then re-evaluated at a fixed interval. This leaves only a single dynamic dimension covering interruption, instead of two if cheaper instances are made available.

It's been a long day so I might be miles off on this, but wouldn't option 2 potentially expose workloads using Karpenter to interference by someone able and incentivised to manipulate the spot markets? Even if that isn't possible/likely, you might still see strange behaviour during large workload shifts in a region (batch jobs).

ellistarn commented 1 year ago

Great feedback @stevehipwell. To your second point, we update our prices periodically (not immediately), so you wouldn't see a huge impact here.

liorfranko commented 1 year ago

I agree with retrying after 2 hours with the same types, and if the same instance type is chosen again, the next cycle can be 30 minutes.

mballoni commented 1 year ago

This will be a game changer for a tool that is already awesome. I have both scenarios here:

  1. Some large pods from a data pipeline that requires a lot of resources spin up a very large spot instance. When the job is done, some other "permanent" pods may keep this node alive indefinitely, very underutilized, at a high price.

  2. Changes in the spot market and availability. I manually tuned one cluster and the cost fell from nearly 400 to 184 dollars, a roughly 50% reduction we would not have seen without manual intervention. Spot-to-spot consolidation would achieve this optimization by itself.

matti commented 1 year ago

I think the only solution is to use something like https://github.com/kubernetes-sigs/descheduler

orrj-nym commented 1 year ago

I think the only solution is to use something like https://github.com/kubernetes-sigs/descheduler

Could you please expand on that? AFAIK descheduler only evicts pods from over-utilized nodes and schedules them back to under-utilized nodes.

adiffpirate commented 1 year ago

I think the only solution is to use something like https://github.com/kubernetes-sigs/descheduler

Could you please expand on that? AFAIK descheduler only evicts pods from over-utilized nodes and schedules them back to under-utilized nodes.

The behaviour you mentioned is the "LowNodeUtilization" strategy. But you can use the HighNodeUtilization strategy to evict pods from under-utilized nodes, forcing Karpenter to schedule them again.

Speaking from experience, it works fine (the only downside being the need to install/maintain descheduler).

The only catch is configuring descheduler with a sensible interval. If the interval is too short, you will likely end up in cases where Karpenter schedules the same node (because it is currently the best according to its strategy) and descheduler keeps killing it over and over again. Personally, I like the 24h interval.

Helm values I use for descheduler:

kind: Deployment

resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi

cmdOptions:
  v: 4

deschedulingInterval: 24h

deschedulerPolicy:
  nodeSelector: "descheduler=enabled"
  evictLocalStoragePods: true
  strategies:
    RemoveDuplicates:
      enabled: false
    RemovePodsHavingTooManyRestarts:
      enabled: false
    RemovePodsViolatingNodeTaints:
      enabled: false
    RemovePodsViolatingNodeAffinity:
      enabled: false
    RemovePodsViolatingInterPodAntiAffinity:
      enabled: false
    RemovePodsViolatingTopologySpreadConstraint:
      enabled: false
    LowNodeUtilization:
      enabled: false
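    # HighNodeUtilization treats nodes using less than the thresholds below as underutilized and evicts their pods, so Karpenter can reschedule them more compactly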
    HighNodeUtilization:
      enabled: true
      params:
        nodeResourceUtilizationThresholds:
          thresholds:
            cpu: 50 # Percentage
            memory: 50 # Percentage

The nodeSelector config lets you enable/disable descheduler on a per-node basis (it will only evict pods from nodes that have the descheduler=enabled label).

orrj-nym commented 1 year ago

I can confirm this solution works as intended! Thanks!

mercuriete commented 1 year ago

@mballoni I am not an expert, but I would recommend trying a two-provisioner approach: one default provisioner for all workloads and one provisioner for your data pipeline jobs.

Then tag the provisioners and tag the deployments. The scheduler should be smart enough to place the data pipeline pods on the tagged big nodes, and the other workloads should be scheduled on other nodes with different tags.

I understand that this is more difficult compared with spot consolidation, but at least you could work around your problem.

TLDR: 2 provisioners with tags.

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

tzneal commented 8 months ago

/remove-lifecycle stale

jonathan-innis commented 7 months ago

I'm pretty sure that this is closed out now with #897! You can enable spot-to-spot consolidation on v0.34.0 of Karpenter in the AWS provider with the SpotToSpotConsolidation feature flag.

calebAtIspot commented 7 months ago

Relevant: https://karpenter.sh/v0.34/reference/settings/

Environment Variable | CLI Flag | Description
-- | -- | --
FEATURE_GATES | --feature-gates | Optional features can be enabled / disabled using feature gates. Current options are: Drift,SpotToSpotConsolidation (default = Drift=true,SpotToSpotConsolidation=false)

Also, there's an interesting caveat to this feature:

The spot consolidation would kick in only when there are a minimum of 15 instanceTypes for the new NodeClaim to replace the current spot candidate

This seems unlikely to affect you unless you heavily restrict the possible instance types.