kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0
610 stars 203 forks

Parameterize Multinode + Single node consolidation timeout #1733

Open Pokom opened 4 weeks ago

Pokom commented 4 weeks ago

Description

[!NOTE] This is similar to #903 but distinct in that this happens to our largest clusters regardless of scale up/scale down activity. #1031 was opened but closed due to needing an RFC, and I would like to work on putting that together.

What problem are you trying to solve?

Provide the ability to configure the values for multinode and single node consolidation timeouts.

In large clusters with somewhat complex nodepool setups and anti-affinity rules, we're consistently running into timeouts for the consolidation process. The impact is that our clusters are overprovisioned because nodes aren't being taken offline.

We're also profiling Karpenter to identify where the bottleneck is in the code. As a stopgap, this would be a nice feature to have.

It would be really helpful for us to be able to configure these values at runtime and set them to values that would allow the consolidation process to finish. It may not be as fast as with the default values, but having it finish slowly is preferable to having the process time out and never complete.
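
To make the request concrete, here is a rough sketch of what runtime configuration could look like. It is not a proposed implementation and does not reflect how Karpenter actually wires up its options. The environment variable names, the `timeoutFromEnv` helper, and the default values of 1 minute and 3 minutes are all assumptions for illustration only; the real defaults should be checked against the Karpenter source.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Assumed defaults, for illustration only. Check the Karpenter source
// for the values it actually uses.
const (
	defaultMultiNodeTimeout  = 1 * time.Minute
	defaultSingleNodeTimeout = 3 * time.Minute
)

// timeoutFromEnv is a hypothetical helper. It reads a duration such as
// "5m" or "90s" through lookup and returns def when the key is unset or
// empty. Malformed and non-positive values are rejected with an error.
func timeoutFromEnv(lookup func(string) (string, bool), key string, def time.Duration) (time.Duration, error) {
	raw, ok := lookup(key)
	if !ok || raw == "" {
		return def, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("parsing %s=%q: %w", key, raw, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("%s must be positive, got %s", key, d)
	}
	return d, nil
}

func main() {
	// Both variable names below are hypothetical.
	multi, err := timeoutFromEnv(os.LookupEnv, "MULTINODE_CONSOLIDATION_TIMEOUT", defaultMultiNodeTimeout)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	single, err := timeoutFromEnv(os.LookupEnv, "SINGLENODE_CONSOLIDATION_TIMEOUT", defaultSingleNodeTimeout)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("multi-node timeout: %s, single-node timeout: %s\n", multi, single)
}
```

Falling back to the defaults when nothing is set would keep existing behavior unchanged, so only clusters that hit the timeouts would need to opt in to longer values.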

How important is this feature to you?

This is very important, as Karpenter is causing a fairly large uptick in spend for large clusters because consolidation can't complete fast enough.

njtran commented 2 weeks ago

/triage accepted

Pokom commented 2 weeks ago

@njtran are you open to an outside contribution for this issue? I should have some time this week to get a PR ready.