[!NOTE]
This is similar to #903 but distinct in that this happens to our largest clusters regardless of scale up/scale down activity. #1031 was opened but closed due to needing an RFC, and I would like to work on putting that together.
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment
Description
What problem are you trying to solve?
Provide the ability to configure the timeout values for multi-node and single-node consolidation.
In large clusters with somewhat complex nodepool setups and anti-affinity rules, we're consistently running into timeouts in the consolidation process. The impact is that our clusters are overprovisioned because nodes aren't being taken offline.
We're also profiling Karpenter to identify where the bottleneck is in the code. As a stopgap, this would be a nice feature to have.
It would be really helpful for us to be able to configure these values at runtime and set them to values that would allow the consolidation process to finish. It may not be as fast as with the default values, but having it finish more slowly is preferable to having the process time out and never complete.
How important is this feature to you?
This is very important to us: because consolidation can't finish within the timeout, Karpenter is causing a fairly large uptick in spend on our large clusters.