kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Add configurable dedupe timeout to NodeFailedToDrain event #1021

Open evq opened 4 months ago

evq commented 4 months ago

Description

What problem are you trying to solve? Hey there, we are currently using spot instances provisioned via Karpenter to run some relatively expensive workloads. For cost-optimization reasons we want to run as few replicas as possible, which in some cases means a single instance. That obviously has major availability trade-offs: by default these workloads would have downtime every time a spot interruption occurs. To mitigate this, we have a PDB with minAvailable set to 1, which blocks the node from being drained prematurely. We then have a custom scaler, triggered by the NodeFailedToDrain event, that temporarily scales up to 2 replicas so the replacement pod can start gracefully before the spot termination occurs.
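
For concreteness, a minimal sketch of that pattern with client-go is below (this is not the poster's actual scaler): watch for events whose reason is NodeFailedToDrain and hold the workload at 2 replicas until no such event has been seen for a quiet period. The namespace, the Deployment name, and the 2m15s quiet period are placeholders.

```go
// Hypothetical scale-up-on-NodeFailedToDrain watcher; names and durations
// are illustrative only.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/retry"
)

func setReplicas(ctx context.Context, client kubernetes.Interface, ns, name string, n int32) error {
	// Retry on conflicts so a concurrent update to the Deployment doesn't drop the change.
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		d, err := client.AppsV1().Deployments(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		d.Spec.Replicas = &n
		_, err = client.AppsV1().Deployments(ns).Update(ctx, d, metav1.UpdateOptions{})
		return err
	})
}

func main() {
	ctx := context.Background()
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch only events whose reason is NodeFailedToDrain.
	w, err := client.CoreV1().Events("").Watch(ctx, metav1.ListOptions{
		FieldSelector: "reason=NodeFailedToDrain",
	})
	if err != nil {
		panic(err)
	}

	const quietPeriod = 135 * time.Second // ~2m15s, matching the window described above
	var lastSeen time.Time
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case _, ok := <-w.ResultChan():
			if !ok {
				return // watch closed; a real controller would re-establish it
			}
			lastSeen = time.Now()
			fmt.Println("NodeFailedToDrain seen, scaling up")
			_ = setReplicas(ctx, client, "default", "expensive-workload", 2)
		case <-ticker.C:
			if !lastSeen.IsZero() && time.Since(lastSeen) > quietPeriod {
				fmt.Println("quiet period elapsed, scaling back down")
				_ = setReplicas(ctx, client, "default", "expensive-workload", 1)
				lastSeen = time.Time{}
			}
		}
	}
}
```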

This all works quite well; however, one of the knobs on the custom scaler controls how long to wait after the last NodeFailedToDrain event before we scale back down. We currently have this set to around 2m15s, based on the default event dedupe timeout plus the apparent maximum retry time on the disruption queue. It would be nice to lower it further, but doing so seems to require changing the event dedupe time on NodeFailedToDrain. I see that there is the ability to set a per-event override; would it make sense to have some sort of configuration value that controls the dedupe time, either globally or on a per-event basis?
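
For readers unfamiliar with the mechanism being discussed, here is a rough sketch of event dedupe with a timeout; it is illustrative only and not Karpenter's actual events code. The request is essentially to make the timeout below configurable, globally or per event reason.

```go
// Illustrative dedupe: events with the same key are suppressed until
// dedupeTimeout has elapsed since the last emission.
package main

import (
	"fmt"
	"time"
)

type deduper struct {
	timeout  time.Duration
	lastSeen map[string]time.Time
}

func newDeduper(timeout time.Duration) *deduper {
	return &deduper{timeout: timeout, lastSeen: map[string]time.Time{}}
}

// shouldEmit reports whether the event keyed by reason/object has not been
// emitted within the dedupe window, and records the emission time if so.
func (d *deduper) shouldEmit(key string) bool {
	if t, ok := d.lastSeen[key]; ok && time.Since(t) < d.timeout {
		return false
	}
	d.lastSeen[key] = time.Now()
	return true
}

func main() {
	d := newDeduper(2 * time.Minute) // the kind of default window the poster is working around
	for _, key := range []string{"NodeFailedToDrain/node-a", "NodeFailedToDrain/node-a"} {
		fmt.Println(key, "emit:", d.shouldEmit(key))
	}
}
```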

How important is this feature to you? Nice to have

jonathan-innis commented 4 months ago

wondering if it would make sense to have some sort of configuration value which controls the dedupe time

What's the impact of not being able to configure this value? Is it just higher cost since you have two pods running? I would also assume that Karpenter creates a new node with enough space for both pods, rather than just for one, because we simulate capacity for pods on nodes that we know are going away.

jonathan-innis commented 4 months ago

Also, it seems like this issue might be relevant to y'all. If we just waited until the point at which the pod's terminationGracePeriod would require us to start draining in order to finish in time, would that reduce your need to orchestrate scale-up on this event at all? See https://github.com/aws/karpenter-provider-aws/issues/2917
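
A back-of-the-envelope version of that idea, with illustrative numbers (the 2-minute figure is the standard EC2 spot interruption notice; the grace-period value is made up, not taken from the issue):

```go
// If the drain only needs to begin terminationGracePeriodSeconds before the
// spot reclaim, the remaining time is slack during which the pod keeps running.
package main

import (
	"fmt"
	"time"
)

func main() {
	interruptionNotice := 2 * time.Minute // EC2 spot interruption warning
	gracePeriod := 90 * time.Second       // pod's terminationGracePeriodSeconds (example value)
	fmt.Printf("drain can be deferred for up to %s after the notice\n", interruptionNotice-gracePeriod)
}
```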

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten