kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

support drain timeout #743

Closed · grosser closed this issue 3 months ago

grosser commented 2 years ago

Tell us about your request

Sometimes nodes get stuck when a pod cannot be drained, so we have to alert ourselves and then kill it manually, which is not ideal.

It would be nice to set an optional drainTimeout so we can ensure nodes always die after x hours/days.

Are you currently working around this issue?

A helper pod that kills pods on stuck nodes so they finish draining.
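For illustration only, a minimal client-go sketch of that kind of helper (the 4h threshold, the loop structure, and the assumption that a draining node carries a deletionTimestamp are mine, not the actual workaround):

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// Hypothetical threshold: how long a node may stay stuck in "deleting" before we step in.
	drainTimeout := 4 * time.Hour

	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	zero := int64(0)
	for _, node := range nodes.Items {
		// A node being drained has a deletionTimestamp held open by its termination finalizer.
		if node.DeletionTimestamp == nil || time.Since(node.DeletionTimestamp.Time) < drainTimeout {
			continue
		}
		// Force-delete every pod still scheduled to the stuck node so the drain can finish.
		pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
			FieldSelector: "spec.nodeName=" + node.Name,
		})
		if err != nil {
			panic(err)
		}
		for _, pod := range pods.Items {
			delErr := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
				GracePeriodSeconds: &zero, // bypass graceful termination
			})
			fmt.Printf("force-deleted %s/%s: %v\n", pod.Namespace, pod.Name, delErr)
		}
	}
}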


spring1843 commented 2 years ago

What should happen after the drainTimeout has elapsed? A pod not terminating may be by design, e.g. it's waiting for an IO operation with a long timeout.

grosser commented 2 years ago

It should force-delete the pod. The flag will definitely need a "this has xyz downsides" warning, but dealing with empty nodes hanging around is also overhead, and especially in staging clusters we want to make sure cost stays under control when things go wrong.

jonathan-innis commented 2 years ago

Related to aws/karpenter#2391

ellistarn commented 1 year ago

@grosser, can you elaborate a bit on what can cause the node to get stuck? Do you have do-not-evict pods, or broken PDBs? We're currently scoping some work in this area and looking to see what makes sense here.

grosser commented 1 year ago

Broken PDBs. We as cluster admins need to roll the clusters, but some other teams left their apps in a bad state. In production that means we have to reach out to them, or at least check that their app is not critical, but in staging we'd like to just ignore it.

ellistarn commented 1 year ago

I wonder if we should just hardcode something like 10 minutes if eviction is returning a 500. Do you need to configure this value?

grosser commented 1 year ago

I don't think it's good to hardcode any timeout: by default the system should be safe for the workloads and fail the cluster rollout instead. The opt-in could be either a timeout or a true/false flag, but I think a timeout makes the most sense.

ellistarn commented 1 year ago

What would you set it to?

grosser commented 1 year ago

nothing in production and 10-15m in staging

ellistarn commented 1 year ago

The other wrinkle here is that we won't attempt (iirc, or perhaps it's just deprioritized) to deprovision a node that has a blocked PDB, so we'll need handling for both disruption and termination flows.

ellistarn commented 1 year ago

Would you want to treat a 500 (misconfigured) differently than a 429 (pdb unfulfilled)?
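For context, the two cases are distinguishable to an eviction client. A minimal client-go sketch (the tryEvict helper and its handling are illustrative, not Karpenter's actual eviction path):

package evictiondemo

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryEvict issues an API-initiated eviction and reports which failure mode occurred.
func tryEvict(ctx context.Context, client kubernetes.Interface, namespace, name string) {
	err := client.PolicyV1().Evictions(namespace).Evict(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
	})
	switch {
	case err == nil:
		fmt.Println("evicted")
	case apierrors.IsTooManyRequests(err):
		// 429: a PodDisruptionBudget currently forbids the eviction; retrying may eventually succeed.
		fmt.Println("blocked by an unfulfilled PDB (429)")
	case apierrors.IsInternalError(err):
		// 500: e.g. multiple PDBs select the pod, a misconfiguration that will not clear on its own.
		fmt.Println("misconfigured PDB (500)")
	default:
		fmt.Println("other error:", err)
	}
}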

grosser commented 1 year ago

I'd treat them the same. Either way:

Production: the owning team needs to check it out before we can proceed.
Staging: go fix your app 🤷

ellistarn commented 1 year ago

Great feedback, thank you!

sftim commented 1 year ago

Related / extra credit: manage machine drains through a custom resource, similar to how the node maintenance operator does it. This could include reporting progress via .status

njtran commented 1 year ago

How does https://kubernetes.io/blog/2023/01/06/unhealthy-pod-eviction-policy-for-pdbs/ fit into your use-case? Does it solve it? It's behind a Beta feature gate in 1.27.
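For reference, that policy is set on the PDB itself. A minimal client-go sketch (the names and the minAvailable value are placeholders):

package pdbdemo

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
)

// createPDB creates a PodDisruptionBudget that lets the eviction API remove pods that are
// already unhealthy (not Ready), even when the budget is otherwise not satisfied.
func createPDB(ctx context.Context, client kubernetes.Interface) error {
	alwaysAllow := policyv1.AlwaysAllow
	minAvailable := intstr.FromInt(2)
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "my-app", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "my-app"}},
			// Beta feature gate in 1.27, as noted above.
			UnhealthyPodEvictionPolicy: &alwaysAllow,
		},
	}
	_, err := client.PolicyV1().PodDisruptionBudgets("default").Create(ctx, pdb, metav1.CreateOptions{})
	return err
}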

grosser commented 1 year ago

It solves it for completely unhealthy applications and should not have a big impact on regular operations. It does not, however, solve the situation where only part of a PDB is unhealthy (for example, new pods are unhealthy while old pods are healthy).

mallow111 commented 1 year ago

@jonathan-innis has anyone started working on this issue yet?

garvinp-stripe commented 1 year ago

Is it possible to expose and implement a timeout in the finalizer? If nothing is set (the default) it never times out, but if one is set it should be easy enough for the finalizer to annotate the machine/node with the time the drain started, and then force-kill the node/machine once now - annotation > timeout.
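A minimal sketch of that idea (the annotation key, helper names, and timestamp format are assumptions, not an existing Karpenter API):

package termination

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// Hypothetical annotation recording when the drain began.
const drainStartedAnnotation = "example.karpenter.sh/drain-started"

// stampDrainStart is what the finalizer would call the first time it sees the node draining.
func stampDrainStart(node *corev1.Node) {
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	if _, ok := node.Annotations[drainStartedAnnotation]; !ok {
		node.Annotations[drainStartedAnnotation] = time.Now().UTC().Format(time.RFC3339)
	}
}

// drainTimedOut implements the "now - annotation > timeout" check: once it returns true,
// the finalizer would stop waiting on evictions and force-terminate the node/machine.
func drainTimedOut(node *corev1.Node, timeout time.Duration) bool {
	if timeout == 0 {
		return false // nothing set (the default): never time out
	}
	started, ok := node.Annotations[drainStartedAnnotation]
	if !ok {
		return false // not stamped yet
	}
	t, err := time.Parse(time.RFC3339, started)
	if err != nil {
		return false // unparseable stamp: fail safe and keep waiting
	}
	return time.Since(t) > timeout
}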

sftim commented 1 year ago

I found out recently that graceful deletion for custom resources (as in CRDs) doesn't really seem to be a thing. If that's right, we'd need an improvement to Kubernetes itself, and then we could indeed gracefully delete the Machine, using the future-dated removal to signal the drain timeout.

garvinp-stripe commented 1 year ago

Can you provide some details? I think you are correct; however, shouldn't a finalizer on the CRD (Machine) work equally well? I'm trying to understand what specific API you need from Kubernetes.

sftim commented 1 year ago

Graceful deletion has an associated time; finalizers don't have any time stored in the API.
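To illustrate the distinction (placeholder names, not Karpenter code): a delete request can carry a grace period that the API server records on the object, while a finalizer is just a string with nothing time-related next to it.

package gracedemo

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteWithGrace shows what the API does store: a delete request can carry a grace period,
// recorded as metadata.deletionTimestamp and metadata.deletionGracePeriodSeconds
// (honored for pods; custom resources get no such graceful-deletion behavior).
func deleteWithGrace(ctx context.Context, client kubernetes.Interface) error {
	grace := int64(300)
	return client.CoreV1().Pods("default").Delete(ctx, "my-pod", metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
	})
}

// addFinalizer shows what it does not: a finalizer is only a string, with no deadline stored
// next to it, so any drain timeout has to be tracked elsewhere (e.g. in an annotation).
func addFinalizer(node *corev1.Node) {
	node.Finalizers = append(node.Finalizers, "example.karpenter.sh/termination")
}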

garvinp-stripe commented 1 year ago

Ah ya that's unfortunate

garvinp-stripe commented 1 year ago

I think upstreaming that change would take a while. I wonder if there is a short- to mid-term solution? We might just fork karpenter-core and add the ability to the finalizer: pick up the timeout from the config map and use some kind of annotation on the machine to record the time of the deletion attempt.

njtran commented 1 year ago

We had talked about including a timeout here, inferred from the NodePool or the node itself, that would control the drain/eviction logic and forcefully evict once a drain/termination has reached the timeout. We'd essentially need to mock graceful deletion ourselves, as I don't think there's any graceful deletion for nodes. We had a PR https://github.com/aws/karpenter-core/pull/466 that explored this, but haven't had time to get to it.

wmgroot commented 11 months ago

Just for reference, this is the field that supports this behavior in ClusterAPI.

$ kubectl explain machinedeployment.spec.template.spec.nodeDrainTimeout
KIND:     MachineDeployment
VERSION:  cluster.x-k8s.io/v1beta1

FIELD:    nodeDrainTimeout <string>

DESCRIPTION:
     NodeDrainTimeout is the total amount of time that the controller will spend
     on draining a node. The default value is 0, meaning that the node can be
     drained without any time limitations. NOTE: NodeDrainTimeout is different
     from `kubectl drain --timeout`

In our case, we set this value to 24h in all clusters, production or otherwise. We make it clear to our users that they must tolerate graceful disruption of their pods and must alert themselves if their pods are unhealthy for a long period of time in production. This ensures we can keep up to date with security patches across all of our clusters.

We've discussed lowering the value from 24h to something much shorter once we have a better story for automated traffic failover between our clusters.
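For completeness, the same 24h setting expressed against the Cluster API Go types, as a minimal sketch assuming sigs.k8s.io/cluster-api/api/v1beta1 (Karpenter itself does not consume this field):

package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func main() {
	md := clusterv1.MachineDeployment{}
	// Bound how long the controller may spend draining; 0 (the default) means no limit.
	md.Spec.Template.Spec.NodeDrainTimeout = &metav1.Duration{Duration: 24 * time.Hour}
	fmt.Println(md.Spec.Template.Spec.NodeDrainTimeout.Duration) // 24h0m0s
}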

wmgroot commented 11 months ago

Should support for volume detach timeouts be opened as another issue? We use EBS CSI and find that some volumes refuse to detach due to timeouts that aren't handled well by the EBS CSI controller. We've opened several issues there, but these problems still occur after years of using EBS CSI. This blocks removal of nodes and can greatly delay the process of updating nodes in a cluster if left to manual intervention.

$ kubectl explain machinedeployment.spec.template.spec.nodeVolumeDetachTimeout --context mgmt-use2
KIND:     MachineDeployment
VERSION:  cluster.x-k8s.io/v1beta1

FIELD:    nodeVolumeDetachTimeout <string>

DESCRIPTION:
     NodeVolumeDetachTimeout is the total amount of time that the controller
     will spend on waiting for all volumes to be detached. The default value is
     0, meaning that the volumes can be detached without any time limitations.

garvinp-stripe commented 10 months ago

I think, in general, cluster operators want more control over how node shutdown happens, from timeouts to how the drain itself is performed.

Wondering if we should prioritize https://github.com/kubernetes-sigs/karpenter/issues/740 to expand how node termination happens and expose more control for users?

ellistarn commented 10 months ago

Supportive of a node drain timeout proposal. Anyone want to take this on?

wmgroot commented 10 months ago

I'm interested; I opened a PR with a design for this above.

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

garvinp-stripe commented 7 months ago

/remove-lifecycle stale

jonathan-innis commented 7 months ago

cc: @akestner

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jonathan-innis commented 4 months ago

/remove-lifecycle stale