kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Feature enforceExpireAfter - respect to expireAfter ttl #1789

Open ArieLevs opened 3 weeks ago

ArieLevs commented 3 weeks ago

Description

Karpenter has the expireAfter feature. Let's assume I've configured a 24h value for expireAfter: once 24 hours have passed the node should be terminated, but instead Karpenter reports:

  Type     Reason             Age                  From       Message
  ----     ------             ----                 ----       -------
  Warning  FailedDraining     27m (x533 over 18h)  karpenter  Failed to drain node, 7 pods are waiting to be evicted
  Normal   DisruptionBlocked  67s (x525 over 18h)  karpenter  Cannot disrupt Node: state node is marked for deletion

This is because one or more of the pods on that node have the karpenter.sh/do-not-disrupt: true annotation. The result is a node tainted with karpenter.sh/disrupted:NoSchedule: no new pods will be scheduled onto it, and it's in a "stuck" state.

I would like a way to force Karpenter to spin up new nodes even if I use this annotation. Would it be reasonable to add an enforceExpireAfter: true|false (default false) feature, so that if it is set to true, Karpenter will ignore/remove the do-not-disrupt annotation and just delete the node?

What problem are you trying to solve? Forcefully delete nodes after the expireAfter TTL has passed.

How important is this feature to you? Very; the lack of this feature results in underutilized nodes that cannot be deleted automatically.


jmdeal commented 3 weeks ago

Have you considered using terminationGracePeriod?

ArieLevs commented 3 weeks ago

> Have you considered using terminationGracePeriod?

Yes, we use a 2h terminationGracePeriod value. This does not work, probably because a delete call is never even made against the node (it's just "marked for deletion" by Karpenter).

ArieLevs commented 3 weeks ago

@jmdeal after reading the TerminationGracePeriod documentation again, it states:

For instance, a NodeClaim with terminationGracePeriod set to 1h and an expireAfter set to 23h will begin draining after it’s lived for 23h. Let’s say a do-not-disrupt pod has TerminationGracePeriodSeconds set to 300 seconds. If the node hasn’t been fully drained after 55m, Karpenter will delete the pod to allow it’s full terminationGracePeriodSeconds to cleanup. If no pods are blocking draining, Karpenter will cleanup the node as soon as the node is fully drained, rather than waiting for the NodeClaim’s terminationGracePeriod to finish.

So in my case:

- terminationGracePeriod set to 1h: true
- expireAfter set to 23h: true
- a do-not-disrupt pod has TerminationGracePeriodSeconds set to 300 seconds: true (0 seconds)
- but "Karpenter will delete the pod to allow it’s full terminationGracePeriodSeconds to cleanup": false

Should this issue be changed to a bug instead of a feature request?
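The timeline in the quoted documentation can be checked with a little arithmetic. The sketch below is only an illustration of that description, not Karpenter's implementation; the function name `drain_timeline` and its parameter names are hypothetical, and the values are the ones from the documentation's example (23h, 1h, 300 seconds).

```python
from datetime import timedelta


def drain_timeline(expire_after, termination_grace_period, pod_grace_period):
    """Return offsets, measured from NodeClaim creation, for the documented steps.

    Hypothetical helper: it models the documentation's description only.
    """
    drain_starts = expire_after
    # Latest point at which the node should be gone.
    drain_deadline = expire_after + termination_grace_period
    # A blocking pod is deleted early enough to still get its full grace period.
    pod_deleted = drain_deadline - pod_grace_period
    return {
        "drain_starts": drain_starts,
        "pod_deleted": pod_deleted,
        "drain_deadline": drain_deadline,
    }


timeline = drain_timeline(
    expire_after=timedelta(hours=23),
    termination_grace_period=timedelta(hours=1),
    pod_grace_period=timedelta(seconds=300),
)
minutes_into_drain = (
    timeline["pod_deleted"] - timeline["drain_starts"]
) / timedelta(minutes=1)
print(minutes_into_drain)  # 55.0, matching the "after 55m" in the documentation
```

Under the same model, a node whose age is far beyond `expire_after + termination_grace_period` and which is still draining is not behaving as the documentation describes, which is the situation reported here.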

jmdeal commented 2 weeks ago

Yes, if the node has been draining for longer than your terminationGracePeriod, this would be a bug, not a feature. TGP should enforce a maximum grace time, which should meet your use case. Are you able to share the Karpenter logs / events that were emitted?

/kind bug
/triage needs-information

jmdeal commented 2 weeks ago

/remove-kind feature

ArieLevs commented 2 weeks ago

Sure 👍, I will add logs from historical data by early next week (but will probably have fresh info from Sunday/Monday).

thanks

ArieLevs commented 2 weeks ago

An example from today, for a node that is part of a NodePool with expireAfter: 720h (30 days); the node has been alive for 77d:

Events:
  Type     Reason             Age                      From       Message
  ----     ------             ----                     ----       -------
  Warning  FailedDraining     5m54s (x1561 over 2d4h)  karpenter  Failed to drain node, 8 pods are waiting to be evicted
  Normal   DisruptionBlocked  58s (x1510 over 2d4h)    karpenter  Cannot disrupt Node: state node is marked for deletion

endless logs of:

{"body":"Failed to drain node, 8 pods are waiting to be evicted","severity":"Warning","attributes":{"k8s.event.action":"","k8s.event.count":1546,"k8s.event.name":"ip-10-235-51-74.ec2.internal.1804f311a19b7e76","k8s.event.reason":"FailedDraining","k8s.event.start_time":"2024-11-07 00:17:41 +0000 UTC","k8s.event.uid":"73e98ab4-b698-4f16-90f1-db050a48d744","k8s.namespace.name":""},"resources":{"k8s.node.name":"","k8s.object.api_version":"v1","k8s.object.fieldpath":"","k8s.object.kind":"Node","k8s.object.name":"ip-10-235-51-74.ec2.internal","k8s.object.resource_version":"965976639","k8s.object.uid":"5abd9407-e06d-4e80-b0db-c48444e4f414"}}

{"body":"Cannot disrupt Node: state node is marked for deletion","severity":"Normal","attributes":{"k8s.event.action":"","k8s.event.count":1501,"k8s.event.name":"ip-10-235-51-74.ec2.internal.1804f3123cd3ae94","k8s.event.reason":"DisruptionBlocked","k8s.event.start_time":"2024-11-06 02:22:17 +0000 UTC","k8s.event.uid":"4d6ddbff-1a4b-4fa7-8eb3-dd8ba0c37753","k8s.namespace.name":""},"resources":{"k8s.node.name":"","k8s.object.api_version":"v1","k8s.object.fieldpath":"","k8s.object.kind":"Node","k8s.object.name":"ip-10-235-51-74.ec2.internal","k8s.object.resource_version":"963917187","k8s.object.uid":"5abd9407-e06d-4e80-b0db-c48444e4f414"}}
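For anyone triaging log lines like the two above, a minimal sketch of pulling out the relevant fields follows. It assumes each line is a standalone JSON object with the same keys as shown above; the `summarize` helper is hypothetical, and the sample line is a shortened copy of the first log entry with unused fields dropped.

```python
import json
from datetime import datetime

# Shortened copy of the FailedDraining log line above (unused fields dropped).
line = (
    '{"body":"Failed to drain node, 8 pods are waiting to be evicted",'
    '"severity":"Warning",'
    '"attributes":{"k8s.event.count":1546,'
    '"k8s.event.reason":"FailedDraining",'
    '"k8s.event.start_time":"2024-11-07 00:17:41 +0000 UTC"},'
    '"resources":{"k8s.object.kind":"Node",'
    '"k8s.object.name":"ip-10-235-51-74.ec2.internal"}}'
)


def summarize(raw):
    """Extract node name, event reason, repeat count, and start time from one line."""
    event = json.loads(raw)
    attrs = event["attributes"]
    # The timestamp layout is taken from the log lines above.
    started = datetime.strptime(
        attrs["k8s.event.start_time"], "%Y-%m-%d %H:%M:%S %z UTC"
    )
    return {
        "node": event["resources"]["k8s.object.name"],
        "reason": attrs["k8s.event.reason"],
        "count": attrs["k8s.event.count"],
        "started": started,
    }


summary = summarize(line)
print(summary["node"], summary["reason"], summary["count"], summary["started"].isoformat())
```

Grouping such summaries by node and reason makes it easier to see how long a given node has been emitting FailedDraining events.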

This node runs 8 pods: 7 of them are DaemonSet pods, and one is a Deployment pod with the karpenter.sh/do-not-disrupt: true annotation. That pod has the following events:

Events:
  Type    Reason     Age                   From       Message
  ----    ------     ----                  ----       -------
  Normal  Nominated  29m (x12 over 3h32m)  karpenter  Pod should schedule on: nodeclaim/default-jv5sb, node/NODE-A
  Normal  Nominated  3m2s (x806 over 27h)  karpenter  Pod should schedule on: nodeclaim/default-c8h6l, node/NODE-B

Note that the nodes the above pod was nominated for (i.e. NODE-A and NODE-B) show the following events (both have been alive for just under 4 days):

Events:
  Type    Reason             Age                    From       Message
  ----    ------             ----                   ----       -------
  Normal  DisruptionBlocked  75s (x1511 over 2d4h)  karpenter  Cannot disrupt Node: state node is nominated for a pending pod