kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Taint nodes before deletion #621

Open hamishforbes opened 1 year ago

hamishforbes commented 1 year ago

Tell us about your request

When removing nodes due to consolidation I would like to be able to apply a taint to the node before it is removed.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

The reason for this is to be able to gracefully stop DaemonSet pods; see the related issues below.

I have Consul agents running on nodes via a DaemonSet, and these agents join the Consul cluster. If they are just killed, they sit around in the cluster as failed; if the pod is given a stop signal, it will gracefully leave the cluster and then exit.

When a node is just deleted it leaves a bunch of hanging agents in my Consul cluster. Applying a NoExecute taint prior to deletion will evict those pods.

System DaemonSets (e.g. kube-proxy) tolerate all taints, so this won't evict those.
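
For context, that works because a toleration with operator: Exists and no key matches every taint. A minimal sketch of that catch-all pattern (illustrative, not copied from any manifest in this thread):

tolerations:
- operator: Exists  # empty key + Exists matches all taints, so even a custom NoExecute taint never evicts this pod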

Are you currently working around this issue?

Without Karpenter, nodes are generally only removed:

a) Manually, in which case I manually taint the node with a NoExecute taint (a sketch of the command is just below)
b) By the node-termination-handler, which is configured to add a taint as well
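
For (a), that manual taint is a single command; the key name here is purely an illustrative placeholder:

kubectl taint nodes <node-name> consul-shutdown=true:NoExecute  # hypothetical key with no matching toleration, so the Consul DaemonSet pod is evicted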

With Karpenter... well the workaround is to manually clear out failed nodes from my Consul cluster or get this feature added!

Additional Context

https://github.com/aws/aws-node-termination-handler/issues/273
https://github.com/kubernetes/kubernetes/issues/75482

ellistarn commented 1 year ago

When a node is just deleted it leaves a bunch of hanging agents in my Consul cluster.

This is a bit confusing to me -- the term cluster appears to mean consul cluster, but isn't always disambiguated from kubernetes cluster. If I understand correctly, you don't have hanging pods in your Kubernetes cluster, but you do have hanging agents in your Consul cluster. Is it correct that the way for Consul agents to clean up is with termination logic in the pod that hosts it?

Consolidation relies on a unified "termination controller" in karpenter, so its cordon+drain logic is identical to other forms of termination (e.g. expiry).

Every node we terminate undergoes the following process:

  1. Taint the node
  2. Identify evictable pods
  3. Pods in "Succeeded" or "Failed" are ignored.
  4. Pods that are "stuck terminating" (i.e. beyond deletion timestamp) are ignored
  5. Pods that tolerate Unschedulable=NoSchedule are ignored, since it would trigger an eviction loop

It's not clear to me which of these steps you're falling under, but I don't believe tainting the node would solve your problem. We should be issuing an evict for any running Consul pod and allowing it to clean up in its GracefulTerminationPeriod. Have you been able to observe this in action for more details?
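
To make step 5 concrete, here is a minimal sketch, assuming the step 1 taint is the standard node.kubernetes.io/unschedulable taint that cordoning applies (step 5's wording and the kubectl output later in this thread both point that way). DaemonSet pods automatically carry a matching toleration, which is what lands them in the ignored bucket:

# on the cordoned node
spec:
  unschedulable: true
  taints:
  - key: node.kubernetes.io/unschedulable
    effect: NoSchedule

# on every DaemonSet pod, added automatically by the DaemonSet controller
tolerations:
- key: node.kubernetes.io/unschedulable
  operator: Exists
  effect: NoSchedule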

hamishforbes commented 1 year ago

This is a bit confusing to me -- the term cluster appears to mean consul cluster, but isn't always disambiguated from kubernetes cluster. If I understand correctly, you don't have hanging pods in your Kubernetes cluster, but you do have hanging agents in your Consul cluster. Is it correct that the way for Consul agents to clean up is with termination logic in the pod that hosts it?

Yes, correct. Sorry, I should've been clearer about when I was referring to the Kubernetes cluster and when to the Consul cluster. Overloaded terms!

Consolidation relies on a unified "termination controller" in karpenter, so its cordon+drain logic is identical to other forms of termination (e.g. expiry).

Every node we terminate undergoes the following process:

  1. Taint the node
  2. Identify evictable pods
  3. Pods in "Succeeded" or "Failed" are ignored.
  4. Pods that are "stuck terminating" (i.e. beyond deletion timestamp) are ignored
  5. Pods that tolerate Unschedulable=NoSchedule are ignored, since it would trigger an eviction loop

It's not clear to me which of these steps you're falling under, but I don't believe tainting the node would solve your problem. We should be issuing an evict for any running Consul pod and allowing it to clean up in its GracefulTerminationPeriod. Have you been able to observe this in action for more details?

The Consul Agent pods are part of a DaemonSet and therefore won't be explicitly evicted.

In the specific cases I've seen, the node is considered empty because the DaemonSet pods are filtered out, so it jumps straight to being deleted.

But even if the node were being replaced, Karpenter won't explicitly evict DaemonSet pods, which is a very similar problem to the kubectl drain issue I linked to. The 'classic' cluster autoscaler also suffers from this issue.

Just deleting the node might be fine? Does the kubelet send a termination signal to all the pods when a node is deleted? I can't seem to find out if this is true or not. Or is there a race and Karpenter is then terminating the EC2 instance before the pods can be gracefully shut down?

Applying a NoExecute taint with a custom key to the node before deleting it ensures the Consul pods are gracefully terminated before the node is removed from the Kubernetes cluster and before the EC2 instance is terminated.

On second glance, I have Kubernetes nodes created by Karpenter that have correctly and gracefully left the Consul cluster, but they were nodes that did actual work. And there are other Karpenter-provisioned Kubernetes nodes that show as failed in the Consul cluster, but those are nodes that had a lifetime of less than 2 minutes (a separate problem, which I think is solved by increasing the batch duration).

bwagner5 commented 1 year ago

I was about to recommend using K8s graceful node shutdown to give your consul pods time to deregister. However, it appears it doesn't work with the version of systemd (219) shipped with Amazon Linux 2 (https://github.com/kubernetes/kubernetes/issues/107043#issuecomment-997546019).

Seems like Graceful node shutdown would be the way to go for this type of issue though if we can get systemd updated or when AL2022 is supported. I believe it does work with Ubuntu.
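
For reference, graceful node shutdown is controlled by two kubelet configuration fields; a minimal sketch of the relevant KubeletConfiguration fragment (durations are illustrative, not recommendations):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# total time the kubelet delays node shutdown (it holds a systemd inhibitor lock for this long)
shutdownGracePeriod: 2m
# portion of that window reserved for critical pods; defaults to 0
shutdownGracePeriodCriticalPods: 30s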

ellistarn commented 1 year ago

The Consul Agent pods are part of a DaemonSet and therefore won't be explicitly evicted.

Do the consul agents tolerate NotReady/Unreachable taints? If you remove that toleration, then things should just work. Karpenter only looks at whether pods will reschedule, not at whether they're owned by a DaemonSet.

hamishforbes commented 1 year ago

I was about to recommend using K8s graceful node shutdown to give your consul pods time to deregister. However, it appears it doesn't work with the version of systemd (219) shipped with Amazon Linux 2 (kubernetes/kubernetes#107043 (comment)).

Seems like Graceful node shutdown would be the way to go for this type of issue though if we can get systemd updated or when AL2022 is supported. I believe it does work with Ubuntu.

Oh, interesting! That does look like the more correct solution. Tainting is definitely a hack/workaround.

In addition to having a systemd version that actually works with this, the shutdownGracePeriod would need to be changed from its default of 0. It looks like it is supported via the kubelet config file, but it's not included as one of the options you can set via the Provisioner CRD? So the only option currently would be messing around with the UserData in the AWSNodeTemplate?

Do the consul agents tolerate NotReady/Unreachable taints? If you remove that toleration, then things should just work. Karpenter only looks at whether pods will reschedule, not at whether they're owned by a DaemonSet.

Yeah, DaemonSet pods magically tolerate taints for unschedulable / not-ready, etc.

The tolerations aren't in the spec for the DS but are for the pods, which is why the NTH can add a custom taint that isn't tolerated.

> kubectl get ds consul-consul-client -oyaml | ggrep -A 5 tolerations
      tolerations:
      - key: eck_cluster
        operator: Exists
      - key: prometheus
        operator: Exists
      volumes:
> kubectl get pod consul-consul-client-vp2ph -oyaml | grep -A 26 tolerations
  tolerations:
  - key: eck_cluster
    operator: Exists
  - key: prometheus
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:

bwagner5 commented 1 year ago

Oh, interesting! That does look like the more correct solution. Tainting is definitely a hack/workaround.

In addition to having a systemd version that actually works with this, the shutdownGracePeriod would need to be changed from its default of 0. It looks like it is supported via the kubelet config file, but it's not included as one of the options you can set via the Provisioner CRD? So the only option currently would be messing around with the UserData in the AWSNodeTemplate?

It's pretty trivial to plumb the config through the AWSNodeTemplate, but we'll probably wait on that until the EKS Optimized AMI supports the configuration, which will probably be when we migrate to AL2022. I don't believe Bottlerocket supports it either.

It's pretty easy to patch it within your user-data in the AWSNodeTemplate if you wanted to try it with Ubuntu or build your own AMI with an updated systemd.

I think this would work (although I haven't tested it yet):

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: graceful-shutdown
spec:
  amiFamily: Ubuntu
  subnetSelector:
    karpenter.sh/discovery: my-cluster
  securityGroupSelector:
    karpenter.sh/discovery: my-cluster
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="BOUNDARY"

    --BOUNDARY
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
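    # Enable graceful node shutdown by rewriting shutdownGracePeriod in the kubelet config in place.
    # Assumes the kubelet config lives at this path on the chosen AMI; adjust for other AMI layouts.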
    echo "$(jq '.shutdownGracePeriod="2m"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json

    --BOUNDARY--
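
One way to sanity-check that graceful node shutdown took effect on a node built from this template (assuming you can get a shell on it) is to look for the kubelet's shutdown inhibitor lock:

systemd-inhibit --list  # once shutdownGracePeriod > 0, kubelet should hold a "delay" mode inhibitor for shutdown
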
sftim commented 1 year ago

I think we'd also like to taint nodes the moment that Karpenter thinks those nodes have become eligible for consolidation.

This lets us quickly untaint them if we see unschedulable Pods that we think might fit there. Otherwise, we'd leave the taint in place through the node drain, shutdown and termination.

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle rotten
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

jmdeal commented 6 months ago

/remove-lifecycle rotten

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jmdeal commented 3 months ago

/remove-lifecycle stale

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale