Open hamishforbes opened 1 year ago
When a node is just deleted it leaves a bunch of hanging agents in my Consul cluster.
This is a bit confusing to me -- the term cluster appears to mean consul cluster, but isn't always disambiguated from kubernetes cluster. If I understand correctly, you don't have hanging pods in your Kubernetes cluster, but you do have hanging agents in your Consul cluster. Is it correct that the way for Consul agents to clean up is with termination logic in the pod that hosts it?
Consolidation relies on a unified "termination controller" in karpenter, so its cordon+drain logic is identical to other forms of termination (e.g. expiry).
Every node we terminate undergoes the following process:
- Taint the node
- Identify evictable pods
  - Pods in "Succeeded" or "Failed" are ignored
  - Pods that are "stuck terminating" (i.e. beyond deletion timestamp) are ignored
  - Pods that tolerate Unschedulable=NoSchedule are ignored, since it would trigger an eviction loop

It's not clear to me which of these steps you're falling under, but I don't believe tainting the node would solve your problem. We should be issuing an evict for any running Consul pod and allowing it to clean up in its GracefulTerminationPeriod. Have you been able to observe this in action for more details?
This is a bit confusing to me -- the term cluster appears to mean consul cluster, but isn't always disambiguated from kubernetes cluster. If I understand correctly, you don't have hanging pods in your Kubernetes cluster, but you do have hanging agents in your Consul cluster. Is it correct that the way for Consul agents to clean up is with termination logic in the pod that hosts it?
Yes, correct. Sorry, I should've been clearer about when I was referring to the Kubernetes cluster and when to the Consul cluster. Overloaded terms!
Consolidation relies on a unified "termination controller" in karpenter, so its cordon+drain logic is identical to other forms of termination (e.g. expiry).
Every node we terminate undergoes the following process:
- Taint the node
- Identify evictable pods
- Pods in "Succeeded" or "Failed" are ignored.
- Pods that are "stuck terminating" (i.e. beyond deletion timestamp) are ignored
- Pods that tolerate Unschedulable=NoSchedule are ignored, since it would trigger an eviction loop
It's not clear to me which of these steps you're falling under, but I don't believe tainting the node would solve your problem. We should be issuing an evict for any running Consul pod and allowing it to clean up in its GracefulTerminationPeriod. Have you been able to observe this in action for more details?
The Consul Agent pods are part of a DaemonSet and therefore won't be explicitly evicted.
In the specific cases I've seen, the node is considered empty because the DaemonSet pods are filtered out, so it jumps straight to being deleted.
But even if the node were being replaced, Karpenter won't explicitly evict DaemonSet pods, which is a very similar problem to the kubectl drain issue I linked to.
The 'classic' cluster autoscaler also suffers from this issue.
Just deleting the node might be fine? Does the kubelet send a termination signal to all the pods when a node is deleted? I can't seem to find out if this is true or not. Or is there a race and Karpenter is then terminating the EC2 instance before the pods can be gracefully shut down?
Applying a NoExecute taint with a custom key to the node before deleting it ensures the Consul pods are gracefully terminated before the node is removed from the Kubernetes cluster and before the EC2 instance is terminated.
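A minimal sketch of that workaround (the taint key below is just an example, not something Karpenter applies itself):

# Apply a custom NoExecute taint before removing the node. DaemonSet pods don't
# tolerate unknown taint keys, so the Consul agent pod is evicted and gets its
# normal terminationGracePeriodSeconds to deregister from the Consul cluster.
kubectl taint nodes <node-name> example.com/pre-termination=true:NoExecute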
On second glance, I have Kubernetes nodes created by Karpenter that have correctly and gracefully left the Consul cluster, but they were nodes that did actual work.
And other Karpenter-provisioned Kubernetes nodes that are failed in the Consul cluster, but those had a lifetime of less than 2 minutes (a separate problem, I think, solved by increasing the batch duration).
I was about to recommend using K8s graceful node shutdown to give your consul pods time to deregister. However, it appears it doesn't work with the version of systemd (219) shipped with Amazon Linux 2 (https://github.com/kubernetes/kubernetes/issues/107043#issuecomment-997546019).
Seems like Graceful node shutdown would be the way to go for this type of issue though if we can get systemd updated or when AL2022 is supported. I believe it does work with Ubuntu.
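For anyone checking whether their AMI can actually use this, a rough sketch (assuming the kubelet's graceful node shutdown feature is enabled; output will vary by distro):

# systemd needs to be recent enough to support the delay inhibitor locks kubelet relies on
systemd --version
# With graceful node shutdown active, the kubelet should hold a shutdown inhibitor lock here
systemd-inhibit --list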
The Consul Agent pods are part of a DaemonSet and therefore won't be explicitly evicted.
Do the consul agents tolerate NotReady/Unreachable taints? If you remove that toleration, then things should just work. Karpenter only looks at pods that reschedule, not at whether or not its owned by a daemonset.
I was about to recommend using K8s graceful node shutdown to give your consul pods time to deregister. However, it appears it doesn't work with the version of systemd (219) shipped with Amazon Linux 2 (kubernetes/kubernetes#107043 (comment)).
Seems like Graceful node shutdown would be the way to go for this type of issue though if we can get systemd updated or when AL2022 is supported. I believe it does work with Ubuntu.
Oh, interesting! That does look like the more correct solution. Tainting is definitely a hack/workaround.
In addition to having a systemd version that actually works with this, the shutdownGracePeriod would need to be changed from its default of 0.
Looks like it is supported via the kubelet config file, but it's not included as one of the options you can set via the Provisioner CRD? So the only option currently would be messing around with the UserData in the AWSNodeTemplate?
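For context, this is roughly what the relevant kubelet configuration looks like (a sketch of the upstream KubeletConfiguration fields; the durations are only examples):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# total time the node delays shutdown so pods can terminate (defaults to 0, i.e. disabled)
shutdownGracePeriod: 2m
# portion of that window reserved for critical pods
shutdownGracePeriodCriticalPods: 30s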
Do the consul agents tolerate NotReady/Unreachable taints? If you remove that toleration, then things should just work. Karpenter only looks at pods that reschedule, not at whether or not its owned by a daemonset.
Yeah, DaemonSets magically tolerate taints for unschedulable / not-ready etc.
The tolerations aren't in the spec for the DaemonSet itself but are added to its pods, which is why the NTH can add a custom taint that isn't tolerated.
> kubectl get ds consul-consul-client -oyaml | ggrep -A 5 tolerations
tolerations:
- key: eck_cluster
operator: Exists
- key: prometheus
operator: Exists
volumes:
> kubectl get pod consul-consul-client-vp2ph -oyaml | grep -A 26 tolerations
tolerations:
- key: eck_cluster
operator: Exists
- key: prometheus
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/disk-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/pid-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/unschedulable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/network-unavailable
operator: Exists
volumes:
Oh, interesting! That does look like the more correct solution. Tainting is definitely a hack/workaround.
In addition to having a systemd version that actually works with this, the shutdownGracePeriod would need to be changed from its default of 0. Looks like it is supported via the kubelet config file, but it's not included as one of the options you can set via the Provisioner CRD? So the only option currently would be messing around with the UserData in the AWSNodeTemplate?
It's pretty trivial to plumb the config through the AWSNodeTemplate, but we'll probably wait on that until the EKS Optimized AMI supports the configuration, which will probably be when we migrate to AL2022. I don't believe Bottlerocket supports it either.
It's pretty easy to patch it within your user-data in the AWSNodeTemplate if you wanted to try it with Ubuntu or build your own AMI with an updated Systemd.
I think this would work (although haven't tested yet):
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: graceful-shutdown
spec:
  amiFamily: Ubuntu
  subnetSelector:
    karpenter.sh/discovery: my-cluster
  securityGroupSelector:
    karpenter.sh/discovery: my-cluster
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="BOUNDARY"

    --BOUNDARY
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    echo "$(jq '.shutdownGracePeriod="2m"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json

    --BOUNDARY--
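If you do try it, one way to sanity-check that the patched value took effect on a live node is the kubelet's configz endpoint (the node name is a placeholder):

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | jq '.kubeletconfig.shutdownGracePeriod'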
I think we'd also like to taint nodes the moment that Karpenter thinks those nodes have become eligible for consolidation.
This lets us quickly untaint them if we see unschedulable Pods that we think might fit there. Otherwise, we'd leave the taint in place through the node drain, shutdown and termination.
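As a rough sketch of what that could look like on a candidate node (the taint key here is hypothetical, not an existing Karpenter taint):

# A node considered for consolidation gets a NoSchedule taint; it would be removed
# again if pending pods could fit here, otherwise it stays through drain and termination.
spec:
  taints:
  - key: example.karpenter.sh/consolidation-candidate
    effect: NoSchedule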
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Tell us about your request
When removing nodes due to consolidation I would like to be able to apply a taint to the node before it is removed.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
The reason for this is to be able to gracefully stop DaemonSet pods; see the related issues below.
I have Consul agents running on nodes via a DaemonSet; these agents join the Consul cluster. If they are just killed then they sit around in the cluster as failed; if the pod is given a stop signal then it will gracefully leave the cluster and then exit.
When a node is just deleted it leaves a bunch of hanging agents in my Consul cluster. Applying a NoExecute taint prior to deletion will evict those pods. System DaemonSets (e.g. kube-proxy) tolerate all taints and so this won't evict those.
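To illustrate that last point: a toleration with no key and operator: Exists matches every taint, which is why pods like kube-proxy would survive a custom NoExecute taint:

tolerations:
# an empty key with Exists matches all taint keys, values and effects
- operator: Exists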
Are you currently working around this issue?
Without Karpenter, nodes are generally only removed: a) manually, in which case I manually taint the node with a NoExecute taint; or b) by the node-termination-handler, which is configured to add a taint as well.
With Karpenter... well, the workaround is to manually clear out failed nodes from my Consul cluster, or get this feature added!
Additional Context
https://github.com/aws/aws-node-termination-handler/issues/273 https://github.com/kubernetes/kubernetes/issues/75482
Attachments
No response