Hi.
If I'm right, setting the new --daemonset-eviction-for-occupied-nodes=false parameter, introduced in version 1.22 of the cluster-autoscaler, might handle this scenario: the daemonsets are then not evicted at all, so they are never stopped gracefully but are simply killed when the node terminates.
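If you want to try it, my understanding is that it goes alongside the other flags in the cluster-autoscaler deployment's container command, something like this (the surrounding args here are just placeholders):
- command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --daemonset-eviction-for-occupied-nodes=false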
I'm adding to this issue because it is similar enough to the request I was about to make: I would like a way to have the daemonsets stop gracefully, but only after all the "normal" pods have completed. As far as I can tell, there is currently no way to do that.
We have application deployments with a preStop lifecycle hook to make sure they complete their work before they shut down. We also have supporting daemonsets such as node-local-dns, kiam, and fluent-bit providing DNS, AWS IAM access, and logging for the application pods. However, when cluster autoscaler chooses a node to scale in, the daemonsets are terminated before the application pods have finished, resulting in various errors (such as not being able to resolve hostnames).
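For reference, the shape of those hooks is roughly this (simplified; the script path is a made-up placeholder, and the hook has to finish within the pod's terminationGracePeriodSeconds):
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "/app/finish-work.sh"]   # hypothetical: waits for in-flight work to drain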
To prove that the daemonsets were being terminated too early, I set up a test. I created a dedicated instance group and deployed a testing-daemonset daemonset, a testing-balloon deployment, and a testing-app deployment.
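A minimal sketch of how the two deployments could look for a test like this (the images, resource requests, and sleep durations are all my assumptions; testing-daemonset can be any simple daemonset):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: testing-balloon
spec:
  replicas: 2
  selector:
    matchLabels:
      app: testing-balloon
  template:
    metadata:
      labels:
        app: testing-balloon
    spec:
      containers:
      - name: balloon
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"            # sized so the balloon pods keep the node occupied
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: testing-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: testing-app
  template:
    metadata:
      labels:
        app: testing-app
    spec:
      terminationGracePeriodSeconds: 600   # long enough for the preStop hook to run
      containers:
      - name: app
        image: busybox
        command: ["sh", "-c", "sleep 2147483647"]
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 300"]   # simulate slow shutdown work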
After deploying the above it looks like so:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
testing-app-75546f8c56-xfdl8 1/1 Running 0 14s 10.194.49.71 ip-10-194-40-113.ap-southeast-2.compute.internal <none> <none>
testing-balloon-7b5466b9b4-w6sb8 1/1 Running 0 30s 10.194.39.197 ip-10-194-54-148.ap-southeast-2.compute.internal <none> <none>
testing-balloon-7b5466b9b4-zrd29 1/1 Running 0 30s 10.194.37.62 ip-10-194-54-148.ap-southeast-2.compute.internal <none> <none>
testing-daemonset-czfbr 1/1 Running 0 17m 10.194.49.193 ip-10-194-40-113.ap-southeast-2.compute.internal <none> <none>
testing-daemonset-f84d9 1/1 Running 0 97s 10.194.47.17 ip-10-194-54-148.ap-southeast-2.compute.internal <none> <none>
I then edit the testing-balloon deployment so that there is only 1 replica.
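For example:
$ kubectl scale deployment testing-balloon --replicas=1
$ kubectl get pods -o wide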
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
testing-app-75546f8c56-xfdl8 1/1 Running 0 26m 10.194.49.71 ip-10-194-40-113.ap-southeast-2.compute.internal <none> <none>
testing-balloon-7b5466b9b4-zrd29 1/1 Running 0 26m 10.194.37.62 ip-10-194-54-148.ap-southeast-2.compute.internal <none> <none>
testing-daemonset-czfbr 1/1 Running 0 43m 10.194.49.193 ip-10-194-40-113.ap-southeast-2.compute.internal <none> <none>
testing-daemonset-f84d9 1/1 Running 0 27m 10.194.47.17 ip-10-194-54-148.ap-southeast-2.compute.internal <none> <none>
I then waited for cluster autoscaler to start a scale in and caught it at this point:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
testing-app-75546f8c56-dbmst 1/1 Running 0 30s 10.194.39.197 ip-10-194-54-148.ap-southeast-2.compute.internal <none> <none>
testing-app-75546f8c56-xfdl8 1/1 Terminating 0 28m 10.194.49.71 ip-10-194-40-113.ap-southeast-2.compute.internal <none> <none>
testing-balloon-7b5466b9b4-zrd29 1/1 Running 0 28m 10.194.37.62 ip-10-194-54-148.ap-southeast-2.compute.internal <none> <none>
testing-daemonset-czfbr 1/1 Terminating 0 45m 10.194.49.193 ip-10-194-40-113.ap-southeast-2.compute.internal <none> <none>
testing-daemonset-f84d9 1/1 Running 0 29m 10.194.47.17 ip-10-194-54-148.ap-southeast-2.compute.internal <none> <none>
Here you can see a new testing-app pod has started on the same node where the remaining testing-balloon pod is running, and the old testing-app pod is Terminating. But the testing-daemonset pod on that node is Terminating too.
So at this point cluster-autoscaler has evicted both the "normal" pods and the daemonsets.
Then a bit later I see this:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
testing-app-75546f8c56-dbmst 1/1 Running 0 39s 10.194.39.197 ip-10-194-54-148.ap-southeast-2.compute.internal <none> <none>
testing-app-75546f8c56-xfdl8 1/1 Terminating 0 28m 10.194.49.71 ip-10-194-40-113.ap-southeast-2.compute.internal <none> <none>
testing-balloon-7b5466b9b4-zrd29 1/1 Running 0 28m 10.194.37.62 ip-10-194-54-148.ap-southeast-2.compute.internal <none> <none>
testing-daemonset-f84d9 1/1 Running 0 30m 10.194.47.17 ip-10-194-54-148.ap-southeast-2.compute.internal <none> <none>
The old testing-daemonset pod is gone, but the old testing-app pod is still there, running its preStop hook. At this point, an application pod that relied on those daemonsets would be broken and unable to perform its shutdown tasks properly.
The above was tested with cluster autoscaler version 1.21, using the following command in the deployment:
- command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --namespace=cluster-autoscaler
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/prod3.he0.io
- --balance-similar-node-groups=true
- --expander=least-waste
- --logtostderr=true
- --max-graceful-termination-sec=6000
- --scale-down-delay-after-delete=10m
- --skip-nodes-with-local-storage=false
- --skip-nodes-with-system-pods=false
- --stderrthreshold=info
- --v=4
Is it possible to provide a way for the daemonset evictions to wait until all other pods are gone or in the Completed state?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
Encountered the same issue.
We have pods with preStop hooks that run sleep commands. In our case, we have statefulsets that depend on the aws-ebs-csi daemonset pods to detach and unmount their volumes. When CA scales down nodes, all the pods are evicted, including the ebs-csi-node pods, while our statefulset pods get stuck in the Terminating state since they can't unmount the attached volumes without the ebs-csi-node pod.
From the previous comment I see mention of --daemonset-eviction-for-occupied-nodes=false. We will try it, but as that comment said, graceful shutdown of the daemonsets would be preferable to killing them.
If anyone has solved this issue, feel free to comment here; I would greatly appreciate it.
This can now be solved using the --drain-priority-config flag to evict lower-priority pods first (assuming the daemonsets run at higher priority, which is generally a reasonable setup).
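As I understand the flag, it takes comma-separated priority:terminationGracePeriodSeconds pairs, so a setup along these lines (the priority cutoff and grace periods below are illustrative, assuming the app pods run at the default priority 0 and the daemonsets at system-node-critical, i.e. 2000001000):
- --drain-priority-config=2000001000:120,0:600
With something like that, the default-priority application pods should be drained before the high-priority daemonset pods.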
Thank you for pointing this out.
Hello, I want to understand how to handle the scenario below with CA.
When EKS CA decides to scale down a node (part of a managed node group) that runs daemonsets like fluent-bit (shipping logs from apps) and SignalFx (tracing and metrics), what configuration do I need on CA to make sure those daemonsets are not evicted while apps may still be using them during the scale-down window (within their graceful termination timeouts)?
Is there a config on the CA setup to skip the daemonset eviction and let them run until the node is terminated? I am fine even if these daemonsets are not gracefully stopped, as the apps using them stop gracefully with their own shutdown timeouts.
My current CA configuration (EKS 1.21):
Image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0
Thank you, Balaji