kubernetes / autoscaler

Autoscaling components for Kubernetes

Daemonset Eviction during Scale down #4337

Closed rangarb885 closed 2 years ago

rangarb885 commented 3 years ago

Hello: I want to understand how to handle the scenarios below with CA.

  1. When the EKS CA decides to scale down a node (part of a managed node group) that runs daemonsets like fluent-bit (shipping logs from apps) and SignalFx (tracing and metrics), what configuration do I need on CA to make sure those daemonsets are not evicted, since apps may still be using them during the scale-down (within their graceful termination window)?

  2. Is there a config in the CA setup to skip daemonset eviction and allow the daemonsets to run until the node is terminated? I am fine even if these daemonsets are not gracefully stopped, as the apps using them stop gracefully (with their own graceful shutdown timeouts).

My current CA configuration (EKS 1.21):

Image : k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0

            "./cluster-autoscaler",
            "--v=2",
            "--stderrthreshold=2",
            "--cloud-provider=aws",
            "--scan-interval=10s",
            "--skip-nodes-with-local-storage=false",
            "--aws-use-static-instance-list=true",
            "--expander=least-waste",
            "--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/${var.cluster_name}"

Thank you Balaji

jim-barber-he commented 3 years ago

Hi.

If I'm right, I believe setting the new --daemonset-eviction-for-occupied-nodes=false parameter introduced in version 1.22 of the cluster-autoscaler might handle the scenario where the daemonsets aren't evicted at all: they won't be gracefully stopped, but will simply be killed when the node terminates.
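For anyone wanting to try it, a minimal sketch of how that flag could be slotted into the cluster-autoscaler container spec (only the new flag is the point here; the image tag and other flags are illustrative):

      - command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --expander=least-waste
        # Don't evict daemonset pods from nodes being scaled down; they keep
        # running and are killed (not gracefully stopped) when the node terminates.
        - --daemonset-eviction-for-occupied-nodes=false
        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.22.2   # any 1.22+ tag
        name: cluster-autoscaler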

I'm adding to this case because it is similar enough to the request I was about to make. I would like a way to have the daemonsets stop gracefully, but only after all the "normal" pods have completed. As far as I can tell, there is currently no way to do that.

We have application deployments with a preStop lifecycle hook to make sure they complete their work before they shut down. We also have supporting daemonsets such as node-local-dns, kiam, and fluent-bit providing DNS, AWS IAM access, and logging services for the application pods. However, when the cluster autoscaler chooses a node to scale in, the daemonsets are terminated before the application pods have finished running, resulting in various errors (such as not being able to resolve hostnames).

In order to prove that the daemonsets were being terminated too early I set up a test. I created a dedicated instance group and then deployed a test app (with a long preStop hook), a balloon deployment, and a test daemonset, sketched below.
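Roughly, the test setup looked something like this (a simplified, hypothetical sketch; the images, resource requests, and timings are illustrative, and the node selectors / tolerations for the dedicated instance group are omitted):

    # testing-app: pods do their shutdown work in a long preStop hook.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: testing-app
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: testing-app
      template:
        metadata:
          labels:
            app: testing-app
        spec:
          terminationGracePeriodSeconds: 600
          containers:
          - name: app
            image: busybox                        # illustrative
            command: ["sleep", "86400"]
            lifecycle:
              preStop:
                exec:
                  command: ["sleep", "300"]       # stand-in for real shutdown work
    ---
    # testing-balloon: placeholder pods that pad out node capacity; scaling this
    # down frees enough room for cluster-autoscaler to remove a node.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: testing-balloon
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: testing-balloon
      template:
        metadata:
          labels:
            app: testing-balloon
        spec:
          containers:
          - name: balloon
            image: registry.k8s.io/pause:3.9
            resources:
              requests:
                cpu: "1"                          # illustrative sizing
                memory: 1Gi
    ---
    # testing-daemonset: stands in for supporting daemonsets such as fluent-bit.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: testing-daemonset
    spec:
      selector:
        matchLabels:
          app: testing-daemonset
      template:
        metadata:
          labels:
            app: testing-daemonset
        spec:
          containers:
          - name: ds
            image: busybox                        # illustrative
            command: ["sleep", "86400"]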

After deploying these, the pods looked like this:

$ kubectl get pods -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP              NODE                                               NOMINATED NODE   READINESS GATES
testing-app-75546f8c56-xfdl8       1/1     Running   0          14s   10.194.49.71    ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-w6sb8   1/1     Running   0          30s   10.194.39.197   ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-zrd29   1/1     Running   0          30s   10.194.37.62    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-czfbr            1/1     Running   0          17m   10.194.49.193   ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-f84d9            1/1     Running   0          97s   10.194.47.17    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>

I then edited the testing-balloon deployment so that there is only 1 replica.
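For example, with something like:

    kubectl scale deployment testing-balloon --replicas=1

After that the pods looked like this: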

NAME                               READY   STATUS    RESTARTS   AGE   IP              NODE                                               NOMINATED NODE   READINESS GATES
testing-app-75546f8c56-xfdl8       1/1     Running   0          26m   10.194.49.71    ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-zrd29   1/1     Running   0          26m   10.194.37.62    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-czfbr            1/1     Running   0          43m   10.194.49.193   ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-f84d9            1/1     Running   0          27m   10.194.47.17    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>

I then waited for the cluster autoscaler to start a scale-in and caught it at this point:

NAME                               READY   STATUS        RESTARTS   AGE   IP              NODE                                               NOMINATED NODE   READINESS GATES
testing-app-75546f8c56-dbmst       1/1     Running       0          30s   10.194.39.197   ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-app-75546f8c56-xfdl8       1/1     Terminating   0          28m   10.194.49.71    ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-zrd29   1/1     Running       0          28m   10.194.37.62    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-czfbr            1/1     Terminating   0          45m   10.194.49.193   ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-f84d9            1/1     Running       0          29m   10.194.47.17    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>

Here you can see a new testing-app pod has started on the same node where the testing-balloon pod is running. The old testing-app pod is Terminating, but the testing-daemonset pod is Terminating as well. So at this point the cluster-autoscaler has evicted both the "normal" pods and the daemonsets.

Then a bit later I see this:

NAME                               READY   STATUS        RESTARTS   AGE   IP              NODE                                               NOMINATED NODE   READINESS GATES
testing-app-75546f8c56-dbmst       1/1     Running       0          39s   10.194.39.197   ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-app-75546f8c56-xfdl8       1/1     Terminating   0          28m   10.194.49.71    ip-10-194-40-113.ap-southeast-2.compute.internal   <none>           <none>
testing-balloon-7b5466b9b4-zrd29   1/1     Running       0          28m   10.194.37.62    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>
testing-daemonset-f84d9            1/1     Running       0          30m   10.194.47.17    ip-10-194-54-148.ap-southeast-2.compute.internal   <none>           <none>

The old testing-daemonset pod is gone, but the old testing-app pod is still there running its preStop hook. At this point, if it were an application pod that relied on those daemonsets, it would be broken and unable to perform its shutdown tasks properly, causing problems.

The above was tested with cluster autoscaler version 1.21 using the following command in the deployment:

      - command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --namespace=cluster-autoscaler
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/prod3.he0.io
        - --balance-similar-node-groups=true
        - --expander=least-waste
        - --logtostderr=true
        - --max-graceful-termination-sec=6000
        - --scale-down-delay-after-delete=10m
        - --skip-nodes-with-local-storage=false
        - --skip-nodes-with-system-pods=false
        - --stderrthreshold=info
        - --v=4

Is it possible to provide a way for the daemonset evictions to wait until all other pods are gone or in the Completed state?

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Close this issue or PR with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue or PR with `/reopen`
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/4337#issuecomment-1053990483):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

RicardsRikmanis commented 2 years ago

Encountered the same issue.

We have pods that have preStop hooks with sleep commands. In our case, we have statefulsets that depend on the aws-ebs-csi daemonset pods to detach and unmount their volumes.

When CA scales down a node, all the pods are evicted, including the ebs-csi-node pods, and our statefulset pods get stuck in the Terminating state since they can't unmount the attached volumes without the ebs-csi-node pod.

From the previous comment I see mention of --daemonset-eviction-for-occupied-nodes=false. We will try it, but as that comment said, a graceful shutdown of the daemonsets would be preferable to killing them.

If anyone has solved this issue, feel free to comment here, I would greatly appreciate it.

x13n commented 4 months ago

This can now be solved using the --drain-priority-config flag to evict lower-priority pods first (assuming the daemonsets have a higher priority, which is generally a reasonable setup).
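For example, something along these lines (as I understand it the value is a comma-separated list of priority:terminationGracePeriodSeconds pairs; check the flag documentation for the CA version you run, and treat the cutoffs and grace periods below as illustrative):

      # Illustrative only: three priority groups with their own grace periods.
      # Lower-priority groups are drained before higher-priority ones, so
      # daemonsets running at a high priority end up drained last.
      - --drain-priority-config=2000000000:60,1000:180,0:600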

jim-barber-he commented 4 months ago

> This can now be solved using the --drain-priority-config flag to evict lower-priority pods first (assuming the daemonsets have a higher priority, which is generally a reasonable setup).

Thank you for pointing this out.