Taint removal race condition

ricardov1 commented 3 weeks ago

/kind bug

What happened? When spinning up an instance in a cluster where both the ebs-csi-driver and the efs-csi-driver are deployed, occasionally (about 10% of the time) one of the two taints is not removed, and the node is unable to accept any pods due to the lingering taint.

What you expected to happen? After successful daemonset startup, the taint should always be removed.

How to reproduce it (as minimally and precisely as possible)? This appears to be caused by a race condition between the efs and ebs daemonsets (described below), so it is difficult to reproduce. The best way to increase the likelihood of reproducing the issue without repeatedly launching and relaunching a single instance is to spin up many (50+) instances where both daemonsets run simultaneously on instance startup. If you encounter the race condition on one of those nodes, the node will still carry either the efs or the ebs taint, and the driver pod corresponding to the lingering taint will be missing the "Removed taint(s) from local node" log line.
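When launching many nodes at once, it helps to scan for the affected ones automatically. The following is a minimal sketch, not part of either driver: it parses JSON in the shape produced by `kubectl get nodes -o json` and reports nodes that still carry a `*.csi.aws.com/agent-not-ready` taint. The function name `lingeringTaints` and the sample node names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// nodeList mirrors only the fields of `kubectl get nodes -o json` used here.
type nodeList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			Taints []struct {
				Key string `json:"key"`
			} `json:"taints"`
		} `json:"spec"`
	} `json:"items"`
}

// lingeringTaints (hypothetical helper) returns, per node name, the
// CSI agent-not-ready taint keys that are still present on that node.
func lingeringTaints(data []byte) (map[string][]string, error) {
	var list nodeList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	found := make(map[string][]string)
	for _, n := range list.Items {
		for _, t := range n.Spec.Taints {
			if strings.HasSuffix(t.Key, ".csi.aws.com/agent-not-ready") {
				found[n.Metadata.Name] = append(found[n.Metadata.Name], t.Key)
			}
		}
	}
	return found, nil
}

// sample is made-up data for illustration; node-a is stuck, node-b is clean.
const sample = `{"items":[
  {"metadata":{"name":"node-a"},"spec":{"taints":[{"key":"efs.csi.aws.com/agent-not-ready"}]}},
  {"metadata":{"name":"node-b"},"spec":{}}
]}`

func main() {
	found, err := lingeringTaints([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for name, keys := range found {
		fmt.Println(name, keys)
	}
}
```

To use it against a real cluster, replace `sample` with the saved output of `kubectl get nodes -o json`. Any node listed after both daemonset pods report ready is a candidate for the race.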

Anything else we need to know?: This issue appears to be caused by a race condition where the daemonset incorrectly determines that the efs.csi.aws.com/agent-not-ready taint has been removed. This block fetches the taints on the node and uses the length of the list of taints on the node and the length of taintsToRemove to determine whether the step of removing the taint is needed. When we run into this issue, we suspect that the ebs-csi-driver "beats" the efs-csi-driver to the taint removal and when the lengths of the slices are compared, they are equal even though it was the ebs.csi.aws.com/agent-not-ready taint that was removed, not the efs.csi.aws.com/agent-not-ready taint.
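To make the suspected failure mode concrete, here is a simplified model of it. This is a sketch of the hypothesis above and not the driver's actual code: the `Taint` type stands in for the Kubernetes taint object, and `skipByLength` and `skipByKey` are hypothetical names. It assumes the list of taints to keep is computed from one read of the node and then compared, by length, against a later read in which the other driver has already removed its own taint.

```go
package main

import "fmt"

// Taint is a simplified stand-in for a Kubernetes node taint (assumption:
// only the key matters for this illustration).
type Taint struct {
	Key string
}

const (
	efsTaint = "efs.csi.aws.com/agent-not-ready"
	ebsTaint = "ebs.csi.aws.com/agent-not-ready"
)

// withoutKey returns a copy of taints with every entry matching key dropped.
func withoutKey(taints []Taint, key string) []Taint {
	kept := make([]Taint, 0, len(taints))
	for _, t := range taints {
		if t.Key != key {
			kept = append(kept, t)
		}
	}
	return kept
}

// skipByLength models the suspected check: removal is skipped when the
// number of taints to keep equals the number currently on the node.
func skipByLength(taintsToKeep, current []Taint) bool {
	return len(taintsToKeep) == len(current)
}

// skipByKey is a possible alternative: removal is skipped only when the
// driver's own taint key is absent from the current list.
func skipByKey(current []Taint, key string) bool {
	for _, t := range current {
		if t.Key == key {
			return false
		}
	}
	return true
}

func main() {
	// First read by the EFS driver: both taints are present.
	firstRead := []Taint{{Key: efsTaint}, {Key: ebsTaint}}
	taintsToKeep := withoutKey(firstRead, efsTaint) // only the EBS taint

	// Before the second read, the EBS driver removes its own taint.
	secondRead := withoutKey(firstRead, ebsTaint) // only the EFS taint

	// Both slices have length 1, so the length check wrongly skips removal.
	fmt.Println("length check skips removal:", skipByLength(taintsToKeep, secondRead))
	// The key check sees the EFS taint is still present and does not skip.
	fmt.Println("key check skips removal:", skipByKey(secondRead, efsTaint))
}
```

In this model the length comparison reports that nothing needs removing even though the EFS taint is still on the node, while a comparison by key does not. Whether the real code path interleaves this way would need to be confirmed against the driver source.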

related issue on ebs-csi-driver repo: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/2199

Environment

Kubernetes version (use kubectl version): 1.29.8
Driver version: 1.7.5

Please also attach debug logs to help us better diagnose

Instructions to gather debug logs can be found here

mskanth972 commented 2 weeks ago

Hello @ricardov1, it appears you're using version 1.7.5, which contains a known issue where taints aren't properly removed. This problem was addressed in a PR that was merged into version 2.0.6. Would you mind upgrading to the most recent version and seeing if that resolves your issue?

ricardov1 commented 2 weeks ago

@mskanth972 thanks! I'll try upgrading. Has a similar fix been implemented for the ebs-csi-driver? I've also seen this behavior there

mskanth972 commented 2 weeks ago

It should be; they have made many taint-related fixes in recent versions. See https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/CHANGELOG.md

mskanth972 commented 2 weeks ago

Closing the issue. Please feel free to reopen if you are still seeing the issue.

mamoit commented 2 weeks ago

This morning we upgraded both the EBS and EFS helm charts in all our clusters to the latest versions. A couple of hours later there are still nodes that come up and are stuck with either the ebs or the efs CSI taint.

mamoit commented 1 week ago

/reopen

k8s-ci-robot commented 1 week ago

@mamoit: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/1491#issuecomment-2467929412):

>/reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
