ricardov1 closed this issue 2 weeks ago
Hello @ricardov1, it appears you're using version 1.7.5, which contains a known issue where taints aren't properly removed. This problem was addressed in a PR that was merged into version 2.0.6. Would you mind upgrading to the most recent version and seeing if that resolves your issue?
@mskanth972 thanks! I'll try upgrading. Has a similar fix been implemented for the ebs-csi-driver? I've also seen this behavior there
It should be; there have been many taint-related fixes in the latest versions. https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/CHANGELOG.md
Closing the issue; please feel free to reopen if you are still seeing the issue.
This morning we upgraded both the EBS and EFS helm charts in all our clusters to the latest versions. A couple of hours later there are still nodes that come up and are stuck with either the ebs or the efs CSI taint.
/reopen
@mamoit: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/kind bug
What happened? When spinning up an instance in a cluster where both the ebs-csi-driver and the efs-csi-driver are deployed, occasionally (about 10% of the time) one of the two taints is not removed, and the node is unable to accept any pods due to the lingering taint.
What you expected to happen? After the daemonset starts successfully, the taint should always be removed.
How to reproduce it (as minimally and precisely as possible)? This appears to be caused by a race condition between the efs and ebs daemonsets (described below), so it is difficult to reproduce. The best way to increase the likelihood of reproducing the issue, rather than repeatedly launching and relaunching a single instance, is to spin up many (50+) instances where both daemonsets run simultaneously on instance startup. If you hit the race condition on one of those nodes, the node will still carry either the efs or the ebs taint, and the driver pod corresponding to the lingering taint will be missing the `Removed taint(s) from local node` log.

Anything else we need to know?: This issue appears to be caused by a race condition in which the daemonset incorrectly determines that the `efs.csi.aws.com/agent-not-ready` taint has been removed. This block fetches the taints on the node and compares the length of the node's taint list against the length of `taintsToRemove` to decide whether the taint-removal step is still needed. When we run into this issue, we suspect that the ebs-csi-driver "beats" the efs-csi-driver to the taint removal, so when the lengths of the slices are compared they come out equal, even though it was the `ebs.csi.aws.com/agent-not-ready` taint that was removed, not the `efs.csi.aws.com/agent-not-ready` taint.

Related issue on the ebs-csi-driver repo: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/2199
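To make the suspected race concrete, here is a minimal Go sketch (not the actual driver code; the `Taint` struct and both helpers are simplified stand-ins) contrasting a length-based success check with a key-based one. When the EBS driver removes its own taint first, the length comparison produces a false positive for the EFS driver, while checking by taint key does not:

```go
package main

import "fmt"

// Taint mirrors the relevant fields of corev1.Taint (simplified for illustration).
type Taint struct {
	Key    string
	Effect string
}

// lengthBasedCheck reproduces the suspected buggy logic: it declares success
// when the node lost at least as many taints as we asked to remove,
// regardless of WHICH taints were actually removed.
func lengthBasedCheck(before, after, toRemove []Taint) bool {
	return len(before)-len(after) >= len(toRemove)
}

// keyBasedCheck instead verifies that every taint we intended to remove is
// actually absent from the node's current taint list.
func keyBasedCheck(after, toRemove []Taint) bool {
	present := map[string]bool{}
	for _, t := range after {
		present[t.Key] = true
	}
	for _, t := range toRemove {
		if present[t.Key] {
			return false
		}
	}
	return true
}

func main() {
	efs := Taint{Key: "efs.csi.aws.com/agent-not-ready", Effect: "NoExecute"}
	ebs := Taint{Key: "ebs.csi.aws.com/agent-not-ready", Effect: "NoExecute"}

	before := []Taint{efs, ebs}
	// The EBS driver wins the race and removes ITS taint first.
	after := []Taint{efs}
	// The EFS driver only intended to remove its own taint.
	toRemove := []Taint{efs}

	// Length check: one taint gone, one requested, looks "done" (false positive).
	fmt.Println(lengthBasedCheck(before, after, toRemove))
	// Key check: the efs taint is still on the node, so removal must proceed.
	fmt.Println(keyBasedCheck(after, toRemove))
}
```

Under this scenario the length-based check reports success while the key-based check correctly reports that the efs taint still needs removing, which matches the behavior we observe on the stuck nodes.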
Environment
- Kubernetes version (`kubectl version`): 1.29.8

Please also attach debug logs to help us better diagnose.