Startup Taint is sometimes not removed

dpiddock commented 2 months ago

/kind bug

What happened? We use Karpenter as the cluster node autoscaler. We recently enabled the startup taint for the EFS driver. Since then, we have occasionally seen nodes stuck with unschedulable pods. On investigation, the cause is that the efs.csi.aws.com/agent-not-ready:NoExecute taint is still present on the node even though the efs-csi-node pod is running correctly.

The efs-plugin container has this log line:

E0419 06:03:32.149207       1 driver.go:134] "Unexpected failure when attempting to remove node taint(s)" err="the server rejected our request due to an error in our request"

What you expected to happen? efs-csi-node pod to successfully remove the node taint so that other pods can be scheduled.

How to reproduce it (as minimally and precisely as possible)? We don't currently know. It's a rare and intermittent problem.

Anything else we need to know?: I managed to find the failed request in the audit logs. At the same time, node-controller was processing a request to add the node.kubernetes.io/not-ready:NoExecute taint. This could be a classic race condition between the two updates to the node, although this is a sample size of 1.

I have attached the two audit log entries: efs-csi.json node-controller.json

Environment

Kubernetes version (use kubectl version): v1.29.1-eks-b9c9ed7
EKS 1.29
Driver version: v1.7.6
EKS add-on: v1.7.6-eksbuild.2

Please also attach debug logs to help us better diagnose

Instructions to gather debug logs can be found here

results.tgz

seanzatzdev-amazon commented 2 months ago

Hi @dpiddock, thank you for bringing this to our attention. We are working with the author of the following PR to address this issue: https://github.com/kubernetes-sigs/aws-efs-csi-driver/pull/1287

Please let us know if you have any further questions or concerns.

mteodori commented 1 month ago

Is this the same as https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/1273?

dpiddock commented 1 month ago

This is the opposite of #1273. That issue is complaining that the taint is removed too fast, before the service is really ready. This issue is about the startup taint sometimes not being removed because of a race condition.

seanzatzdev-amazon commented 4 weeks ago

I've merged https://github.com/kubernetes-sigs/aws-efs-csi-driver/pull/1287 into mainline to address this.

kubernetes-sigs / aws-efs-csi-driver

Startup Taint is sometimes not removed #1320