Closed dpiddock closed 4 weeks ago
Hi @dpiddock, thank you for bringing this to our attention. We are working together with the author of the following PR to address this issue: https://github.com/kubernetes-sigs/aws-efs-csi-driver/pull/1287
Please let us know if you have any further questions or concerns.
Is this the same as https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/1273?
This is the opposite of #1273. That issue is complaining that the taint is removed too fast, before the service is really ready. This issue is about the startup taint sometimes not being removed because of a race condition.
I've merged https://github.com/kubernetes-sigs/aws-efs-csi-driver/pull/1287 into mainline to address this.
/kind bug
What happened?
We use Karpenter as the cluster node autoscaler. We recently implemented the startup taint for the EFS driver. Since then we have occasionally seen nodes stuck with unschedulable pods. Upon investigation, this is caused by the efs.csi.aws.com/agent-not-ready:NoExecute taint still being present on the node despite the efs-csi-node pod running correctly.
The efs-plugin container has this log line:

What you expected to happen?
The efs-csi-node pod to successfully remove the node taint so that other pods can be scheduled.

How to reproduce it (as minimally and precisely as possible)?
We don't currently know. It's a rare and intermittent problem.

Anything else we need to know?
I managed to find the failed request in the audit logs. At the same time, a request from node-controller to add the node.kubernetes.io/not-ready:NoExecute taint was being processed. This could be a classic race condition with other parts of the system, although this is a sample size of 1.
I attach the two entries: efs-csi.json node-controller.json
Environment
Kubernetes version (use kubectl version): v1.29.1-eks-b9c9ed7
Driver version: v1.7.6
v1.7.6-eksbuild.2
Please also attach debug logs to help us better diagnose
results.tgz