kubernetes-sigs / aws-efs-csi-driver

CSI Driver for Amazon EFS https://aws.amazon.com/efs/
Apache License 2.0

Upgrade from Chart 2.4.4 #1372

Open m-parrella opened 3 weeks ago


/kind bug

What happened?

We recently upgraded our EKS cluster to 1.29. We are using Managed Nodes with the amazon-eks-node-1.29-v20240227 AMI and the EFS CSI Driver 1.5.6, deployed by Helm (Chart 2.4.4).

After upgrading the driver from Chart 2.4.4 to Chart 2.4.5 (or higher), deployments using the EFS Storage Class stopped working correctly. The 'df' command stopped responding in both Pods and Nodes. In /var/log/messages on the node, we found the following error message:

Jun 10 15:07:44 ip-XXX-XXX-XXX-XXX kernel: nfs: server 127.0.0.1 not responding, still trying
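
One way to check whether a given node is affected is to count how often that kernel message appears in its log. The following is only a minimal sketch: the function name count_nfs_stalls is hypothetical, and it assumes the log is readable at /var/log/messages as on our nodes.

```shell
# Hypothetical helper: count kernel messages reporting the stalled local NFS endpoint.
count_nfs_stalls() {
  # $1: log file to scan (defaults to /var/log/messages, the file quoted above)
  log_file=${1:-/var/log/messages}
  # grep -c prints the number of matching lines; -F treats the pattern literally.
  # grep exits non-zero when nothing matches, so "|| true" keeps the status clean.
  grep -c -F 'nfs: server 127.0.0.1 not responding' "$log_file" || true
}

# Usage (run on the node):
# count_nfs_stalls /var/log/messages
```

A non-zero count only indicates that the message was logged at some point; it does not by itself prove the mount is still stalled.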

If we move the Pods mounting EFS volumes to a new node, the Pods run as expected.

Comparing both charts, the significant change is the EFS State Directory, as outlined in the CHANGELOG. This leads us to suspect that stunnel cannot resume its existing connections after the upgrade.

{
  "hostPath": {
    "path": "/var/run/efs",
    "type": "DirectoryOrCreate"
  },
  "name": "efs-state-dir"
}

To avoid refreshing the nodes, we have identified two workarounds. The first approach involves patching the DaemonSet to utilize the original path. This can be achieved by executing the following command:

kubectl patch daemonsets -n kube-system efs-csi-node --type json -p='[{"op": "replace", "path": "/spec/template/spec/volumes/3/hostPath/path", "value": "/var/run/efs-csi-driver"}]'
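
The command above assumes the efs-state-dir volume sits at index 3 of the volume list, which may not hold for every chart version. As a hedged sketch, the index could be looked up by name first; the helper names state_dir_index and build_patch are hypothetical, and the namespace and DaemonSet name are taken from the command above.

```shell
NAMESPACE="kube-system"
DAEMONSET="efs-csi-node"
VOLUME_NAME="efs-state-dir"

# Hypothetical helper: print the zero-based index of VOLUME_NAME in the
# DaemonSet's volume list.
state_dir_index() {
  # List one volume name per line, in the order they appear in the spec.
  names=$(kubectl get daemonset "$DAEMONSET" -n "$NAMESPACE" \
    -o jsonpath='{range .spec.template.spec.volumes[*]}{.name}{"\n"}{end}')
  # grep -n numbers lines from 1; JSON Patch array indices start at 0.
  line=$(printf '%s\n' "$names" | grep -n -x -F -- "$VOLUME_NAME" | cut -d: -f1)
  if [ -z "$line" ]; then
    echo "volume $VOLUME_NAME not found" >&2
    return 1
  fi
  echo $((line - 1))
}

# Hypothetical helper: build the JSON Patch body for a volume index ($1)
# and the hostPath to set ($2).
build_patch() {
  printf '[{"op": "replace", "path": "/spec/template/spec/volumes/%s/hostPath/path", "value": "%s"}]' "$1" "$2"
}

# Usage (left commented out so nothing is patched by accident):
# idx=$(state_dir_index) && kubectl patch daemonsets -n "$NAMESPACE" "$DAEMONSET" \
#   --type json -p="$(build_patch "$idx" /var/run/efs-csi-driver)"
```

Looking the volume up by name keeps the patch from silently rewriting a different volume if the list order changes.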

The second approach is to create a symbolic link prior to the upgrade:

[root@ip-XXX-XXX-XXX-XXX /]# ln -s /var/run/efs-csi-driver /var/run/efs
[root@ip-XXX-XXX-XXX-XXX /]# ls -ld /var/run/efs /var/run/efs-csi-driver
lrwxrwxrwx 1 root root  23 Jun 10 18:15 /var/run/efs -> /var/run/efs-csi-driver
drwxr-xr-x 4 root root 160 Jun 10 18:21 /var/run/efs-csi-driver
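
If the link has to be created on several nodes, a guarded version of the same step may be safer. This is a sketch only: the function name link_state_dir is hypothetical, and it assumes it runs as root on the node before the upgrade.

```shell
# Hypothetical helper: create the symlink only when it is safe to do so.
# $1: existing state directory (e.g. /var/run/efs-csi-driver)
# $2: path the new chart expects (e.g. /var/run/efs)
link_state_dir() {
  old_dir=$1
  new_path=$2
  # Refuse to link to a directory that does not exist.
  if [ ! -d "$old_dir" ]; then
    echo "$old_dir is not a directory" >&2
    return 1
  fi
  # Leave an existing file, directory, or link untouched.
  if [ -e "$new_path" ] || [ -L "$new_path" ]; then
    echo "$new_path already exists; leaving it untouched" >&2
    return 0
  fi
  ln -s "$old_dir" "$new_path"
}

# Usage (as root on the node, before upgrading the chart):
# link_state_dir /var/run/efs-csi-driver /var/run/efs
```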

Is this the expected behavior? Thanks in advance!

What you expected to happen?

Container volumes should remain operational after the upgrade.

How to reproduce it (as minimally and precisely as possible)?

Upgrade from Chart 2.4.4 to Chart 2.4.5 or higher using Helmfile.