Closed: RyanStan closed this issue 4 weeks ago
We are also seeing this issue when upgrading from 1.6.0 to 1.7.5
We are also seeing this issue when upgrading from 1.6.0 to 1.7.2. Any resolution for this?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
We were able to resolve this by releasing EFS CSI v1.7.2 with the DaemonSet's updateStrategy type set to OnDelete (to avoid restarting the EFS CSI v1.6.0 DaemonSet Pods), and then rotating all the nodes in the cluster so that the new nodes run the v1.7.2 EFS CSI DaemonSet.
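The rollout approach described above can be sketched as a partial DaemonSet manifest. The name and namespace below are assumptions based on common EFS CSI deployments; match them to what is actually running in your cluster:

```yaml
# Sketch: set the EFS CSI node DaemonSet's update strategy to OnDelete so that
# existing v1.6.0 Pods are NOT restarted when the new version is released.
# Pods only pick up the new template once deleted, e.g. via node rotation.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-csi-node        # assumed name; check your cluster
  namespace: kube-system    # assumed namespace
spec:
  updateStrategy:
    type: OnDelete
  # ...rest of the DaemonSet spec unchanged...
```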
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/kind bug
Issue discovered with v1.6.0 of the aws-efs-csi-driver
When the EFS client mounts a file system, we redirect a local NFS mount from the Linux kernel to localhost, and then use a proxy process, stunnel, to receive the NFS traffic and forward it to EFS. The stunnel process runs in the efs-csi-node Pods.
Version v1.6.0 of the CSI driver switched `hostNetwork=true` to `hostNetwork=false`. This means that Pods in the efs-csi-node DaemonSet launch into a new network namespace whenever they are restarted, which causes an issue: any time these Pods restart, stunnel launches in the new network namespace, while the local NFS mount from the kernel to localhost remains in the previous network namespace. The mount then hangs, because the localhost NFS mount can no longer reach the stunnel process once the Pod has restarted. When mounts hang, they go into uninterruptible sleep.

The issue was resolved in v1.7.0 of the driver, where we reverted the `hostNetwork` change and set `hostNetwork=true` again. Thus, this issue only affects customers that established mounts while using v1.6.0 of the CSI driver.

Work-arounds

Any attempt to upgrade or restart the v1.6.0 efs-csi-node DaemonSet will result in EFS mounts on the node hanging.
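One quick way to check whether a node is already affected is to look for processes stuck in uninterruptible sleep (state D), the symptom described above. This is a generic diagnostic sketch, not part of the driver:

```shell
# List processes in uninterruptible sleep (state "D"), the state that tasks
# blocked on a hung NFS mount end up in. Run this on the affected node.
ps -eo pid,stat,comm --no-headers | awk '$2 ~ /^D/ {print}'
```

An empty result means no task on the node is currently blocked in D state.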
To work around this issue, you can launch new EKS nodes into your cluster and then deploy a new efs-csi-node DaemonSet, with `hostNetwork=true`, that targets these new nodes using Kubernetes selectors. A rolling migration of your application to these new nodes will allow you to upgrade to a new aws-efs-csi-driver version while ensuring that your application doesn't experience any downtime due to hanging mounts.

This issue was originally discovered here, but I'm making this post to raise visibility.
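The migration setup could look roughly like the sketch below. The DaemonSet name, node label, and image tag are all illustrative assumptions, not taken from the driver's actual manifests:

```yaml
# Sketch: a replacement efs-csi-node DaemonSet that runs with hostNetwork=true
# and only schedules onto freshly launched nodes carrying an illustrative label.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-csi-node-v2      # hypothetical name, to avoid clashing with the old DaemonSet
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: efs-csi-node-v2
  template:
    metadata:
      labels:
        app: efs-csi-node-v2
    spec:
      hostNetwork: true        # revert to host networking, as in v1.7.0+
      nodeSelector:
        efs-upgrade: "new"     # hypothetical label applied to the new EKS nodes
      containers:
        - name: efs-plugin
          image: amazon/aws-efs-csi-driver:v1.7.5   # example tag from this thread
```

Once your workloads have been drained onto the labeled nodes, the old nodes (and the old DaemonSet) can be removed.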