Missing efs.csi.aws.com-reg.sock file on EKS Node.

ryanhockstad commented 7 months ago

/kind bug

What happened? When deploying the aws-efs-csi-driver helm chart, as the efs-csi-node daemonset spins up, certain pods get stuck in a CrashLoopBackOff state. The logs for the efs-plugin container look normal:

I1201 20:11:53.735781       1 config_dir.go:88] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I1201 20:11:53.736836       1 metadata.go:63] getting MetadataService...
I1201 20:11:53.738274       1 metadata.go:68] retrieving metadata from EC2 metadata service
I1201 20:11:53.831426       1 driver.go:140] Did not find any input tags.
I1201 20:11:53.831724       1 driver.go:113] Registering Node Server
I1201 20:11:53.831742       1 driver.go:115] Registering Controller Server
I1201 20:11:53.831752       1 driver.go:118] Starting efs-utils watchdog
I1201 20:11:53.831846       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.conf since it exists already
I1201 20:11:53.831860       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.crt since it exists already
I1201 20:11:53.832163       1 driver.go:124] Starting reaper
I1201 20:11:53.832182       1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}

But the logs for the csi-driver-registrar container just show /usr/bin/csi-node-driver-registrar: error while loading shared libraries: libdl.so.2: cannot open shared object file: No such file or directory

Likewise, the logs for the liveness-probe are just: /usr/bin/livenessprobe: error while loading shared libraries: libdl.so.2: cannot open shared object file: No such file or directory
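Both sidecar messages above have the form the dynamic loader prints when a binary cannot find a shared library, so the failure happens before either sidecar runs any CSI logic. A minimal Python sketch for picking such lines out of collected container logs follows; the function name and regex are illustrative and are not part of the driver or the sidecars.

```python
import re

# Matches the dynamic loader's "cannot open shared object file" message,
# in the form quoted in the report above.
LOADER_ERROR = re.compile(
    r"^(?P<binary>\S+): error while loading shared libraries: "
    r"(?P<library>\S+): cannot open shared object file"
)


def missing_shared_library(log_line):
    """Return (binary, library) if the line is a loader failure, else None."""
    match = LOADER_ERROR.match(log_line.strip())
    if match is None:
        return None
    return match.group("binary"), match.group("library")
```

Normal klog lines such as the efs-plugin output return None, so the helper can be run over mixed logs to separate loader failures from driver messages.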

Looking at the nodes the failing pods are running on, I've discovered that they do not have the /var/lib/kubelet/plugins_registry/efs.csi.aws.com-reg.sock file.

The pods in the daemonset that do spin up properly do have this file. I'm unsure why this file is missing on some nodes, and I don't know how to configure the helm chart to ensure that this file gets created.
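For anyone comparing healthy and failing nodes, the check described above can be scripted. This is a rough Python sketch meant to be run on the node itself (likely as root); the path is the one from this report, and the function name and return values are made up for illustration.

```python
import os
import stat

# Path taken from the report above; adjust if kubelet uses a different root dir.
REG_SOCKET = "/var/lib/kubelet/plugins_registry/efs.csi.aws.com-reg.sock"


def socket_status(path=REG_SOCKET):
    """Classify the path as 'missing', 'socket', or 'not-a-socket'."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    return "socket" if stat.S_ISSOCK(mode) else "not-a-socket"


if __name__ == "__main__":
    print(f"{REG_SOCKET}: {socket_status()}")
```

Distinguishing 'missing' from 'not-a-socket' helps tell a node where registration never happened apart from one where something else was left at that path.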

What you expected to happen? I expect all of the pods in efs-csi-node daemonset to spin up properly.

How to reproduce it (as minimally and precisely as possible)? This is unpredictable. I can fix the issue by destroying a node, and when a new node spins up, the /var/lib/kubelet/plugins_registry/efs.csi.aws.com-reg.sock file exists and the pods work as expected.
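Since the workaround is to replace the affected node, it may help to list which nodes currently host crash-looping containers. The sketch below assumes the input is the parsed JSON from kubectl get pods -o json for the namespace where the chart is installed; the function name is hypothetical.

```python
def nodes_with_crashlooping_pods(pod_list):
    """Map node name -> sorted 'pod/container' names in CrashLoopBackOff."""
    affected = {}
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        pod_name = pod.get("metadata", {}).get("name", "<unknown>")
        for status in pod.get("status", {}).get("containerStatuses", []):
            # A container in back-off reports state.waiting.reason.
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                name = status.get("name", "<unknown>")
                affected.setdefault(node, []).append(f"{pod_name}/{name}")
    return {node: sorted(names) for node, names in affected.items()}
```

To use it, load the kubectl output with json.load and pass the resulting dict in; the returned node names are the candidates for the socket check and, if needed, for replacement.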

Anything else we need to know?:

Environment

Kubernetes version (use kubectl version): v1.27.7-eks-4f4795d
Driver version: 2.5.0

michaelajr commented 5 months ago

Bumping this. Any info would be very helpful. Seeing this a lot.

sasanknvs commented 4 months ago

In my case, pods are not initializing because of a FailedMount event. When I connected to the node and checked /var/lib/plugins_registry/, it did not have the "efs.csi.aws.com-reg.sock" file. I checked the logs of "CSI Driver Registrar" and they look normal. A brief note about the EKS cluster: I have 1 static worker node, and a 2nd node is created dynamically, where the efs-csi-node daemonset does the required setup.

Also, if all the workloads on the static worker node are removed and a new node is then created dynamically, the file "efs.csi.aws.com-reg.sock" is created properly and the volume is mounted successfully.

I have the same setup in a different cluster, and it works fine there.

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

kubernetes-sigs / aws-efs-csi-driver #1205