kubernetes-sigs / aws-efs-csi-driver

CSI Driver for Amazon EFS https://aws.amazon.com/efs/
Apache License 2.0

Driver fails to release ports on unmount #281

Closed. spohner closed this issue 3 years ago.

spohner commented 4 years ago

/kind bug

What happened? We noticed that pods were unable to mount EFS PVCs and got stuck in ContainerCreating. The logs showed:

Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

We logged into the node and found with netstat that all 400 ports were occupied by stunnel processes.

The watchdog logs show that it fails to kill the TLS tunnel processes on unmount. These log lines repeat for several PIDs:

2020-11-12 16:12:16,530 - INFO - Unmount grace period expired for fs-6a3285fb.var.lib.kubelet.pods.c3015949-346d-42cf-9594-3be561ca30c8.volumes.kubernetes.io~csi.pvc-7ef93798-9182-469f-b35a-72cd13ecfcac.mount.20402
2020-11-12 16:12:16,530 - INFO - Terminating running TLS tunnel - PID: 2773, group ID: 2773
2020-11-12 16:12:16,530 - INFO - TLS tunnel: 2773 is still running, will retry termination
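
For reference, a rough way to confirm this state on a node; the stunnel process name and the /var/run/efs watchdog state directory are assumptions about a typical efs-utils install and may differ:

# Count stunnel TLS tunnel processes still running on the node
pgrep -c stunnel

# Count listening sockets held by stunnel (these occupy the [20049, 20449] range)
sudo netstat -tlnp | grep -c stunnel

# efs-utils keeps per-mount watchdog state here; entries for mounts that no
# longer exist suggest the watchdog failed to clean up after unmount
ls /var/run/efs | wc -l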

What you expected to happen? Ports are freed upon unmount and pods on all nodes are able to mount EFS PVCs.

How to reproduce it (as minimally and precisely as possible)? Not sure how to reproduce as it seems random.

Anything else we need to know?: We have seen this issue several times; however, it seems random which node fails to release the ports. We see it in our medium-sized cluster maybe once a week, and other nodes keep working fine when it happens. A quick fix is to replace the bad node (sketched below).
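
A sketch of that node-replacement workaround, assuming the node belongs to an Auto Scaling group or managed node group that will replace it (flag names vary slightly across kubectl versions):

# Cordon and drain the affected node so pods reschedule onto healthy nodes
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Then terminate the underlying EC2 instance and let the node group replace it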

Environment

reyntjensw commented 3 years ago

We are experiencing the same issue in an environment where a lot of pod autoscaling is happening.

In our experience it happens every 7 to 10 days. The quick fix is to replace all nodes, but in a production environment that is not behavior we want.

Right after hitting this we opened a support case, but they asked us to update this issue first.

This is the relevant part of the error we are seeing:


Mounting command: mount
Mounting arguments: -t efs -o tls fs-0f780057:/ /var/lib/kubelet/pods/cf5d2e26-9462-451b-8dca-4ea1c988feb9/volumes/kubernetes.io~csi/efs-pv-sessions/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

E0107 09:34:53.755515       1 driver.go:75] GRPC error: rpc error: code = Internal desc = Could not mount "fs-0f780057:/" at "/var/lib/kubelet/pods/cf5d2e26-9462-451b-8dca-4ea1c988feb9/volumes/kubernetes.io~csi/efs-pv-sessions/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o tls fs-0f780057:/ /var/lib/kubelet/pods/cf5d2e26-9462-451b-8dca-4ea1c988feb9/volumes/kubernetes.io~csi/efs-pv-sessions/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

michaelswierszcz commented 3 years ago

Facing a similar issue. The pattern I've identified so far is that when the node reaches high CPU usage, the efs-csi-driver crashes. If this happens 3+ times, the node can no longer mount any EFS PVs.
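
A quick way to check for that crash pattern, assuming a default install where the node DaemonSet pods run in kube-system with the label app=efs-csi-node (adjust the namespace and selector for your deployment):

# Show restart counts for the EFS CSI node pods, grouped by node
kubectl get pods -n kube-system -l app=efs-csi-node \
  -o custom-columns=NODE:.spec.nodeName,POD:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount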

wongma7 commented 3 years ago

/remove-lifecycle rotten

smrutiranjantripathy commented 3 years ago

It is fixed by this PR

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jumping commented 1 year ago

The issue appeared again: "Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf".

Kubernetes: 1.23, aws-efs-csi-driver: v1.3.7, OS: AMI 1.23.16-20230304

usulkies commented 1 year ago

I saw the issue today: Failed to locate an available port in the range [20049, 20449]

This should happen any time more than 401 pods run on the same node and all try to mount an EFS volume. A workaround could be setting the maxPods value on the kubelet, but another approach might be allowing a wider port range through the Helm chart values. Is it possible to let the user configure these two values? https://github.com/aws/efs-utils/blob/62fde08f790a1ab50f25b81f85940bec6f4b92e9/src/mount_efs/__init__.py#L959C50-L959C72
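
Until those values are configurable through the chart, a possible stopgap is widening the range in /etc/amazon/efs/efs-utils.conf on the nodes themselves. A rough sketch; the option name (port_range_upper_bound under the [mount] section) and the 21049 upper bound are assumptions about the efs-utils version on the node, so verify them first and add the option under [mount] if it is not already present:

# Widen the stunnel port range used by efs-utils (run on each node, e.g. via user data)
sudo sed -i 's/^port_range_upper_bound.*/port_range_upper_bound = 21049/' /etc/amazon/efs/efs-utils.conf

# Verify the change took effect
grep port_range /etc/amazon/efs/efs-utils.conf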

balusarakesh commented 1 year ago

This is happening for us too. Is there a workaround for this in AWS EKS?

neoakris commented 3 months ago

Encountered this on v2.0.1 (released ~April 2024), so it might still be a thing. There seem to be related issues (https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues?q=is%3Aissue+ports+is%3Aclosed), so I'll try updating to the latest version, v2.0.7 (as of ~Aug 2024).

JonTheNiceGuy commented 3 months ago

Glad it's not just me @neoakris!

solival commented 1 week ago

Same issue observed. Just now, 2 nodes had consumed all of the >1000 ports, with only 16 connections established and the rest in LISTENING state.
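
For anyone debugging this, a rough way to separate tunnels that are actually carrying traffic from leaked listeners, assuming the default [20049, 20449] range and a process named stunnel (both may differ on your nodes):

# stunnel sockets in the efs-utils port range with an established connection (in use)
sudo ss -tnp state established '( sport >= :20049 and sport <= :20449 )' | grep -c stunnel

# stunnel sockets in the same range that are only listening (likely leaked tunnels)
sudo ss -tlnp '( sport >= :20049 and sport <= :20449 )' | grep -c stunnel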