We are experiencing the same issue in an environment where a lot of pod autoscaling is happening.
From our experience it happens every 7 to 10 days. The quick fix is to replace all nodes, but in a production environment this is not behavior we want.
Right after hitting this we opened an AWS support case, but they asked us to update this issue first.
This is the relevant part of the error we are seeing:
Mounting command: mount
Mounting arguments: -t efs -o tls fs-0f780057:/ /var/lib/kubelet/pods/cf5d2e26-9462-451b-8dca-4ea1c988feb9/volumes/kubernetes.io~csi/efs-pv-sessions/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
E0107 09:34:53.755515 1 driver.go:75] GRPC error: rpc error: code = Internal desc = Could not mount "fs-0f780057:/" at "/var/lib/kubelet/pods/cf5d2e26-9462-451b-8dca-4ea1c988feb9/volumes/kubernetes.io~csi/efs-pv-sessions/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o tls fs-0f780057:/ /var/lib/kubelet/pods/cf5d2e26-9462-451b-8dca-4ea1c988feb9/volumes/kubernetes.io~csi/efs-pv-sessions/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
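For reference, that port range is read from /etc/amazon/efs/efs-utils.conf, so one mitigation is to widen it. A minimal sketch, assuming the key names below match the installed efs-utils version (verify them in the file first; with the CSI driver the mount helper may run inside the efs-plugin container, so the file may live there rather than on the host):
# Sketch: widen the TLS/stunnel port range used by efs-utils.
# Key names are assumed from the efs-utils default config; check your copy of
# /etc/amazon/efs/efs-utils.conf before applying.
sudo sed -i \
  -e 's/^port_range_lower_bound = .*/port_range_lower_bound = 20049/' \
  -e 's/^port_range_upper_bound = .*/port_range_upper_bound = 21449/' \
  /etc/amazon/efs/efs-utils.conf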
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Facing a similar issue. The pattern I've identified so far is that when the node reaches high CPU usage, the efs-csi-driver crashes; if this happens 3+ times, the node can no longer mount any EFS PVs.
/remove-lifecycle rotten
It is fixed by this PR
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The issue appeared again: "Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf".
Kubernetes: 1.23, aws-efs-csi-driver: v1.3.7, OS: AMI 1.23.16-20230304
I saw the issue today: Failed to locate an available port in the range [20049, 20449]
This should happen any time more than 401 pods on the same node all try to mount an EFS volume, since each TLS mount consumes one local port from that range. A workaround could be setting the maxPods value on the kubelet (a sketch follows below), but another approach might be allowing a wider range through the Helm chart values. Is it possible to let the user set these two bounds? https://github.com/aws/efs-utils/blob/62fde08f790a1ab50f25b81f85940bec6f4b92e9/src/mount_efs/__init__.py#L959C50-L959C72
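For the maxPods side of that workaround, a minimal sketch using the EKS AMI bootstrap script; the cluster name is a placeholder and --max-pods is a standard kubelet flag, so adapt this to however your nodes configure the kubelet:
# Sketch: cap pods per node below the ~400 stunnel ports available by default.
# "my-cluster" is a placeholder cluster name.
/etc/eks/bootstrap.sh my-cluster --kubelet-extra-args '--max-pods=100'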
This is happening for us too; is there a workaround for this in AWS EKS?
Encountered on v2.0.1 (released ~April 2024), so might still be a thing. It seems there are related issues https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues?q=is%3Aissue+ports+is%3Aclosed so I'll try updating to the latest version v2.0.7 (as of ~Aug, 2024)
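If the driver was installed with Helm, upgrading to the latest chart is roughly this; the release name and namespace here are assumptions, so adjust to your install:
# Sketch: upgrade the aws-efs-csi-driver Helm release (names assumed).
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update
helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver --namespace kube-system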
Glad it's not just me @neoakris!
Same issue observed. Just now two nodes had consumed all of their >1000 ports with only 16 connections established and the rest stuck in the LISTENING state.
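A rough way to see that split on an affected node, as a sketch (the bounds below are the defaults from the error message; adjust if you widened the range):
# Sketch: count sockets by state within the efs-utils TLS port range.
sudo ss -tan '( sport >= :20049 and sport <= :20449 )' | awk 'NR>1 {print $1}' | sort | uniq -c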
/kind bug
What happened? We noticed that pods were unable to mount EFS PVCs and got stuck in ContainerCreating. The logs showed:
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
We logged into the node and found with netstat that all 400 ports were occupied by stunnel processes. The watchdog logs show that it fails to kill processes on unmount; these log lines repeat for several PIDs.
2020-11-12 16:12:16,530 - INFO - Unmount grace period expired for fs-6a3285fb.var.lib.kubelet.pods.c3015949-346d-42cf-9594-3be561ca30c8.volumes.kubernetes.io~csi.pvc-7ef93798-9182-469f-b35a-72cd13ecfcac.mount.20402
2020-11-12 16:12:16,530 - INFO - Terminating running TLS tunnel - PID: 2773, group ID: 2773
2020-11-12 16:12:16,530 - INFO - TLS tunnel: 2773 is still running, will retry termination
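The netstat check mentioned above, as a sketch, comparing leftover stunnel listeners against the EFS mounts actually in use on the node:
# Sketch: leftover stunnel listeners vs. nfs4 (EFS) mounts in use.
sudo netstat -tlpn | grep -c stunnel
mount | grep -c 'type nfs4'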
What you expected to happen? Ports are freed upon unmount and pods on all nodes are able to mount EFS PVCs.
How to reproduce it (as minimally and precisely as possible)? Not sure how to reproduce as it seems random.
Anything else we need to know?: We have seen this issue several times; however, it seems random when a node fails to release the ports. We experience this in our medium-sized cluster maybe once a week. Other nodes work just fine when this happens. A quick fix is to replace the bad node.
Environment
Kubernetes version (use kubectl version): v1.17.13