After ~15 days of normal activity, our Airflow EFS volumes started failing to mount with the following messages:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m52s default-scheduler Successfully assigned airflow-prod/weeklypipelineindependentspark-dd11d2f655894aac854f0d63faf2ad8a to ip-10-149-33-40.ec2.internal
Warning FailedMount 67s (x3 over 6m49s) kubelet MountVolume.SetUp failed for volume "airflow-logs" : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-logs/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-0381d9a2be246f275,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-logs/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
Warning FailedMount 67s (x2 over 5m12s) kubelet MountVolume.SetUp failed for volume "airflow-dags" : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-dags/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-0df96db406cc8681f,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-dags/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
The kubelet logs show this:
> journalctl -u kubelet
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: E0527 12:30:01.667906 3799 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f podName: nodeName:}" failed. No retries permitted until 2024-05-27 12:30:02.167875293 +0000 UTC m=+1548768.136185610 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "airflow-dags" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f") pod "dailypipelinecreatedailycluste-620341ee9bf246dfb76f476b8819b2bf" (UID: "a76697cd-af91-4a2a-a594-39807642d8c3") : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount": mount failed: exit status 1
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting command: mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting arguments: -t efs -o accesspoint=fsap-0df96db406cc8681f,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: E0527 12:30:01.668265 3799 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0381d9a2be246f275 podName: nodeName:}" failed. No retries permitted until 2024-05-27 12:30:02.168242255 +0000 UTC m=+1548768.136552560 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "airflow-logs" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0381d9a2be246f275") pod "dailypipelinecreatedailycluste-620341ee9bf246dfb76f476b8819b2bf" (UID: "a76697cd-af91-4a2a-a594-39807642d8c3") : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-logs/mount": mount failed: exit status 1
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting command: mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting arguments: -t efs -o accesspoint=fsap-0381d9a2be246f275,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-logs/mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
Then after a short time this additional error:
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: E0527 12:30:35.887401 3799 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f podName: nodeName:}" failed. No retries permitted until 2024-05-27 12:31:07.887378145 +0000 UTC m=+1548833.855688455 (durationBeforeRetry 32s). Error: MountVolume.SetUp failed for volume "airflow-dags" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f") pod "dailypipelinecreatedailycluste-620341ee9bf246dfb76f476b8819b2bf" (UID: "a76697cd-af91-4a2a-a594-39807642d8c3") : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount": mount failed: exit status 1
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting command: mount
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting arguments: -t efs -o accesspoint=fsap-0df96db406cc8681f,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Output: --- Logging error ---
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Traceback (most recent call last):
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1115, in find_tls_port_in_range_and_get_bind_sock
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: sock.bind(("localhost", tls_port))
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: OSError: [Errno 98] Address already in use
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: During handling of the above exception, another exception occurred:
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Traceback (most recent call last):
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/lib/python3.9/logging/handlers.py", line 74, in emit
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: self.doRollover()
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/lib/python3.9/logging/handlers.py", line 173, in doRollover
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: os.rename(sfn, dfn)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: FileNotFoundError: [Errno 2] No such file or directory: '/var/log/amazon/efs/mount.log.4' -> '/var/log/amazon/efs/mount.log.5'
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Call stack:
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 3933, in <module>
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: main()
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 3911, in main
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: mount_tls(
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 3043, in mount_tls
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: with bootstrap_tls(
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/lib/python3.9/contextlib.py", line 119, in __enter__
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: return next(self.gen)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1656, in bootstrap_tls
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: tls_port_sock = choose_tls_port_and_get_bind_sock(config, options, state_file_dir)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1078, in choose_tls_port_and_get_bind_sock
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: tls_port_sock = find_tls_port_in_range_and_get_bind_sock(
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1118, in find_tls_port_in_range_and_get_bind_sock
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: logging.warning(e)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Message: OSError(98, 'Address already in use')
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Arguments: ()
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
Looking at the log files in /var/log/amazon/efs, they are chock full of this kind of messages:
> cat mount-watchdog.log.2
2024-05-28 10:06:28 UTC - ERROR - Unable to parse json in /var/run/efs/fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155
Traceback (most recent call last):
File "/usr/bin/amazon-efs-mount-watchdog", line 1155, in check_efs_mounts
state = json.load(f)
File "/usr/lib64/python3.7/json/__init__.py", line 296, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/usr/lib64/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python3.7/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 1117 (char 1116)
The mentioned JSON file (and many/most of its siblings) is invalid with extra characters "0}" at the end:
Hi @chroche, the latest EFS CSI Driver version 2.0.4, which resolves this issue, is now available on GitHub. The Add-on version will be released soon. Please upgrade and test it if you are not using the Add-on.
/kind bug
What happened?
After ~15 days of normal activity, our Airflow EFS volumes started failing to mount with the following messages:
The
kubelet
logs show this:Then after a short time this additional error:
Looking at the log files in
/var/log/amazon/efs
, they are chock full of this kind of messages:The mentioned JSON file (and many/most of its siblings) is invalid with extra characters "
0}
" at the end:What you expected to happen?
How to reproduce it (as minimally and precisely as possible)?
Anything else we need to know?:
The initial error seems to be the exact same as https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/103, with the available TCP ports being eaten up by unclosed
stunnel5
connections:It may be that the
amazon-efs-mount-watchdog
is prevented from doing its job by the failure to parse invalid JSON file?Environment
kubectl version
):Server Version: v1.28.8-eks-adc7111
v1.7.7
from Helm Chartv2.5.7
Please also attach debug logs to help us better diagnose