kubernetes-sigs / aws-efs-csi-driver

CSI Driver for Amazon EFS https://aws.amazon.com/efs/
Apache License 2.0
711 stars 545 forks source link

Volume mount fail after some time (again) #1353

Closed chroche closed 4 months ago

chroche commented 4 months ago

/kind bug

What happened?

After ~15 days of normal activity, our Airflow EFS volumes started failing to mount with the following messages:

  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    6m52s                default-scheduler  Successfully assigned airflow-prod/weeklypipelineindependentspark-dd11d2f655894aac854f0d63faf2ad8a to ip-10-149-33-40.ec2.internal
  Warning  FailedMount  67s (x3 over 6m49s)  kubelet            MountVolume.SetUp failed for volume "airflow-logs" : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-logs/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-0381d9a2be246f275,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-logs/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
  Warning  FailedMount  67s (x2 over 5m12s)  kubelet  MountVolume.SetUp failed for volume "airflow-dags" : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-dags/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-0df96db406cc8681f,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-dags/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

The kubelet logs show this:

> journalctl -u kubelet
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: E0527 12:30:01.667906    3799 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f podName: nodeName:}" failed. No retries permitted until 2024-05-27 12:30:02.167875293 +0000 UTC m=+1548768.136185610 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "airflow-dags" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f") pod "dailypipelinecreatedailycluste-620341ee9bf246dfb76f476b8819b2bf" (UID: "a76697cd-af91-4a2a-a594-39807642d8c3") : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount": mount failed: exit status 1
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting command: mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting arguments: -t efs -o accesspoint=fsap-0df96db406cc8681f,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: E0527 12:30:01.668265    3799 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0381d9a2be246f275 podName: nodeName:}" failed. No retries permitted until 2024-05-27 12:30:02.168242255 +0000 UTC m=+1548768.136552560 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "airflow-logs" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0381d9a2be246f275") pod "dailypipelinecreatedailycluste-620341ee9bf246dfb76f476b8819b2bf" (UID: "a76697cd-af91-4a2a-a594-39807642d8c3") : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-logs/mount": mount failed: exit status 1
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting command: mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting arguments: -t efs -o accesspoint=fsap-0381d9a2be246f275,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-logs/mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

Then after a short time this additional error:

May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: E0527 12:30:35.887401    3799 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f podName: nodeName:}" failed. No retries permitted until 2024-05-27 12:31:07.887378145 +0000 UTC m=+1548833.855688455 (durationBeforeRetry 32s). Error: MountVolume.SetUp failed for volume "airflow-dags" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f") pod "dailypipelinecreatedailycluste-620341ee9bf246dfb76f476b8819b2bf" (UID: "a76697cd-af91-4a2a-a594-39807642d8c3") : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount": mount failed: exit status 1
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting command: mount
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting arguments: -t efs -o accesspoint=fsap-0df96db406cc8681f,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Output: --- Logging error ---
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Traceback (most recent call last):
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1115, in find_tls_port_in_range_and_get_bind_sock
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: sock.bind(("localhost", tls_port))
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: OSError: [Errno 98] Address already in use
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: During handling of the above exception, another exception occurred:
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Traceback (most recent call last):
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/lib/python3.9/logging/handlers.py", line 74, in emit
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: self.doRollover()
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/lib/python3.9/logging/handlers.py", line 173, in doRollover
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: os.rename(sfn, dfn)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: FileNotFoundError: [Errno 2] No such file or directory: '/var/log/amazon/efs/mount.log.4' -> '/var/log/amazon/efs/mount.log.5'
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Call stack:
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 3933, in <module>
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: main()
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 3911, in main
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: mount_tls(
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 3043, in mount_tls
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: with bootstrap_tls(
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/lib/python3.9/contextlib.py", line 119, in __enter__
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: return next(self.gen)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1656, in bootstrap_tls
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: tls_port_sock = choose_tls_port_and_get_bind_sock(config, options, state_file_dir)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1078, in choose_tls_port_and_get_bind_sock
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: tls_port_sock = find_tls_port_in_range_and_get_bind_sock(
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1118, in find_tls_port_in_range_and_get_bind_sock
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: logging.warning(e)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Message: OSError(98, 'Address already in use')
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Arguments: ()
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

Looking at the log files in /var/log/amazon/efs, they are chock full of this kind of messages:

> cat mount-watchdog.log.2
2024-05-28 10:06:28 UTC - ERROR - Unable to parse json in /var/run/efs/fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155
Traceback (most recent call last):
  File "/usr/bin/amazon-efs-mount-watchdog", line 1155, in check_efs_mounts
    state = json.load(f)
  File "/usr/lib64/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib64/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.7/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 1117 (char 1116)

The mentioned JSON file (and many/most of its siblings) is invalid with extra characters "0}" at the end:

> cat /var/run/efs/fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155
{"pid": 17200, "cmd": ["/usr/bin/stunnel5", "/var/run/efs/stunnel-config.fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155"], "files": ["/var/run/efs/stunnel-config.fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155"], "mount_time": 1715937339.0673409, "mountpoint": "/var/lib/kubelet/pods/1ea49582-822f-41d4-965f-14f27b31badc/volumes/kubernetes.io~csi/airflow-dags/mount", "accessPoint": "fsap-0df96db406cc8681f", "mountStateDir": "fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155+", "commonName": "ip-10-149-33-40.els.vpc.local", "region": "us-east-1", "certificateCreationTime": "240517091539Z", "certificate": "/var/run/efs/fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155+/certificate.pem", "privateKey": "/etc/amazon/efs/privateKey.pem", "fsId": "fs-00e74d7abdcb5d159", "unmount_count": 0}0}

What you expected to happen?

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?:

The initial error seems to be the exact same as https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/103, with the available TCP ports being eaten up by unclosed stunnel5 connections:

> netstat -tnlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:20215         0.0.0.0:*               LISTEN      1128623/stunnel5
tcp        0      0 127.0.0.1:20087         0.0.0.0:*               LISTEN      927740/stunnel5
tcp        0      0 127.0.0.1:20055         0.0.0.0:*               LISTEN      926241/stunnel5
tcp        0      0 127.0.0.1:20279         0.0.0.0:*               LISTEN      2789644/stunnel5
tcp        0      0 127.0.0.1:20119         0.0.0.0:*               LISTEN      866695/stunnel5
tcp        0      0 127.0.0.1:20183         0.0.0.0:*               LISTEN      797232/stunnel5
tcp        0      0 127.0.0.1:20311         0.0.0.0:*               LISTEN      616031/stunnel5

It may be that the amazon-efs-mount-watchdog is prevented from doing its job by the failure to parse invalid JSON file?

Environment

Please also attach debug logs to help us better diagnose

mskanth972 commented 4 months ago

Hi @chroche, the latest EFS CSI Driver version 2.0.4, which resolves this issue, is now available on GitHub. The Add-on version will be released soon. Please upgrade and test it if you are not using the Add-on.