kubernetes-sigs / aws-efs-csi-driver

CSI Driver for Amazon EFS https://aws.amazon.com/efs/
Apache License 2.0
711 stars 545 forks

Volume mounts time out after upgrading aws-efs-csi-driver from 1.7.7 to 2.0.2 #1343

Closed MiGrandjean closed 3 months ago

MiGrandjean commented 5 months ago

/kind bug

What happened?

After upgrading the Helm Chart from v2.5.7 to v3.0.3 (which also updates the aws-efs-csi-driver from v1.7.7 to v2.0.2), new volume mounts of our Pods time out and then fail:

E0503 08:47:46.370467       1 driver.go:106] GRPC error: rpc error: code = Internal desc = Could not mount "fs-00bbbbbbbbbbbbbbb:/" at "/var/lib/kubelet/pods/88a4322e-686d-4eae-a045-77edd4f4e07b/volumes/kubernetes.io~csi/pvc-2e0b9760-eb6e-4b5e-a48a-f54919673948/mount": mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-02d4502d4502d4502,tls fs-00bbbbbbbbbbbbbbb:/ /var/lib/kubelet/pods/88a4322e-686d-4eae-a045-77edd4f4e07b/volumes/kubernetes.io~csi/pvc-2e0b9760-eb6e-4b5e-a48a-f54919673948/mount
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
Mount attempt 1/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
Mount attempt 2/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
b'mount.nfs4: Connection timed out'

What you expected to happen?

Successfully mount the volume :)

How to reproduce it (as minimally and precisely as possible)?

controller:
  deleteAccessPointRootDir: false
  logLevel: 5
  serviceAccount:
    create: true
    name: efs-csi-controller-sa
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/AmazonEKS_EFS_CSI_DriverRole
node:
  logLevel: 5
  volMetricsOptIn: false
  serviceAccount:
    create: true
    name: efs-csi-node-sa
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/AmazonEKS_EFS_CSI_DriverRole
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi
storageClasses:
- mountOptions:
  - tls
  name: efs-sc-tools
  parameters:
    directoryPerms: "700"
    fileSystemId: fs-99999999999999999
    provisioningMode: efs-ap
- mountOptions:
  - tls
  name: efs-sc-apps
  parameters:
    directoryPerms: "700"
    fileSystemId: fs-00bbbbbbbbbbbbbbb
    gid: "999"
    provisioningMode: efs-ap
    uid: "999"
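
For completeness, the chart is installed roughly like this (repo URL as documented in the project README; values.yaml holds the Helm values above):

helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update
helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system \
  --version 3.0.3 \
  -f values.yaml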

Anything else we need to know?:

Environment

Please also attach debug logs to help us better diagnose

Is the "Enable debug logging for efs-utils" section of https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting still up-to-date? Since efs-utils replaced stunnel with efs-proxy, is the described procedure still supposed to work? It didn't change anything in the log outputs for me. I did increased logLevel to 5 in the Helm values, though. This is the Log from the efs-csi-node Pod (efs-plugin container):

E0503 08:47:46.370467       1 driver.go:106] GRPC error: rpc error: code = Internal desc = Could not mount "fs-00bbbbbbbbbbbbbbb:/" at "/var/lib/kubelet/pods/88a4322e-686d-4eae-a045-77edd4f4e07b/volumes/kubernetes.io~csi/pvc-2e0b9760-eb6e-4b5e-a48a-f54919673948/mount": mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-02d4502d4502d4502,tls fs-00bbbbbbbbbbbbbbb:/ /var/lib/kubelet/pods/88a4322e-686d-4eae-a045-77edd4f4e07b/volumes/kubernetes.io~csi/pvc-2e0b9760-eb6e-4b5e-a48a-f54919673948/mount
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
Mount attempt 1/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
Mount attempt 2/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
b'mount.nfs4: Connection timed out'

Compared to v2.5.7 / v1.7.7 where it just works:

I0503 07:01:45.024974       1 mount_linux.go:219] Mounting cmd (mount) with arguments (-t efs -o accesspoint=fsap-02d4502d4502d4502,tls fs-00bbbbbbbbbbbbbbb:/ /var/lib/kubelet/pods/f5c11c2c-ebef-4b00-8c7c-c8e8174546a6/volumes/kubernetes.io~csi/pvc-2e0b9760-eb6e-4b5e-a48a-f54919673048/mount)
I0503 07:01:55.800746       1 node.go:193] NodePublishVolume: /var/lib/kubelet/pods/f5c11c2c-ebef-4b00-8c7c-c8e8174546a6/volumes/kubernetes.io~csi/pvc-2e0b9760-eb6e-4b5e-a48a-f54919673048/mount was mounted

If I should provide more details, I'll happily do so - just tell me how :)

mskanth972 commented 5 months ago

Hi @MiGrandjean, can you provide the debug logs by following that same section? The procedure is still the same for efs-proxy: https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting
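
Roughly, that procedure comes down to the following (pod name is a placeholder; config path and log location are the efs-utils defaults, please verify inside your image):

# exec into the efs-plugin container of the efs-csi-node pod on the affected node
kubectl exec -it -n kube-system efs-csi-node-xxxxx -c efs-plugin -- /bin/sh

# inside the container: raise the efs-utils log level (assumes the default "logging_level = INFO" line)
sed -i 's/logging_level = INFO/logging_level = DEBUG/' /etc/amazon/efs/efs-utils.conf

# after the next mount attempt, the efs-utils logs are written under /var/log/amazon/efs/
ls /var/log/amazon/efs/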

MiGrandjean commented 4 months ago

Hi @MiGrandjean, can you provide the debug logs by following that same section? The procedure is still the same for efs-proxy: https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting

Sorry for the delay, I was ill. I've attached the debugging results as requested :) => results.zip

FTR: For compliance reasons, I replaced the AWS Account ID with 111122223333 and the EFS Filesystem ID with fs-00bbccddeeff0000ff.

mikluko commented 4 months ago

Experiencing the same or a similar issue.

Fresh v1.30 cluster. v2.0.3 always times out. v1.7.7 works without any issues.

MiGrandjean commented 4 months ago

FTR, I have now also tested with v2.0.3 - still the same. v2.0.0, v2.0.1, v2.0.2, and v2.0.3 all show the same error for me; v1.7.7 works like a charm.

seanzatzdev-amazon commented 3 months ago

We are unable to recreate the issue. Is there any special stunnel configuration on your end? v2.x of the driver replaces stunnel with efs-proxy, as you mentioned; just wondering whether there is any additional context that might help us reproduce this.

seanzatzdev-amazon commented 3 months ago
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ kubectl apply -f storageclass.yaml 
storageclass.storage.k8s.io/efs-sc created
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ kubectl apply -f pod.yaml 
persistentvolumeclaim/efs-claim created
pod/efs-app created
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ kubectl get pod -A
NAMESPACE     NAME                                  READY   STATUS    RESTARTS      AGE
default       efs-app                               1/1     Running   0             7s
kube-system   aws-node-brz5q                        2/2     Running   2 (18m ago)   2d3h
kube-system   aws-node-ffm4k                        2/2     Running   2 (18m ago)   2d3h
kube-system   coredns-6787556b84-27gjj              1/1     Running   1 (18m ago)   2d3h
kube-system   coredns-6787556b84-2mn8r              1/1     Running   1 (18m ago)   2d3h
kube-system   efs-csi-controller-77569d79cb-b99cx   3/3     Running   0             35s
kube-system   efs-csi-controller-77569d79cb-r9vmx   3/3     Running   0             35s
kube-system   efs-csi-node-q2bpg                    3/3     Running   0             35s
kube-system   efs-csi-node-x2fsc                    3/3     Running   0             35s
kube-system   kube-proxy-mvz69                      1/1     Running   1 (18m ago)   2d3h
kube-system   kube-proxy-t8ss8                      1/1     Running   1 (18m ago)   2d3h

# describe controller to show v1.7.7
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  85s   default-scheduler  Successfully assigned kube-system/efs-csi-controller-77569d79cb-b99cx to ip-192-168-21-74.ap-southeast-1.compute.internal
  Normal  Pulled     84s   kubelet            Container image "public.ecr.aws/efs-csi-driver/amazon/aws-efs-csi-driver:v1.7.7" already present on machine
  Normal  Created    84s   kubelet            Created container efs-plugin
  Normal  Started    84s   kubelet            Started container efs-plugin
  Normal  Pulled     84s   kubelet            Container image "public.ecr.aws/eks-distro/kubernetes-csi/external-provisioner:v4.0.0-eks-1-29-7" already present on machine
  Normal  Created    84s   kubelet            Created container csi-provisioner
  Normal  Started    84s   kubelet            Started container csi-provisioner
  Normal  Pulled     84s   kubelet            Container image "public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.12.0-eks-1-29-7" already present on machine
  Normal  Created    84s   kubelet            Created container liveness-probe
  Normal  Started    84s   kubelet            Started container liveness-probe

[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ helm uninstall aws-efs-csi-driver -n kube-system
release "aws-efs-csi-driver" uninstalled

[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ helm install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver --namespace kube-system --set controller.serviceAccount.create=false --set controller.serviceAccount.name=efs-csi-controller-sa
NAME: aws-efs-csi-driver
LAST DEPLOYED: Wed Jun  5 18:52:47 2024
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
To verify that aws-efs-csi-driver has started, run:

    kubectl get pod -n kube-system -l "app.kubernetes.io/name=aws-efs-csi-driver,app.kubernetes.io/instance=aws-efs-csi-driver"

[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ kubectl apply -f storageclass.yaml 
storageclass.storage.k8s.io/efs-sc created
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ kubectl apply -f pod.yaml 
persistentvolumeclaim/efs-claim created
pod/efs-app created
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ kubectl get pod -A
NAMESPACE     NAME                                  READY   STATUS    RESTARTS      AGE
default       efs-app                               1/1     Running   0             5s
kube-system   aws-node-brz5q                        2/2     Running   2 (23m ago)   2d3h
kube-system   aws-node-ffm4k                        2/2     Running   2 (23m ago)   2d3h
kube-system   coredns-6787556b84-27gjj              1/1     Running   1 (23m ago)   2d3h
kube-system   coredns-6787556b84-2mn8r              1/1     Running   1 (23m ago)   2d3h
kube-system   efs-csi-controller-64bf5b7744-9r8mr   3/3     Running   0             2m48s
kube-system   efs-csi-controller-64bf5b7744-b7lp6   3/3     Running   0             2m48s
kube-system   efs-csi-node-jwk6s                    3/3     Running   0             2m49s
kube-system   efs-csi-node-xnzqj                    3/3     Running   0             2m49s
kube-system   kube-proxy-mvz69                      1/1     Running   1 (23m ago)   2d3h
kube-system   kube-proxy-t8ss8                      1/1     Running   1 (23m ago)   2d3h

Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  3m16s  default-scheduler  Successfully assigned kube-system/efs-csi-controller-64bf5b7744-b7lp6 to ip-192-168-21-74.ap-southeast-1.compute.internal
  Normal  Pulled     3m16s  kubelet            Container image "public.ecr.aws/efs-csi-driver/amazon/aws-efs-csi-driver:v2.0.3" already present on machine
  Normal  Created    3m16s  kubelet            Created container efs-plugin
  Normal  Started    3m16s  kubelet            Started container efs-plugin
  Normal  Pulled     3m16s  kubelet            Container image "public.ecr.aws/eks-distro/kubernetes-csi/external-provisioner:v4.0.0-eks-1-29-7" already present on machine
  Normal  Created    3m16s  kubelet            Created container csi-provisioner
  Normal  Started    3m16s  kubelet            Started container csi-provisioner
  Normal  Pulled     3m16s  kubelet            Container image "public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.12.0-eks-1-29-7" already present on machine
  Normal  Created    3m16s  kubelet            Created container liveness-probe
  Normal  Started    3m16s  kubelet            Started container liveness-probe
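
For context, storageclass.yaml and pod.yaml are essentially the standard dynamic provisioning example from the driver repo, roughly along these lines (file system ID is a placeholder):

# apply the dynamic provisioning example: StorageClass, PVC, and a pod writing to the volume
cat <<'EOF' | kubectl apply -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: efs-app
spec:
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo $(date -u) >> /data/out; sleep 5; done"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: efs-claim
EOF
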
MiGrandjean commented 3 months ago

We are unable to recreate the issue. Is there any special stunnel configuration on your end? v2.x of the driver replaces stunnel with efs-proxy, as you mentioned; just wondering whether there is any additional context that might help us reproduce this.

Thanks for looking at the issue :) We have no special stunnel configuration that I'm aware of. We use the default image and all config is done with the Helm values as given above.

However, since it's working for you, there has to be some difference 🤔 Our VPC, EKS clusters, and EFS filesystems are deployed via Terraform - maybe we have some incompatible settings there or at the networking level? I'll comb through this once more and try to find any differences from a manual installation.

mikluko commented 3 months ago

In our case, the VPC and EKS cluster are provisioned with the AWS EKS v20.13.1 and AWS VPC v5.8.1 Terraform modules. EFS is provisioned using the Terraform provider directly.

No special stunnel configurations.

seanzatzdev-amazon commented 3 months ago

Does the issue occur every time you attempt to recreate it, or is it something that appears transiently?

MiGrandjean commented 3 months ago

Does the issue occur every time you attempt to recreate it, or is it something that appears transiently?

Happens every time.

MiGrandjean commented 3 months ago

I still can't find a difference in the settings, but is it possible that this is an IPv4 vs. IPv6 problem?

I'm still trying to make sense of the TCP traces, but those for v1.7.7 look like this (ACKs, regular IPv4 traffic):

[screenshot: TCP trace for v1.7.7 - regular IPv4 traffic with SYN/ACKs]

While those for v2.0.4 look like this (only SYNs, no ACKs, IPv6 traffic):

[screenshot: TCP trace for v2.0.4 - IPv6 SYNs only, no ACKs]

This is the same node and the same EFS filesystem.
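
(For anyone who wants to capture the same kind of trace: something along these lines on the worker node, or from a hostNetwork debug pod, will show the NFS traffic; 2049 is the NFS port, and the interface and output file are just examples.)

sudo tcpdump -n -i any port 2049 -w efs-mount-traffic.pcap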

@seanzatzdev-amazon @mikluko are your EKS clusters IPv6 enabled?

mikluko commented 3 months ago

Does the issue occur every time you attempt to recreate it, or is it something that appears transiently?

It's persistent. Happens every single time.

mikluko commented 3 months ago

@seanzatzdev-amazon @mikluko are your EKS clusters IPv6 enabled?

@MiGrandjean Cluster nodes are IPv6-enabled. Overlay network is IPv4.

seanzatzdev-amazon commented 3 months ago

My cluster is IPv4 and the subnet is IPv4. EFS doesn't currently support IPv6.

seanzatzdev-amazon commented 3 months ago

Can you run aws ec2 describe-subnets?

We have run into customers in the past who had the DNS64 option enabled in their subnets, which caused issues similar to this when used incorrectly. Reference - https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway-nat64-dns64.html#nat-gateway-nat64-dns64-walkthrough
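
Something along these lines makes it quick to spot subnets with DNS64 turned on (the --query expression is just an example; substitute your VPC ID):

aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=<your-vpc-id>" \
  --query 'Subnets[].{Subnet:SubnetId,EnableDns64:EnableDns64,Ipv6Native:Ipv6Native}' \
  --output table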

mikluko commented 3 months ago
{
  "AvailabilityZone": "us-east-2b",
  "AvailabilityZoneId": "use2-az2",
  "AvailableIpAddressCount": 4027,
  "CidrBlock": "172.17.16.0/20",
  "DefaultForAz": false,
  "MapPublicIpOnLaunch": false,
  "MapCustomerOwnedIpOnLaunch": false,
  "State": "available",
  "SubnetId": "subnet-xxxxxxxxxxxxxxxxx",
  "VpcId": "vpc-xxxxxxxxxxxxxxxxx",
  "OwnerId": "111111111111",
  "AssignIpv6AddressOnCreation": true,
  "Ipv6CidrBlockAssociationSet": [
    {
      "AssociationId": "subnet-cidr-assoc-xxxxxxxxxxxxxxxxx",
      "Ipv6CidrBlock": "2600:xxxx:xxxx:xxxx::/64",
      "Ipv6CidrBlockState": {
        "State": "associated"
      }
    }
  ],
  "SubnetArn": "arn:aws:ec2:us-east-2:111111111111:subnet/subnet-xxxxxxxxxxxxxxxxx",
  "EnableDns64": true,
  "Ipv6Native": false,
  "PrivateDnsNameOptionsOnLaunch": {
    "HostnameType": "ip-name",
    "EnableResourceNameDnsARecord": false,
    "EnableResourceNameDnsAAAARecord": true
  }
}
mikluko commented 3 months ago

Reading through the docs left me with the (not entirely firm) understanding that DNS64 is only meaningful for non-dual-stack, IPv6-only subnets. The AWS VPC Terraform module enables DNS64 for all subnets by default. I will try switching it off.

seanzatzdev-amazon commented 3 months ago

Please refer to the section "Enable communication with IPv4-only services on the internet with the AWS CLI" and follow the steps listed to allow your cluster to communicate with EFS.

mikluko commented 3 months ago

v2.0.3 started working once I disabled DNS64 on the cluster node subnets.
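
For reference, per subnet this can also be done with the CLI (flag name assumed from the usual boolean pattern of modify-subnet-attribute; double-check with aws ec2 modify-subnet-attribute help):

aws ec2 modify-subnet-attribute \
  --subnet-id subnet-xxxxxxxxxxxxxxxxx \
  --no-enable-dns64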

mikluko commented 3 months ago

I'm inclined to conclude this is a bug in the AWS VPC Terraform module, which enables DNS64 by default for all subnets.

MiGrandjean commented 3 months ago

Good to hear. I'll try that on Monday in our clusters and report back.

seanzatzdev-amazon commented 3 months ago

Awesome

mikluko commented 3 months ago

Awesome

@seanzatzdev-amazon Thanks for helping to get to the bottom of that!

MiGrandjean commented 3 months ago

Yes, I can confirm: disabling DNS64 on the subnets with the cluster nodes fixes the issue :)

seanzatzdev-amazon commented 3 months ago

Thanks for your cooperation, I will close the issue for now. Let us know if there are any other difficulties.