Closed: MiGrandjean closed this issue 3 months ago.
Hi @MiGrandjean, can you provide the debugging logs by following the existing troubleshooting section? It still applies to efs-proxy as well: https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting
sorry for the delay, I fell sick. I'll attach the debugging results as requested :) => results.zip
FTR: For compliance reasons, I replaced the AWS Account ID with 111122223333 and the EFS Filesystem ID with fs-00bbccddeeff0000ff.
Experiencing the same or a similar issue.
Fresh v1.30 cluster. v2.0.3 always times out. v1.7.7 works without any issues.
FTR, I have now also tested with v2.0.3 - still the same. v2.0.0, v2.0.1, v2.0.2 and v2.0.3 all show the same error for me. v1.7.7 works like a charm.
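(For anyone needing a stopgap while this is investigated: pinning the chart back to the version that ships v1.7.7 is one way to keep things running. The release name and namespace below are taken from this thread; adjust them to your setup.)
# Roll back to chart 2.5.7, which ships driver v1.7.7
helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system --version 2.5.7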
We are unable to recreate the issue. Is there any special stunnel configuration on your end? v2.x of the driver replaces stunnel with efs-proxy, as you mentioned; just wondering whether there is any additional context that may help us reproduce this.
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ kubectl apply -f storageclass.yaml
storageclass.storage.k8s.io/efs-sc created
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ kubectl apply -f pod.yaml
persistentvolumeclaim/efs-claim created
pod/efs-app created
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default efs-app 1/1 Running 0 7s
kube-system aws-node-brz5q 2/2 Running 2 (18m ago) 2d3h
kube-system aws-node-ffm4k 2/2 Running 2 (18m ago) 2d3h
kube-system coredns-6787556b84-27gjj 1/1 Running 1 (18m ago) 2d3h
kube-system coredns-6787556b84-2mn8r 1/1 Running 1 (18m ago) 2d3h
kube-system efs-csi-controller-77569d79cb-b99cx 3/3 Running 0 35s
kube-system efs-csi-controller-77569d79cb-r9vmx 3/3 Running 0 35s
kube-system efs-csi-node-q2bpg 3/3 Running 0 35s
kube-system efs-csi-node-x2fsc 3/3 Running 0 35s
kube-system kube-proxy-mvz69 1/1 Running 1 (18m ago) 2d3h
kube-system kube-proxy-t8ss8 1/1 Running 1 (18m ago) 2d3h
# describe controller to show v1.7.7
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 85s default-scheduler Successfully assigned kube-system/efs-csi-controller-77569d79cb-b99cx to ip-192-168-21-74.ap-southeast-1.compute.internal
Normal Pulled 84s kubelet Container image "public.ecr.aws/efs-csi-driver/amazon/aws-efs-csi-driver:v1.7.7" already present on machine
Normal Created 84s kubelet Created container efs-plugin
Normal Started 84s kubelet Started container efs-plugin
Normal Pulled 84s kubelet Container image "public.ecr.aws/eks-distro/kubernetes-csi/external-provisioner:v4.0.0-eks-1-29-7" already present on machine
Normal Created 84s kubelet Created container csi-provisioner
Normal Started 84s kubelet Started container csi-provisioner
Normal Pulled 84s kubelet Container image "public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.12.0-eks-1-29-7" already present on machine
Normal Created 84s kubelet Created container liveness-probe
Normal Started 84s kubelet Started container liveness-probe
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ helm uninstall aws-efs-csi-driver -n kube-system
release "aws-efs-csi-driver" uninstalled
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ helm install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver --namespace kube-system --set controller.serviceAccount.create=false --set controller.serviceAccount.name=efs-csi-controller-sa
NAME: aws-efs-csi-driver
LAST DEPLOYED: Wed Jun 5 18:52:47 2024
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
To verify that aws-efs-csi-driver has started, run:
kubectl get pod -n kube-system -l "app.kubernetes.io/name=aws-efs-csi-driver,app.kubernetes.io/instance=aws-efs-csi-driver"
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ kubectl apply -f storageclass.yaml
storageclass.storage.k8s.io/efs-sc created
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ kubectl apply -f pod.yaml
persistentvolumeclaim/efs-claim created
pod/efs-app created
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 automationdriver]$ kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default efs-app 1/1 Running 0 5s
kube-system aws-node-brz5q 2/2 Running 2 (23m ago) 2d3h
kube-system aws-node-ffm4k 2/2 Running 2 (23m ago) 2d3h
kube-system coredns-6787556b84-27gjj 1/1 Running 1 (23m ago) 2d3h
kube-system coredns-6787556b84-2mn8r 1/1 Running 1 (23m ago) 2d3h
kube-system efs-csi-controller-64bf5b7744-9r8mr 3/3 Running 0 2m48s
kube-system efs-csi-controller-64bf5b7744-b7lp6 3/3 Running 0 2m48s
kube-system efs-csi-node-jwk6s 3/3 Running 0 2m49s
kube-system efs-csi-node-xnzqj 3/3 Running 0 2m49s
kube-system kube-proxy-mvz69 1/1 Running 1 (23m ago) 2d3h
kube-system kube-proxy-t8ss8 1/1 Running 1 (23m ago) 2d3h
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m16s default-scheduler Successfully assigned kube-system/efs-csi-controller-64bf5b7744-b7lp6 to ip-192-168-21-74.ap-southeast-1.compute.internal
Normal Pulled 3m16s kubelet Container image "public.ecr.aws/efs-csi-driver/amazon/aws-efs-csi-driver:v2.0.3" already present on machine
Normal Created 3m16s kubelet Created container efs-plugin
Normal Started 3m16s kubelet Started container efs-plugin
Normal Pulled 3m16s kubelet Container image "public.ecr.aws/eks-distro/kubernetes-csi/external-provisioner:v4.0.0-eks-1-29-7" already present on machine
Normal Created 3m16s kubelet Created container csi-provisioner
Normal Started 3m16s kubelet Started container csi-provisioner
Normal Pulled 3m16s kubelet Container image "public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.12.0-eks-1-29-7" already present on machine
Normal Created 3m16s kubelet Created container liveness-probe
Normal Started 3m16s kubelet Started container liveness-probe
We are unable to recreate the issue. Is there any special stunnel configuration on your end? v2.x of the driver replaces stunnel with efs-proxy, as you mentioned; just wondering whether there is any additional context that may help us reproduce this.
Thanks for looking at the issue :) We have no special stunnel configuration that I'm aware of. We use the default image and all config is done with the Helm values as given above.
However, since it's working for you, there has to be some difference 🤔 Our VPC, EKS clusters and EFS filesystems are deployed via Terraform - maybe we have some incompatible settings there or on the networking level? I'll comb through this once more and try to find any differences from a manual installation.
In our case, VPC and EKS are provisioned with the AWS EKS v20.13.1 and AWS VPC v5.8.1 Terraform modules. EFS is provisioned using the Terraform provider directly.
No special stunnel configurations.
Does the issue occur every time you attempt to recreate it, or is it something that appears transiently?
Happens every time.
I still can't find a difference in the settings, but is it possible that this is an IPv4 vs IPv6 problem?
I'm still trying to make sense of the TCP traces, but those for v1.7.7 look like this (ACKs, regular IPv4 traffic):
While those for v2.0.4 look like this (only SYNs, no ACKs, IPv6 traffic):
This is the same node and the same EFS filesystem.
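(For context, the traces above are the kind you can capture directly on the node. The interface choice and the port filter below are assumptions, not the exact command used for these traces.)
# Capture traffic toward the EFS mount target; EFS/NFS uses TCP port 2049
sudo tcpdump -i any -nn port 2049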
@seanzatzdev-amazon @mikluko are your EKS clusters IPv6 enabled?
Does the issue occur every time you attempt to recreate it, or is it something that appears transiently?
It's persistent. Happens every single time.
@seanzatzdev-amazon @mikluko are your EKS clusters IPv6 enabled?
@MiGrandjean Cluster nodes are IPv6-enabled. Overlay network is IPv4.
My cluster is IPv4 and the subnet is IPv4. EFS doesn't support IPv6 currently.
Can you run aws ec2 describe-subnets?
We have run into customers in the past who had the DNS64 option enabled in their subnets, which caused issues similar to this when used incorrectly. Reference - https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway-nat64-dns64.html#nat-gateway-nat64-dns64-walkthrough
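As a sketch, the relevant flag can be pulled for all subnets in a VPC like this (the VPC ID is a placeholder):
# Show the DNS64 status per subnet
aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=vpc-xxxxxxxxxxxxxxxxx" \
  --query "Subnets[].{Subnet:SubnetId,EnableDns64:EnableDns64,Ipv6Native:Ipv6Native}" \
  --output table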
{
    "AvailabilityZone": "us-east-2b",
    "AvailabilityZoneId": "use2-az2",
    "AvailableIpAddressCount": 4027,
    "CidrBlock": "172.17.16.0/20",
    "DefaultForAz": false,
    "MapPublicIpOnLaunch": false,
    "MapCustomerOwnedIpOnLaunch": false,
    "State": "available",
    "SubnetId": "subnet-xxxxxxxxxxxxxxxxx",
    "VpcId": "vpc-xxxxxxxxxxxxxxxxx",
    "OwnerId": "111111111111",
    "AssignIpv6AddressOnCreation": true,
    "Ipv6CidrBlockAssociationSet": [
        {
            "AssociationId": "subnet-cidr-assoc-xxxxxxxxxxxxxxxxx",
            "Ipv6CidrBlock": "2600:xxxx:xxxx:xxxx::/64",
            "Ipv6CidrBlockState": {
                "State": "associated"
            }
        }
    ],
    "SubnetArn": "arn:aws:ec2:us-east-2:111111111111:subnet/subnet-xxxxxxxxxxxxxxxxx",
    "EnableDns64": true,
    "Ipv6Native": false,
    "PrivateDnsNameOptionsOnLaunch": {
        "HostnameType": "ip-name",
        "EnableResourceNameDnsARecord": false,
        "EnableResourceNameDnsAAAARecord": true
    }
}
Reading through the docs left me with the (not entirely firm) understanding that DNS64 is only meaningful for IPv6-only (non-dual-stack) subnets. The AWS VPC Terraform module enables DNS64 for all subnets by default. I will go and try to switch it off.
Please refer to the section "Enable communication with IPv4-only services on the internet with the AWS CLI" and follow the steps listed to allow your cluster to communicate with EFS.
v2.0.3 started to work when I disabled DNS64 on cluster node subnets.
I'm inclined to conclude it's a bug in the AWS VPC Terraform module, which enables DNS64 by default for all subnets.
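(For anyone else landing here: outside of Terraform, the equivalent one-off fix is a single CLI call per subnet; the subnet ID below is a placeholder. With the Terraform module, the corresponding per-subnet-type DNS64 variables would need to be set to false instead; check your module version for the exact variable names.)
# Disable DNS64 on a cluster node subnet
aws ec2 modify-subnet-attribute --subnet-id subnet-xxxxxxxxxxxxxxxxx --no-enable-dns64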
Good to hear, I'll try that on Monday in our clusters and report back.
Awesome
@seanzatzdev-amazon Thanks for helping to get to the bottom of that!
Yes, can confirm, disabling DNS64 on the subnets with the cluster nodes fixes the issue :)
Thanks for your cooperation, I will close the issue for now. Let us know if there are any other difficulties.
/kind bug
What happened?
After upgrading the Helm Chart from v2.5.7 to v3.0.3 (which also updates the aws-efs-csi-driver from v1.7.7 to v2.0.2), new volume mounts of our Pods time out and then fail:
What you expected to happen?
Successfully mount the volume :)
How to reproduce it (as minimally and precisely as possible)?
Anything else we need to know?:
AmazonEKS_EFS_CSI_DriverRole has the AWS managed AmazonEFSCSIDriverPolicy policy attached and was created according to https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/docs/iam-policy-create.md
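(For reference, attaching and then verifying the managed policy from the CLI looks roughly like this; the role name is the one mentioned above, and the managed policy ARN path is assumed.)
# Attach the AWS managed policy to the driver role, then verify it is attached
aws iam attach-role-policy \
  --role-name AmazonEKS_EFS_CSI_DriverRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEFSCSIDriverPolicy
aws iam list-attached-role-policies --role-name AmazonEKS_EFS_CSI_DriverRole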
Environment
Kubernetes version: v1.29 (Server version: v1.29.1-eks-b9c9ed7)
Driver version: v2.0.2
Helm chart version: v3.0.3
Please also attach debug logs to help us better diagnose
Is the "Enable debug logging for efs-utils" section of https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting still up-to-date? Since efs-utils replaced stunnel with efs-proxy, is the described procedure still supposed to work? It didn't change anything in the log outputs for me. I did increased
logLevel
to5
in the Helm values, though. This is the Log from the efs-csi-node Pod (efs-plugin container):Compared to v2.5.7 / v1.7.7 where it just works:
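(FWIW, the logLevel bump was applied via Helm values, roughly as below; the value paths are assumed from the chart's defaults and may differ between chart versions.)
# Raise driver log verbosity on both the node DaemonSet and the controller
helm upgrade aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system --reuse-values \
  --set node.logLevel=5 --set controller.logLevel=5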
If I should provide more details, I'll happily do so - just tell me how :)