kubernetes-sigs / aws-efs-csi-driver

CSI Driver for Amazon EFS https://aws.amazon.com/efs/
Apache License 2.0
699 stars · 528 forks

GRPC error: rpc error: code = Internal desc = Failed to fetch Access Points or Describe File System: List Access Points failed: RequestCanceled: request context canceled caused by: context deadline exceeded #1290

Open jonassteinberg1 opened 4 months ago

jonassteinberg1 commented 4 months ago

/kind bug

What happened? The documentation does not state that this driver and its resulting workflows can only be used on EKS. Much of the documentation is oriented towards EKS (because no one self-hosts anymore), but none of it says EKS must be used. On that basis I have attempted to run the driver on self-managed EC2 instances. I followed the installation instructions, except that I do not have an OIDC provider enabled. Is one absolutely required?
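(An OIDC provider is only one way to give the controller AWS credentials; on self-managed clusters the SDK can also fall back to the node's instance profile via IMDS. A hedged diagnostic sketch, not a fix — the pod name is the one from this report, and it assumes curl is present in the efs-plugin image:)

```shell
# Check from inside the efs-plugin container whether IMDS is reachable at all.
# Pods sit one network hop behind the node, so an IMDSv2 hop limit of 1 makes
# these calls hang and the SDK time out while resolving credentials.
kubectl exec -n kube-system efs-csi-controller-77c44b5fc7-f2pz6 -c efs-plugin -- \
  sh -c 'TOKEN=$(curl -sf -m 5 -X PUT http://169.254.169.254/latest/api/token \
           -H "X-aws-ec2-metadata-token-ttl-seconds: 60") && \
         curl -sf -m 5 -H "X-aws-ec2-metadata-token: $TOKEN" \
           http://169.254.169.254/latest/meta-data/iam/security-credentials/'
```

If the first curl hangs or fails, the controller never gets credentials, which would explain the timeouts below.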

# in a sense the controllers are fine
efs-csi-controller-77c44b5fc7-f2pz6       3/3     Running   6 (27m ago)    14h
efs-csi-controller-77c44b5fc7-tj852       3/3     Running   6 (27m ago)    14h
# controller version
1.7.6
# storage class is taken right out of the documentation
# my filesystem is correct
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: "fs-06865cae821d34e69"
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: efs-example
spec:
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo $(date -u) >> /example/out.txt; sleep 5; done"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /example
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: efs-claim
# pvc never leaves pending, not the root problem though
Name:          efs-claim
Namespace:     kube-system
StorageClass:  efs-sc
Status:        Pending
Volume:
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: efs.csi.aws.com
               volume.kubernetes.io/storage-provisioner: efs.csi.aws.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       efs-example
Events:
  Type     Reason              Age                From                                                                                      Message
  ----     ------              ----               ----                                                                                      -------
  Warning  ProvisioningFailed  13m (x4 over 19m)  efs.csi.aws.com_efs-csi-controller-77c44b5fc7-f2pz6_25793f87-c7ee-46c1-9538-2ff5b77c521c  failed to provision volume with StorageClass "efs-sc": rpc error: code = Internal desc = Failed to fetch Access Points or Describe File System: List Access Points failed: RequestCanceled: request context canceled
caused by: context deadline exceeded
  Normal   Provisioning          4m54s (x12 over 20m)  efs.csi.aws.com_efs-csi-controller-77c44b5fc7-f2pz6_25793f87-c7ee-46c1-9538-2ff5b77c521c  External provisioner is provisioning volume for claim "kube-system/efs-claim"
  Warning  ProvisioningFailed    4m44s (x8 over 19m)   efs.csi.aws.com_efs-csi-controller-77c44b5fc7-f2pz6_25793f87-c7ee-46c1-9538-2ff5b77c521c  failed to provision volume with StorageClass "efs-sc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Normal   ExternalProvisioning  4s (x82 over 20m)     persistentvolume-controller                                                               Waiting for a volume to be created either by the external provisioner 'efs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
# pod never leaves pending because pvc never leaves pending
Name:             efs-example
Namespace:        kube-system
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Containers:
  app:
    Image:      centos
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      -c
      while true; do echo $(date -u) >> /example/out.txt; sleep 5; done
    Environment:  <none>
    Mounts:
      /example from persistent-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-b68sb (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  persistent-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  efs-claim
    ReadOnly:   false
  kube-api-access-b68sb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  66s (x5 over 21m)  default-scheduler  0/2 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
# root issue seems to be with controller

E0322 15:01:48.110792       1 driver.go:106] GRPC error: rpc error: code = Internal desc = Failed to fetch Access Points or Describe File System: List Access Points failed: RequestCanceled: request context canceled
caused by: context canceled
E0322 15:03:01.956430       1 driver.go:106] GRPC error: rpc error: code = Internal desc = Failed to fetch Access Points or Describe File System: List Access Points failed: RequestCanceled: request context canceled
caused by: context canceled
E0322 15:03:02.938959       1 driver.go:106] GRPC error: rpc error: code = Internal desc = Failed to fetch Access Points or Describe File System: List Access Points failed: RequestCanceled: request context canceled
caused by: context deadline exceeded
E0322 15:06:58.112230       1 driver.go:106] GRPC error: rpc error: code = Internal desc = Failed to fetch Access Points or Describe File System: List Access Points failed: RequestCanceled: request context canceled
caused by: context canceled
E0322 15:08:11.958842       1 driver.go:106] GRPC error: rpc error: code = Internal desc = Failed to fetch Access Points or Describe File System: List Access Points failed: RequestCanceled: request context canceled
caused by: context canceled
E0322 15:08:12.940853       1 driver.go:106] GRPC error: rpc error: code = Internal desc = Failed to fetch Access Points or Describe File System: List Access Points failed: RequestCanceled: request context canceled
caused by: context canceled
E0322 15:09:15.726977       1 driver.go:106] GRPC error: rpc error: code = Internal desc = Failed to fetch Access Points or Describe File System: List Access Points failed: RequestCanceled: request context canceled
caused by: context canceled
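(These repeated `context deadline exceeded` errors are consistent with the AWS SDK inside the controller pod hanging while resolving credentials or reaching the EFS API. On plain EC2 instances a common culprit is the IMDSv2 hop-limit default of 1, which blocks containers from reaching instance metadata. A hedged remediation sketch — the instance ID below is a placeholder:)

```shell
# Raise the IMDS hop limit so containers (one extra network hop behind the
# node's network namespace) can reach instance metadata and obtain the
# instance profile credentials. Instance ID is a placeholder.
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-put-response-hop-limit 2 \
  --http-tokens required
```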

Should be simple enough -- some type of access problem. Okay:

  1. My efs filesystem id is fs-06865cae821d34e69 and this is called correctly in my storage class manifest.
  2. My EFS access point mount line is sudo mount -t efs -o tls,accesspoint=fsap-0515214d4fe7bdac7 fs-06865cae821d34e69:/ efs -- not even sure if this is needed.
  3. I have proved that my control plane and worker nodes can talk to each other over efs mount manually
  4. my EFS security group allows not just 2049/TCP across all IPv4, but all traffic across IPv4
  5. I can hit aws efs anything from all my cluster instances via boto3
  6. my EC2 instance profile role is an admin role that allows all actions on all resources, in other words:
    {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "admin",
            "Effect": "Allow",
            "Action": "*",
            "Resource": "*"
        }
    ]
    }
  7. I have an efs vpc endpoint -- not sure if that is needed, but I have one.
  8. The endpoint security group allows all inbound and outbound traffic across all IPv4
  9. the vpc endpoint is in the private subnet along with all my cluster nodes
  10. none of my nodes, all of which are ubuntu, are running ufw or any proxying software
  11. I am running various other services on this cluster all of which run without issue
  12. I can hit anything on the internet I want
  13. I can hit all other aws services via boto3, cli, whatever, from my instances
  14. I had no issues with the efs csi driver Helm chart and other helm charts for other workloads work just fine
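(Worth noting: all of the checks above run on the nodes themselves, but the controller pod may sit on an overlay network with different routing and DNS. A hedged sketch to repeat the EFS API reachability test from inside the pod — pod name is the one from this report, and the region in the endpoint is a placeholder:)

```shell
# Test connectivity to the EFS API endpoint from inside the controller pod
# rather than from the node (region is a placeholder; assumes curl is present
# in the image).
kubectl exec -n kube-system efs-csi-controller-77c44b5fc7-f2pz6 -c efs-plugin -- \
  curl -sv -m 10 https://elasticfilesystem.us-east-1.amazonaws.com/
```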

What you expected to happen? Dynamic provisioning.

How to reproduce it (as minimally and precisely as possible)?

  1. spin up ubuntu 22.04 control plane and worker node and create functional cluster
  2. open security group to all traffic
  3. create an efs filesystem (and maybe an access point)
  4. create an efs vpc endpoint (maybe)
  5. install driver via helm
  6. install storage class, pvc and pod

Anything else we need to know?:

The debug script results for efs_utils_state_dir are:

kubectl exec efs-csi-controller-77c44b5fc7-f2pz6 -n kube-system -c efs-plugin -- find /var/run/efs -type f -exec ls {} \; -exec cat {} \;
find: '/var/run/efs': No such file or directory
command terminated with exit code 1

but when I do an ls on /var/run/efs on the node itself I get:

# a directory does exist
root@ip-10-0-131-109:/home/ubuntu# ls -la /var/run/efs
total 0
drwxr-xr-x  2 root root   40 Mar 22 14:37 .
drwxr-xr-x 34 root root 1120 Mar 22 15:26 ..
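(The mismatch above is not necessarily surprising: the efs-utils state directory lives on the host and is populated by the node DaemonSet pods, not by the controller pod the exec targeted. A hedged sketch to re-run the same find against a node pod — the app=efs-csi-node label is an assumption about the chart's labels, and the pod name is a placeholder to fill in from the first command:)

```shell
# List the node DaemonSet pods, then repeat the state-dir check against one
# of them instead of the controller.
kubectl get pods -n kube-system -l app=efs-csi-node -o name
kubectl exec <efs-csi-node-pod> -n kube-system -c efs-plugin -- \
  find /var/run/efs -type f -exec ls {} \; -exec cat {} \;
```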

Environment

Please also attach debug logs to help us better diagnose: debug logs attached as results.tgz

jonassteinberg1 commented 3 months ago

@RyanStan any follow up here?

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

thpham commented 3 weeks ago

@jonassteinberg1, hello, I'm facing the same issue even with the latest addon version. Did you manage to resolve it? Thank you.

SimonMachine commented 2 weeks ago

I had a similar problem. While checking, I added a netshoot sidecar container to efs-csi-controller, went inside the pod, and found that it was a DNS issue due to a misconfiguration of CoreDNS. I hope this idea can help you.
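(Following the DNS idea above, a hedged sketch to check name resolution from inside the controller pod without a sidecar — the filesystem ID is the one from this issue, the region is a placeholder, and getent is assumed present in the image:)

```shell
# Verify that the EFS API endpoint and the mount-target DNS name both resolve
# from the controller pod's network namespace.
kubectl exec -n kube-system efs-csi-controller-77c44b5fc7-f2pz6 -c efs-plugin -- \
  getent hosts elasticfilesystem.us-east-1.amazonaws.com
kubectl exec -n kube-system efs-csi-controller-77c44b5fc7-f2pz6 -c efs-plugin -- \
  getent hosts fs-06865cae821d34e69.efs.us-east-1.amazonaws.com
```

If either lookup fails inside the pod while succeeding on the node, CoreDNS (or its upstream forwarding) is the place to look.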

dvb-simp commented 1 week ago

@SimonMachine would you mind explaining what the DNS issues were and how you fixed these?