Closed: jgoeres closed this issue 10 months ago.
I am having the same problem. EFS driver version 1.4.2, cluster and nodes on Kubernetes 1.19.
@jgoeres Did you find any solution?
Thanks
Me too
Try setting resource requests for the containers. I haven't seen this error for quite a while after adding them. https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/325#issuecomment-948639653
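For reference, this is roughly what such requests look like on the efs-csi-node containers (a minimal sketch; the values are illustrative and the exact Helm value keys depend on the chart version):

# Sketch: resource requests for each container of the efs-csi-node DaemonSet,
# settable through the chart's resources values or by patching the DaemonSet.
resources:
  requests:
    cpu: 100m
    memory: 128Mi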
We have experienced the same problem on one of our clusters with a high workload. We already have resource requests set up, but that doesn't help. EFS driver version: 1.4.0, k8s version: 1.21
We have the same issue. EKS: 1.21.14, EFS driver version: 1.4.0
Same problem here. EKS version: 1.21, EFS driver version: 1.4.0
This issue might be resolved by upgrading to the latest driver version, v1.4.9. In v1.4.8, we fixed a concurrency issue with efs-utils that could cause this to happen.
If anyone runs into this again, can you please follow the troubleshooting guide to enable efs-utils debug logging, execute the log collector script, and then post any relevant errors from the mount.log file? That file contains the logs for efs-utils, which does the actual mounting "under the hood" of the CSI driver.
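For anyone collecting those logs, a rough sketch of the steps (the config key and the log collector arguments are assumptions based on the troubleshooting guide and stock efs-utils; adjust to your driver version):

# Switch efs-utils to debug logging inside the node pod (assumes the standard
# logging_level key in /etc/amazon/efs/efs-utils.conf).
kubectl exec -n kube-system <efs-csi-node-pod> -c efs-plugin -- \
  sed -i 's/^logging_level = .*/logging_level = DEBUG/' /etc/amazon/efs/efs-utils.conf

# Reproduce the failing mount, then pull the efs-utils log.
kubectl exec -n kube-system <efs-csi-node-pod> -c efs-plugin -- cat /var/log/amazon/efs/mount.log

# Or run the log collector script from the driver repository's troubleshooting
# directory (check its --help for the exact arguments on your version).
python3 log_collector.py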
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
I'm noticing this problem on EFS CSI v1.5.6.
Pod Event Error
Warning FailedAttachVolume 107s (x9 over 19m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-8a00b9f5-58e0-4e2d-a294-8a9c45e57a1a" : timed out waiting for external-attacher of efs.csi.aws.com CSI driver to attach volume fs-2a825351::fsap-0d75583a12ada3174
These are the log dumps from the log_collector.py tool.
driver_info
kubectl describe pod efs-csi-node-w4sl9 -n kube-system
Name: efs-csi-node-w4sl9
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: efs-csi-node-sa
Node: ip-10-116-161-48.us-east-2.compute.internal/10.116.161.48
Start Time: Thu, 15 Jun 2023 20:05:42 -0500
Labels: app=efs-csi-node
app.kubernetes.io/instance=efs-csi-awscmh2
app.kubernetes.io/name=aws-efs-csi-driver
controller-revision-hash=7dbf8cbdd4
pod-template-generation=7
Annotations: apps.indeed.com/ship-logs: true
kubernetes.io/psp: privileged
vpaObservedContainers: efs-plugin, csi-driver-registrar, liveness-probe
vpaUpdates:
Pod resources updated by efs-csi-node: container 0: cpu request, memory request; container 1: cpu request, memory request; container 2: cp...
Status: Running
IP: 10.116.161.48
IPs:
IP: 10.116.161.48
Controlled By: DaemonSet/efs-csi-node
Containers:
efs-plugin:
Container ID: containerd://13db8a2a7ac72c870487495ec95aa197767b056c2d65baab0a5be42b17a37cd1
Image: harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver:v1.5.6
Image ID: harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver@sha256:cba55174d2df13e9939a83b9d71e8b74f6a27ada2e44252ac80136e33a992d6e
Port: 9809/TCP
Host Port: 9809/TCP
Args:
--endpoint=$(CSI_ENDPOINT)
--logtostderr
--v=5
--vol-metrics-opt-in=false
--vol-metrics-refresh-period=240
--vol-metrics-fs-rate-limit=5
State: Running
Started: Thu, 15 Jun 2023 20:05:48 -0500
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 128Mi
Liveness: http-get http://:healthz/healthz delay=10s timeout=3s period=2s #success=1 #failure=5
Environment:
CSI_ENDPOINT: unix:/csi/csi.sock
Mounts:
/csi from plugin-dir (rw)
/etc/amazon/efs-legacy from efs-utils-config-legacy (rw)
/var/amazon/efs from efs-utils-config (rw)
/var/lib/kubelet from kubelet-dir (rw)
/var/run/efs from efs-state-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xw45q (ro)
csi-driver-registrar:
Container ID: containerd://00a9ea19ed72327e5f808bd87a408f81629c5e86abc8e103773006308eba5f98
Image: public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.8.0-eks-1-27-3
Image ID: public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar@sha256:74e13dfff1d73b0e39ae5883b5843d1672258b34f7d4757995c72d92a26bed1e
Port: <none>
Host Port: <none>
Args:
--csi-address=$(ADDRESS)
--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
--v=5
State: Running
Started: Thu, 15 Jun 2023 20:05:49 -0500
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 128Mi
Environment:
ADDRESS: /csi/csi.sock
DRIVER_REG_SOCK_PATH: /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock
KUBE_NODE_NAME: (v1:spec.nodeName)
Mounts:
/csi from plugin-dir (rw)
/registration from registration-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xw45q (ro)
liveness-probe:
Container ID: containerd://c9e7ab896df75b1249cbbf489adf8fe31d57e2caaf69d49b71a24c3a25858e39
Image: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.10.0-eks-1-27-3
Image ID: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe@sha256:25b4d3f9cf686ac464a742ead16e705da3adcfe574296dd75c5c05ec7473a513
Port: <none>
Host Port: <none>
Args:
--csi-address=/csi/csi.sock
--health-port=9809
--v=5
State: Running
Started: Thu, 15 Jun 2023 20:05:50 -0500
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 128Mi
Environment: <none>
Mounts:
/csi from plugin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xw45q (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kubelet-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet
HostPathType: Directory
plugin-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/plugins/efs.csi.aws.com/
HostPathType: DirectoryOrCreate
registration-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/plugins_registry/
HostPathType: Directory
efs-state-dir:
Type: HostPath (bare host directory volume)
Path: /var/run/efs
HostPathType: DirectoryOrCreate
efs-utils-config:
Type: HostPath (bare host directory volume)
Path: /var/amazon/efs
HostPathType: DirectoryOrCreate
efs-utils-config-legacy:
Type: HostPath (bare host directory volume)
Path: /etc/amazon/efs
HostPathType: DirectoryOrCreate
kube-api-access-xw45q:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events: <none>
kubectl get pod efs-csi-node-w4sl9 -n kube-system -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
apps.indeed.com/ship-logs: "true"
kubernetes.io/psp: privileged
vpaObservedContainers: efs-plugin, csi-driver-registrar, liveness-probe
vpaUpdates: 'Pod resources updated by efs-csi-node: container 0: cpu request,
memory request; container 1: cpu request, memory request; container 2: cpu request,
memory request'
creationTimestamp: "2023-06-16T01:05:42Z"
generateName: efs-csi-node-
labels:
app: efs-csi-node
app.kubernetes.io/instance: efs-csi-awscmh2
app.kubernetes.io/name: aws-efs-csi-driver
controller-revision-hash: 7dbf8cbdd4
pod-template-generation: "7"
name: efs-csi-node-w4sl9
namespace: kube-system
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: DaemonSet
name: efs-csi-node
uid: aa1527ec-97b6-498c-a21d-9a642d26c242
resourceVersion: "2386821689"
uid: eccdbf2a-3285-4adc-8ad2-c7ba68c33f02
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchFields:
- key: metadata.name
operator: In
values:
- ip-10-116-161-48.us-east-2.compute.internal
containers:
- args:
- --endpoint=$(CSI_ENDPOINT)
- --logtostderr
- --v=5
- --vol-metrics-opt-in=false
- --vol-metrics-refresh-period=240
- --vol-metrics-fs-rate-limit=5
env:
- name: CSI_ENDPOINT
value: unix:/csi/csi.sock
image: harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver:v1.5.6
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 5
httpGet:
path: /healthz
port: healthz
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 2
successThreshold: 1
timeoutSeconds: 3
name: efs-plugin
ports:
- containerPort: 9809
hostPort: 9809
name: healthz
protocol: TCP
resources:
requests:
cpu: 100m
memory: 128Mi
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/kubelet
mountPropagation: Bidirectional
name: kubelet-dir
- mountPath: /csi
name: plugin-dir
- mountPath: /var/run/efs
name: efs-state-dir
- mountPath: /var/amazon/efs
name: efs-utils-config
- mountPath: /etc/amazon/efs-legacy
name: efs-utils-config-legacy
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xw45q
readOnly: true
- args:
- --csi-address=$(ADDRESS)
- --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
- --v=5
env:
- name: ADDRESS
value: /csi/csi.sock
- name: DRIVER_REG_SOCK_PATH
value: /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock
- name: KUBE_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
image: public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.8.0-eks-1-27-3
imagePullPolicy: IfNotPresent
name: csi-driver-registrar
resources:
requests:
cpu: 100m
memory: 128Mi
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /csi
name: plugin-dir
- mountPath: /registration
name: registration-dir
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xw45q
readOnly: true
- args:
- --csi-address=/csi/csi.sock
- --health-port=9809
- --v=5
image: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.10.0-eks-1-27-3
imagePullPolicy: IfNotPresent
name: liveness-probe
resources:
requests:
cpu: 100m
memory: 128Mi
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /csi
name: plugin-dir
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xw45q
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostNetwork: true
nodeName: ip-10-116-161-48.us-east-2.compute.internal
nodeSelector:
kubernetes.io/os: linux
preemptionPolicy: PreemptLowerPriority
priority: 2000001000
priorityClassName: system-node-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 0
runAsGroup: 0
runAsNonRoot: false
runAsUser: 0
serviceAccount: efs-csi-node-sa
serviceAccountName: efs-csi-node-sa
terminationGracePeriodSeconds: 30
tolerations:
- operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/disk-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/pid-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/unschedulable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/network-unavailable
operator: Exists
volumes:
- hostPath:
path: /var/lib/kubelet
type: Directory
name: kubelet-dir
- hostPath:
path: /var/lib/kubelet/plugins/efs.csi.aws.com/
type: DirectoryOrCreate
name: plugin-dir
- hostPath:
path: /var/lib/kubelet/plugins_registry/
type: Directory
name: registration-dir
- hostPath:
path: /var/run/efs
type: DirectoryOrCreate
name: efs-state-dir
- hostPath:
path: /var/amazon/efs
type: DirectoryOrCreate
name: efs-utils-config
- hostPath:
path: /etc/amazon/efs
type: DirectoryOrCreate
name: efs-utils-config-legacy
- name: kube-api-access-xw45q
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2023-06-16T01:05:42Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2023-06-16T01:05:51Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2023-06-16T01:05:51Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2023-06-16T01:05:42Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://00a9ea19ed72327e5f808bd87a408f81629c5e86abc8e103773006308eba5f98
image: public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.8.0-eks-1-27-3
imageID: public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar@sha256:74e13dfff1d73b0e39ae5883b5843d1672258b34f7d4757995c72d92a26bed1e
lastState: {}
name: csi-driver-registrar
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2023-06-16T01:05:49Z"
- containerID: containerd://13db8a2a7ac72c870487495ec95aa197767b056c2d65baab0a5be42b17a37cd1
image: harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver:v1.5.6
imageID: harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver@sha256:cba55174d2df13e9939a83b9d71e8b74f6a27ada2e44252ac80136e33a992d6e
lastState: {}
name: efs-plugin
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2023-06-16T01:05:48Z"
- containerID: containerd://c9e7ab896df75b1249cbbf489adf8fe31d57e2caaf69d49b71a24c3a25858e39
image: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.10.0-eks-1-27-3
imageID: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe@sha256:25b4d3f9cf686ac464a742ead16e705da3adcfe574296dd75c5c05ec7473a513
lastState: {}
name: liveness-probe
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2023-06-16T01:05:50Z"
hostIP: 10.116.161.48
phase: Running
podIP: 10.116.161.48
podIPs:
- ip: 10.116.161.48
qosClass: Burstable
startTime: "2023-06-16T01:05:42Z"
driver_logs
kubectl logs efs-csi-node-w4sl9 -n kube-system efs-plugin
I0616 01:05:48.928661 1 config_dir.go:88] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I0616 01:05:48.929567 1 metadata.go:63] getting MetadataService...
I0616 01:05:48.931589 1 metadata.go:68] retrieving metadata from EC2 metadata service
I0616 01:05:48.932454 1 cloud.go:137] EFS Client created using the following endpoint: https://elasticfilesystem.us-east-2.amazonaws.com
I0616 01:05:48.932478 1 driver.go:84] Node Service capability for Get Volume Stats Not enabled
I0616 01:05:48.932588 1 driver.go:140] Did not find any input tags.
I0616 01:05:48.932739 1 driver.go:113] Registering Node Server
I0616 01:05:48.932752 1 driver.go:115] Registering Controller Server
I0616 01:05:48.932758 1 driver.go:118] Starting efs-utils watchdog
I0616 01:05:48.932833 1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.conf since it exists already
I0616 01:05:48.932846 1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.crt since it exists already
I0616 01:05:48.933148 1 driver.go:124] Starting reaper
I0616 01:05:48.933167 1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0616 01:05:50.285468 1 node.go:306] NodeGetInfo: called with args
efs_utils_logs (something seems wrong here)
kubectl exec efs-csi-node-w4sl9 -n kube-system -c efs-plugin -- find /var/log/amazon/efs -type f -exec echo {} \; -exec cat {} \; -exec echo \;
find: 'echo': No such file or directory
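The empty log output above is most likely an artifact of the collection command rather than missing logs: the error shows the container has no standalone echo binary for find's -exec to run, so it bails out before printing anything. A sketch of an equivalent invocation that avoids echo (a local adjustment, not part of the log collector):

kubectl exec efs-csi-node-w4sl9 -n kube-system -c efs-plugin -- \
  find /var/log/amazon/efs -type f -exec sh -c 'printf "%s\n" "$1"; cat "$1"; printf "\n"' _ {} \;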
efs_utils_state_dir
kubectl exec efs-csi-node-w4sl9 -n kube-system -c efs-plugin -- find /var/run/efs -type f -exec echo {} \; -exec cat {} \; -exec echo \;
mounts
kubectl exec efs-csi-node-w4sl9 -n kube-system -c efs-plugin -- mount |grep nfs
After further digging in our case, we noticed that the CSIDriver resource was missing in the cluster where the problem above was occurring. We have no idea why it's missing, but manually recreating it caused the controller to start working again.
This doesn't seem to be the first time an issue with the CSIDriver resource was noticed during a helm upgrade. https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/325#issuecomment-779385896
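For anyone else hitting the "timed out waiting for external-attacher" symptom: the driver normally installs a CSIDriver object with attachRequired: false, which is what tells the attach/detach controller that EFS volumes need no attach step, so a missing CSIDriver produces exactly this timeout. A quick check, plus a minimal sketch of the object to recreate (compare it against the manifest shipped with your driver version before applying):

kubectl get csidriver efs.csi.aws.com

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  attachRequired: false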
@wmgroot I just experienced the same issue. Are you using ArgoCD? I'm still debugging the behaviour, but I can reproduce a diff that wants to delete the CSIDriver.
I believe it's related to how Helm hooks are used in the chart for that resource and how ArgoCD handles them.
We are using ArgoCD to manage our EFS CSI installation, yes. We check our Argo diffs as part of our upgrade process and I do not remember seeing a deletion of the CSIDriver, but it's possible that we missed this during a previous upgrade or I wasn't paying enough attention.
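A possible stop-gap while the hook handling is investigated (an assumption on my part, not a confirmed fix): annotate the CSIDriver so an Argo CD sync will never prune it:

kubectl annotate csidriver efs.csi.aws.com argocd.argoproj.io/sync-options=Prune=false --overwrite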
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Hi, we are using the EFS CSI driver (currently version 1.3.2) to provision EFS-based volumes to our workloads. One of our clusters is currently suffering from a situation where freshly deployed pods that mount such volumes are stuck in ContainerCreating (resp. "Init:0/" for pods with init containers) for a very long time. Pods that are part of the same workload but do not mount EFS volumes do not suffer from this, so it is 99.9% related to the EFS CSI driver.
This is how the (somewhat anonymized) workload presents itself when it is in that stuck state:
As an example, these are the events for the pod meme-default-2 while it is in this state (note that the volume that does attach immediately and without problems is an EBS volume, handled by the EBS CSI driver):
Note that in this example, the cluster autoscaler did perform a scale-up, but the issue also occurs on pods scheduled on already existing nodes. So I don't think that the autoscaler is involved in the problem.
The EFS CSI node pod on the node where the above pod is scheduled logs no obvious errors (at least none obvious to someone not familiar with the inner workings of the EFS CSI driver).
Eventually, the attaching/mounting of the EFS volumes succeeds; this can take 10-15 minutes, but sometimes hours. Usually, when the mounting works, it works for all pods that are currently stuck. But the problem is not gone: when I later scale up a workload (or have a new pod launched by, e.g., a cronjob), these new pods will often be stuck again. For example, here we have the pods of a cronjob (running once an hour) not being scheduled for more than two hours because of this problem. Scaling up the "meme" workload to 4 instances leaves the new pod No. 3 stuck again:
Restarting the EFS CSI driver pods (both the efs-csi-node DaemonSet and the efs-csi-controller Deployment) sometimes seemed to help; currently it doesn't. Restarting all nodes temporarily fixed it, but the problem later occurs again.
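(For reference, restarting both components amounts to something like the following, using the DaemonSet and Deployment names mentioned above.)

kubectl -n kube-system rollout restart daemonset/efs-csi-node
kubectl -n kube-system rollout restart deployment/efs-csi-controller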
I mentioned that we are observing this in only one cluster at this time. What separates this cluster from the others is that only on this cluster do we have a high "workload churn": the cluster runs several deployments of our application in different namespaces, which are refreshed (i.e., deleted and recreated) several times a day. This deletion includes the EFS-based volumes (we implicitly delete their PVCs by deleting the namespace; the storage class we use for dynamic provisioning has its reclaim policy set to Delete, so PVs are also deleted, as are the associated EFS access points). On most of our other clusters, we create deployments and then use them for a longer period of time, only performing minor changes (e.g., rolling out patches) but keeping the EFS volumes.
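For context, the dynamic-provisioning setup described above corresponds to a storage class roughly like this (a sketch; the name and file system ID are placeholders, and the parameters reflect the driver's access-point provisioning mode):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc                          # placeholder name
provisioner: efs.csi.aws.com
reclaimPolicy: Delete                   # PVs (and their access points) are deleted with the PVC
parameters:
  provisioningMode: efs-ap              # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0    # placeholder file system ID
  directoryPerms: "700"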