kubernetes-sigs / aws-efs-csi-driver

CSI Driver for Amazon EFS https://aws.amazon.com/efs/
Apache License 2.0

Unable to attach or mount volumes: unmounted volumes=[...], unattached volumes=[...]: timed out waiting for the condition #1155

Closed · eswolinsky3241 closed this issue 1 month ago

eswolinsky3241 commented 10 months ago

/kind bug

What happened?

We use EKS to run a distributed task queue that uses the HPA to scale deployments based on the number of tasks in a Redis queue. The pods in these deployments run on an EC2 managed node group, and every pod in a deployment mounts the same EFS file system to access necessary files. We use the efs-csi-node DaemonSet, which is managed by the Helm chart. Sometimes we scale up to a lot of pods at once to accommodate a large number of jobs added to the queue, and we have started to see this error appear on some of those pods:

```
Unable to attach or mount volumes: unmounted volumes=[migrant], unattached volumes=[hdf5-cache hydra-log shared-pod-storage kube-api-access-g8kfd migrant archive hobo model-cache]: timed out waiting for the condition
```

Most of the pods start successfully, but the ones that show this event get stuck in a `ContainerCreating` status. We have tried increasing resource requests for the DaemonSet, but that has not helped, and the efs-csi-driver container logs do not provide any useful information. This has become a problem for us because our deployments never scale to the level we need them to.
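For context, the setup looks roughly like the sketch below (a simplified illustration, not our actual manifests: the file-system ID, volume names, images, and paths are placeholders). Every replica of the deployment claims the same statically provisioned, ReadWriteMany EFS volume.

```yaml
# Simplified sketch of the setup described above; the fs ID, names, and paths are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi                      # required by the API, not enforced by EFS
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany                   # the same file system is mounted by many pods
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-0123456789abcdef0   # placeholder EFS file-system ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-worker
spec:
  replicas: 1                         # scaled up and down by the HPA
  selector:
    matchLabels:
      app: task-worker
  template:
    metadata:
      labels:
        app: task-worker
    spec:
      containers:
        - name: worker
          image: busybox              # placeholder image
          command: ["sleep", "infinity"]
          volumeMounts:
            - name: shared-efs
              mountPath: /data
      volumes:
        - name: shared-efs
          persistentVolumeClaim:
            claimName: efs-claim
```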

What you expected to happen?

All pods to start with the EFS-backed volume mounted

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?:

Environment

Please also attach debug logs to help us better diagnose

seanzatzdev-amazon commented 10 months ago

@eswolinsky3241 Could you please provide DEBUG level logs? https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/troubleshooting/README.md
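In case it is useful to anyone gathering those logs: a minimal sketch of raising the driver's verbosity through the Helm chart, assuming the chart version in use exposes `controller.logLevel` / `node.logLevel` values that feed the klog `--v` flag (the value names are an assumption; verify against the values.yaml of the chart version you are running):

```yaml
# values.yaml override for the aws-efs-csi-driver chart (assumed value names;
# check the chart version in use). Higher klog verbosity means more detail.
controller:
  logLevel: 5
node:
  logLevel: 5
```

After a `helm upgrade` with an override like this, the mount attempts around the timeout should show up in the `efs-plugin` container of the `efs-csi-node` pod running on the node where a stuck pod was scheduled, e.g. `kubectl logs -n kube-system <efs-csi-node-pod> -c efs-plugin`.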

Also, how many pods do you add when this issue starts occurring? If there is any other additional information about your cluster that may help us recreate the issue, please let me know.

sorind-broadsign commented 7 months ago

@eswolinsky3241 Have you found the root cause or any solution to this issue?
@seanzatzdev-amazon We are facing the same issue on our EKS cluster v1.27.7-eks-4f4795d; we have seen it with driver versions v1.6.0 and v1.7.2. It affects 2 deployments (~6 pods in total) that use the same SC/PV/PVC to mount an EFS volume. Let me know what other information would be helpful. I'm working on getting some debug logs from the efs-csi-driver. Thank you.

eswolinsky3241 commented 7 months ago

@sorind-broadsign Was never able to root cause it but at some point it just stopped happening without any change on my part. Haven’t seen the error in months.

rodrilp commented 6 months ago

> @eswolinsky3241 Have you found the root cause or any solution to this issue? @seanzatzdev-amazon We are facing the same issue on our EKS cluster v1.27.7-eks-4f4795d; we have seen it with driver versions v1.6.0 and v1.7.2. It affects 2 deployments (~6 pods in total) that use the same SC/PV/PVC to mount an EFS volume. Let me know what other information would be helpful. I'm working on getting some debug logs from the efs-csi-driver. Thank you.

Hey @sorind-broadsign, did you have a chance to resolve it? I'm in the same situation.

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/1155#issuecomment-2158813216):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.