kubernetes-sigs / aws-efs-csi-driver

CSI Driver for Amazon EFS https://aws.amazon.com/efs/
Apache License 2.0

Liveness probe for Pods using EFS volume mounts fails after upgrade/downgrade of EFS version #1156

Closed · roshanirathi · closed 1 month ago

roshanirathi commented 10 months ago

I have an EKS cluster (1.27) running the EFS CSI driver v1.5.6. When I install Prometheus on it, the pods come up and the volume mounts are successful. When I upgrade the driver to v1.6.0, the prometheus-operator statefulset goes to the NotReady state. I can see this from the kubectl events: `Liveness probe failed: Get "http://10.0.3.219:9090/-/healthy": context deadline exceeded (Client.Timeout exceeded while awaiting headers)`
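
The failure is visible with standard kubectl commands; a minimal sketch (the `monitoring` namespace and the pod name are placeholders from a typical prometheus-operator install and may differ):

```sh
# Watch for liveness/readiness probe failures
kubectl get events -n monitoring --field-selector reason=Unhealthy --watch

# Inspect the probe definition and recent events for one affected pod
kubectl describe pod prometheus-prometheus-operator-prometheus-0 -n monitoring
```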

Similar behaviour is seen with the WordPress app, and it happens only in the upgrade or downgrade scenario. I have checked other CSI drivers like EBS and vSphere, and the issue is not seen there, so it is an EFS issue.

What you expected to happen? The liveness probe should not fail because of an upgrade or downgrade of the EFS CSI driver.

How to reproduce it (as minimally and precisely as possible)?

  1. Create an EKS cluster with the EFS CSI driver v1.5.6.
  2. Install Prometheus on top of it.
  3. Upgrade the driver to v1.6.0 (a command sketch follows this list).
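
A rough command sketch of these steps, for reference. The helm repo URL is the one documented in this project's README; the Prometheus chart and pinning the driver image via `image.tag` are just one way to do it and are assumptions about my setup:

```sh
# 1. Install the EFS CSI driver at v1.5.6 (repo URL per the project README)
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system --set image.tag=v1.5.6

# 2. Install Prometheus (any workload mounting an EFS-backed PVC reproduces it)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

# 3. Upgrade the driver to v1.6.0 and watch the workload pods degrade
helm upgrade aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system --set image.tag=v1.6.0
kubectl get pods --namespace monitoring --watch
```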

Environment

seanzatzdev-amazon commented 10 months ago

@roshanirathi How do you install the driver? Also, if you can, could you please provide DEBUG level logs? https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/troubleshooting/README.md
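
For reference, a rough sketch of pulling the node-side logs; the label and container names below assume a default chart install and may differ in your deployment:

```sh
# CSI node plugin logs from each daemonset pod
kubectl logs -n kube-system -l app=efs-csi-node -c efs-plugin --tail=-1

# efs-utils writes its mount log on the node itself (default efs-utils path)
kubectl exec -n kube-system <efs-csi-node-pod> -c efs-plugin -- \
  cat /var/log/amazon/efs/mount.log
```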

seanzatzdev-amazon commented 10 months ago

@roshanirathi Do you see any error logs similar to this hostNetwork issue [link]?

roshanirathi commented 9 months ago

I use the helm chart for driver installation. The efs-utils logs are similar, and the issue is the same: mounts don't work after the upgrade.

The debug logs are present here - https://drive.google.com/drive/folders/1jBmNqdV4UEGMRbm7IFeU13GFLZdpjYdd?usp=sharing

roshanirathi commented 9 months ago

The issue is fixed in v1.7.0. Closing this.

roshanirathi commented 9 months ago

/reopen

k8s-ci-robot commented 9 months ago

@roshanirathi: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/1156#issuecomment-1757474577):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

roshanirathi commented 9 months ago

I am still seeing this issue when I have EBS as the base CSI driver and EFS as an add-on CSI driver. The volume mounts use the EFS storageclass, and the same issue is seen when I upgrade from v1.5.6 to v1.7.0.

seanzatzdev-amazon commented 9 months ago

Hi @roshanirathi, to ensure that we have the most accurate information, could you please provide the new debug-level logs?

roshanirathi commented 9 months ago

Added the logs here - https://drive.google.com/drive/folders/1jBmNqdV4UEGMRbm7IFeU13GFLZdpjYdd?usp=sharing

seanzatzdev-amazon commented 9 months ago

@roshanirathi From what I understand, you did the following:

I don't fully understand where EBS factors into your setup. Do you have both EFS & EBS CSI drivers running?

Could you please provide a list of instructions to replicate the new issue?

roshanirathi commented 9 months ago

  1. Create an EKS cluster with the EBS CSI driver.
  2. Deploy the EFS CSI driver v1.5.6 on it.
  3. Deploy the Prometheus or WordPress app on it.
  4. Once volumes are mounted, upgrade the EFS driver to v1.7.0.
  5. Once the new EFS driver pods are up, the pods using EFS volume mounts start failing.

Yes, I have both the EBS and EFS drivers on an EKS cluster (verification sketch below).
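
Both drivers show up as registered; a quick check (the driver names are the standard ones shipped by the two projects):

```sh
# Both provisioners should be registered with the cluster
kubectl get csidrivers
# Expect to see ebs.csi.aws.com and efs.csi.aws.com listed

# The affected storageclass should point at efs.csi.aws.com
kubectl get storageclass -o wide
```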

roshanirathi commented 9 months ago

Any update on this?

william00179 commented 9 months ago

We have experienced this same issue when upgrading to v1.7.0 via helm.

Existing pods using EFS all started failing to access their EFS volumes, and new pods coming up were unable to mount EFS volumes with `Unable to attach or mount volumes: unmounted volumes=[instance-cache], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition` until the node was terminated and relaunched.
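
A quick way to check whether the existing NFS mounts on a node went stale after the driver upgrade; a sketch to run on the affected node (or via a debug pod), with `<pod-uid>` and `<pv-name>` as placeholders:

```sh
# List the EFS (NFS v4) mounts on the node
findmnt -t nfs4

# A stale mount typically hangs or returns an I/O error here
stat /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount
```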

roshanirathi commented 6 months ago

@seanzatzdev-amazon any update on this? I am seeing the same issue with v1.7.1 as well.

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/1156#issuecomment-2156496430):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.