Multiple intermittent restarts in ebs-csi driver

Neha130 commented 3 months ago

/kind bug

What happened?

There have been multiple intermittent restarts in almost all ebs-csi-controller containers. Previous container logs are attached below:

container : csi-provisioner

[Aug 08 2024 21:04:06 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: I0808 15:34:06.841330       1 controller.go:811] Starting provisioner controller ebs.csi.aws.com_ebs-csi-controller-5c7698687-mqxfp_33cd4205-586f-42f2-a13b-ee18c5cf8f67!
[Aug 08 2024 21:04:06 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: I0808 15:34:06.942386       1 controller.go:860] Started provisioner controller ebs.csi.aws.com_ebs-csi-controller-5c7698687-mqxfp_33cd4205-586f-42f2-a13b-ee18c5cf8f67!
[Aug 18 2024 23:56:45 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: E0818 18:26:45.933586       1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
[Aug 18 2024 23:56:48 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: I0818 18:26:48.927115       1 leaderelection.go:283] failed to renew lease utils/ebs-csi-aws-com: timed out waiting for the condition
[Aug 18 2024 23:56:48 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: F0818 18:26:48.927147       1 leader_election.go:182] stopped leading
[Aug 18 2024 23:56:49 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: I0818 18:26:48.932146       1 volume_store.go:104] Stopped save volume queue

[Aug 18 2024 18:33:53 GMT+0530] ebs-csi-controller-c76b64f95-fvnnt: I0818 13:03:53.350772       1 leaderelection.go:285] failed to renew lease kube-system/ebs-csi-aws-com: timed out waiting for the condition
[Aug 18 2024 18:33:53 GMT+0530] ebs-csi-controller-c76b64f95-fvnnt: F0818 13:03:53.350826       1 leader_election.go:181] stopped leading

container : csi-attacher

[Aug 08 2024 21:04:35 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: I0808 15:34:35.508399       1 csi_handler.go:282] Detaching "csi-16e4c15998045a1b011d7e7f034740d248b227d0d2d1d485d8b81949996fc8d1"
[Aug 08 2024 21:04:35 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: I0808 15:34:35.513470       1 csi_handler.go:251] Attaching "csi-80c174c723a6f7e84710cba88b5b4ca36286b06f8808379a063da772ce0b598a"
[Aug 08 2024 21:04:35 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: I0808 15:34:35.912392       1 csi_handler.go:581] Detached "csi-16e4c15998045a1b011d7e7f034740d248b227d0d2d1d485d8b81949996fc8d1"
[Aug 08 2024 21:04:37 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: I0808 15:34:37.460209       1 csi_handler.go:264] Attached "csi-80c174c723a6f7e84710cba88b5b4ca36286b06f8808379a063da772ce0b598a"
[Aug 08 2024 21:05:32 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: I0808 15:35:32.159139       1 csi_handler.go:251] Attaching "csi-1f84f2fd70d26f1d43500d0b05aaef3a05a7964e36770d94304267568902fc90"
[Aug 08 2024 21:05:34 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: I0808 15:35:34.184115       1 csi_handler.go:264] Attached "csi-1f84f2fd70d26f1d43500d0b05aaef3a05a7964e36770d94304267568902fc90"
[Aug 18 2024 23:56:46 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: E0818 18:26:46.780871       1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
[Aug 18 2024 23:56:49 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: I0818 18:26:49.769493       1 leaderelection.go:283] failed to renew lease utils/external-attacher-leader-ebs-csi-aws-com: timed out waiting for the condition
[Aug 18 2024 23:56:49 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: F0818 18:26:49.769524       1 leader_election.go:182] stopped leading

container : csi-resizer

[Aug 18 2024 18:33:54 GMT+0530] ebs-csi-controller-c76b64f95-fvnnt: I0818 13:03:54.034790       1 controller.go:262] "Shutting down external resizer" controller="ebs.csi.aws.com"
[Aug 18 2024 18:33:54 GMT+0530] ebs-csi-controller-c76b64f95-fvnnt: E0818 13:03:54.034653       1 leaderelection.go:332] error retrieving resource lock kube-system/external-resizer-ebs-csi-aws-com: Get "https://10.100.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/external-resizer-ebs-csi-aws-com": context deadline exceeded
[Aug 18 2024 18:33:54 GMT+0530] ebs-csi-controller-c76b64f95-fvnnt: I0818 13:03:54.034699       1 leaderelection.go:285] failed to renew lease kube-system/external-resizer-ebs-csi-aws-com: timed out waiting for the condition
[Aug 18 2024 18:33:54 GMT+0530] ebs-csi-controller-c76b64f95-fvnnt: F0818 13:03:54.034731       1 leader_election.go:181] stopped leading

container : csi-snapshotter

[Aug 18 2024 18:33:58 GMT+0530] ebs-csi-controller-c76b64f95-fvnnt: I0818 13:03:58.527728       1 leaderelection.go:285] failed to renew lease kube-system/external-snapshotter-leader-ebs-csi-aws-com: timed out waiting for the condition
[Aug 18 2024 18:33:58 GMT+0530] ebs-csi-controller-c76b64f95-fvnnt: F0818 13:03:58.527782       1 leader_election.go:181] stopped leading

Environment

Kubernetes version (use kubectl version): v1.30.0
Driver version (image versions in use): csi-attacher v4.5.1-eks-1-30-2, csi-provisioner v4.0.1-eks-1-30-2, csi-snapshotter v7.0.2-eks-1-30-2, csi-resizer v1.10.1-eks-1-30-2

ConnorJC3 commented 3 months ago

Hi @Neha130 - the errors you are experiencing indicate an issue with your Kubernetes control plane. Based on the logs, the Kubernetes API server appears to be timing out when the sidecars are attempting to update the lease.

In particular, these errors indicate a likely issue with your cluster's etcd installation:

[Aug 18 2024 23:56:45 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: E0818 18:26:45.933586       1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
[Aug 18 2024 23:56:46 GMT+0530] ebs-csi-controller-5c7698687-mqxfp: E0818 18:26:46.780871       1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
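
To see how widespread these errors are across pods and sidecars, the previous container logs can be scanned for the relevant messages. The following is a minimal sketch, not part of the driver or its tooling: the `KLOG` pattern, `classify`, and `summarize` are hypothetical helper names, and the pattern assumes each line carries the `pod-name: ` prefix shown in the logs above, followed by a standard klog header.

```python
import re
from collections import defaultdict

# Assumed line shape: "<pod>: <severity><MMDD> <time> <pid> <file>:<line>] <message>"
KLOG = re.compile(
    r"(?P<pod>\S+): (?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>[\d:.]+)\s+\d+ "
    r"(?P<src>[\w.]+:\d+)\] (?P<msg>.*)"
)


def classify(msg):
    """Map a log message to a coarse failure category, or None if unrelated."""
    msg = msg.strip()
    if "etcdserver: request timed out" in msg:
        return "etcd-timeout"
    if "context deadline exceeded" in msg:
        return "apiserver-timeout"
    if msg.startswith("failed to renew lease"):
        return "lease-renew-failed"
    if msg == "stopped leading":
        return "stopped-leading"
    return None


def summarize(lines):
    """Count failure categories per pod across the given log lines."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = KLOG.search(line)
        if match is None:
            continue  # not a klog line in the assumed format
        kind = classify(match["msg"])
        if kind is not None:
            counts[match["pod"]][kind] += 1
    return {pod: dict(kinds) for pod, kinds in counts.items()}
```

Feeding it the output of `kubectl logs --previous` for each sidecar container (with the pod name prefixed to each line) gives a per-pod count of etcd timeouts, API server timeouts, failed lease renewals, and leadership losses.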

You will need to resolve this issue for the EBS CSI Driver to function properly. The EBS CSI Driver and the Kubernetes CSI sidecars it uses are not designed to work in an environment where the Kubernetes API server is failing or timing out requests. In such an environment they may behave abnormally, for example with the restarts you are seeing.
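
As an illustration of why API server timeouts turn into container restarts, here is a simplified simulation of a leader-election renew loop. It is a sketch under stated assumptions, not the sidecars' actual code: `renew_once` and `run_leader` are hypothetical names, time is simulated rather than measured, and the 10s renew deadline and 5s retry period are illustrative values that do not come from this thread. The idea it models is that the leader keeps retrying the lease update, and if no attempt succeeds before the renew deadline it stops leading, which corresponds to the fatal-severity (`F`) `stopped leading` line in the logs above.

```python
from typing import Callable, Optional


def renew_once(update_lock: Callable[[float], bool], start: float,
               renew_deadline: float, retry_period: float) -> Optional[float]:
    """Retry update_lock until it succeeds or renew_deadline has elapsed.

    Returns the simulated time of the successful update, or None on timeout.
    """
    t = start
    while t - start < renew_deadline:
        if update_lock(t):
            return t
        t += retry_period  # wait before the next attempt
    return None


def run_leader(update_lock: Callable[[float], bool], total: float = 60.0,
               renew_deadline: float = 10.0, retry_period: float = 5.0) -> str:
    """Simulate a leader renewing its lease for `total` simulated seconds."""
    t = 0.0
    while t < total:
        renewed_at = renew_once(update_lock, t, renew_deadline, retry_period)
        if renewed_at is None:
            # No successful update within the deadline: leadership is lost.
            return f"stopped leading at t={t + renew_deadline:.0f}s"
        t = renewed_at + retry_period
    return "still leading"


# Hypothetical API server behaviours, expressed as "does the update succeed at time t?"
healthy = lambda t: True
short_blip = lambda t: not (20 <= t < 25)   # one failed attempt, then recovery
long_outage = lambda t: not (20 <= t < 40)  # outage longer than the renew deadline
```

In this model a single failed update is absorbed by the retry, while an outage that outlasts the renew deadline ends leadership. This is why the restarts track the health of the control plane rather than anything in the driver itself.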

I would recommend reaching out for support from whoever operates your Kubernetes cluster or provides your Kubernetes distro, if applicable, for assistance.

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ConnorJC3 commented 2 weeks ago

/close

Closing this issue out due to inactivity. If you need further assistance please reopen this issue or open a new one.

k8s-ci-robot commented 2 weeks ago

@ConnorJC3: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/2122#issuecomment-2488648410):

>/close
>
>Closing this issue out due to inactivity. If you need further assistance please reopen this issue or open a new one.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
