kubernetes-sigs / aws-ebs-csi-driver

CSI driver for Amazon EBS https://aws.amazon.com/ebs/
Apache License 2.0

Controller pod volumemodifier container sometimes pegs CPU #2030

Closed: danports closed this issue 1 month ago

danports commented 6 months ago

/kind bug

What happened? Occasionally the volumemodifier container in one of the EBS CSI controller pods starts using the maximum CPU possible (stuck in some kind of busy wait, perhaps?). Here's a chart showing CPU usage for volumemodifier containers in a cluster over 3 days; the nodes have only 2 vCPUs, so the container is essentially consuming all of the available CPU time:

[chart: volumemodifier container CPU usage over 3 days]

You can see from the chart that the lease likely switched from one container to another a couple of times, but the container continued pegging the CPU regardless of which pod it was running in.

What you expected to happen? The container should be using a minuscule amount of CPU time, as it usually does.

How to reproduce it (as minimally and precisely as possible)? Good question! I've only noticed this problem fairly recently - I can try to track down when it started.

Anything else we need to know?: I have two replicas of the controller running, but usually only one of the replicas has a volumemodifier container running out of control, probably the one with the active lease. The logs for the container look like this:

2024-05-04 17:44:13.668 I0504 21:44:13.668099       1 reflector.go:351] Caches populated for *v1.PersistentVolumeClaim from pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229
2024-05-04 17:44:13.666 I0504 21:44:13.665958       1 reflector.go:325] Listing and watching *v1.PersistentVolumeClaim from pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229
2024-05-04 17:44:13.529 I0504 21:44:13.529067       1 reflector.go:351] Caches populated for *v1.PersistentVolume from pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229
2024-05-04 17:44:13.526 I0504 21:44:13.526610       1 reflector.go:325] Listing and watching *v1.PersistentVolume from pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229
2024-05-04 17:44:13.249 I0504 21:44:13.249540       1 reflector.go:351] Caches populated for *v1.Lease from pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229
2024-05-04 17:44:13.229 I0504 21:44:13.228949       1 reflector.go:325] Listing and watching *v1.Lease from pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229
2024-05-04 17:44:12.131 W0504 21:44:12.131235       1 reflector.go:462] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: watch of *v1.Lease ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2024-05-04 17:44:12.131 W0504 21:44:12.130947       1 reflector.go:462] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: watch of *v1.PersistentVolumeClaim ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2024-05-04 17:44:12.131 W0504 21:44:12.130939       1 reflector.go:462] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: watch of *v1.PersistentVolume ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2024-05-04 17:42:35.438 I0504 21:42:35.438669       1 reflector.go:800] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Watch close - *v1.PersistentVolumeClaim total 0 items received
2024-05-04 17:38:53.666 I0504 21:38:53.666134       1 reflector.go:800] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Watch close - *v1.PersistentVolume total 2 items received
2024-05-04 17:37:14.433 I0504 21:37:14.433435       1 reflector.go:800] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Watch close - *v1.PersistentVolumeClaim total 5 items received
2024-05-04 17:35:48.458 I0504 21:35:48.458834       1 reflector.go:800] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Watch close - *v1.Lease total 1247 items received
2024-05-04 17:35:18.092 I0504 21:35:18.092294       1 reflector.go:377] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: forcing resync
2024-05-04 17:35:18.054 I0504 21:35:18.054127       1 reflector.go:377] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: forcing resync

It seems like killing the CPU-hogging pod is sufficient to resolve the problem until it recurs.
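For anyone hitting the same thing, here's a rough sketch of that workaround, assuming the driver runs in `kube-system` with the default `app=ebs-csi-controller` label (namespace, labels, and lease name may differ depending on how the driver was installed):

```bash
# Rough workaround sketch; namespace, label, and lease names are assumptions
# based on a default install and may differ in your cluster.

# Confirm which container is pegging the CPU.
kubectl -n kube-system top pod --containers -l app=ebs-csi-controller

# See which controller replica currently holds the sidecar leases
# (the exact lease name depends on the sidecar version).
kubectl -n kube-system get lease

# Restart the misbehaving replica; the remaining replica should take over the lease.
kubectl -n kube-system delete pod <cpu-hogging-ebs-csi-controller-pod>
```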

Environment

AndrewSirenko commented 6 months ago

Hi @danports, thanks for bringing this to our attention; this is unfortunate.

In the short term, if you're blocked by this and are on a kOps cluster, consider switching to the upstream volume modification feature via VolumeAttributesClass (which we are trying to get turned on by default in Kubernetes 1.31 so that EKS customers can start using it).
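For reference, a minimal sketch of that upstream flow, assuming a cluster with the VolumeAttributesClass feature gate enabled (the API was v1alpha1 in Kubernetes 1.29, and the EBS parameter keys and PVC name below are illustrative assumptions, so check the driver docs for exact names):

```bash
# Minimal VolumeAttributesClass sketch; feature gate, API version, and the
# parameter keys below are assumptions for illustration.
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
  name: gp3-higher-iops
driverName: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "4000"
  throughput: "250"
EOF

# Point an existing PVC (hypothetical name "my-data") at the class to trigger
# a volume modification through the upstream path.
kubectl patch pvc my-data --type merge \
  -p '{"spec":{"volumeAttributesClassName":"gp3-higher-iops"}}'
```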

In the meantime, I'll set up some clusters with constant volume modifications and try to spot any misbehaving sidecars so we can track this bug down.

Again, thank you! 🙏

danports commented 6 months ago

Thanks @AndrewSirenko. I'll take a look at the feature you mentioned. This isn't a huge blocker, since the issue occurs only infrequently (about 1-2x/month, though I don't have enough archived telemetry to confirm the earlier occurrences were the same issue) and can be worked around by killing the pod. It looks like the problem started in late March, shortly after upgrading from Kubernetes 1.28 to 1.29, which, according to the cluster's kOps logs, included a switch from registry.k8s.io/sig-storage/csi-resizer:v1.4.0 to the current public.ecr.aws/ebs-csi-driver/volume-modifier-for-k8s:v0.2.1.
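As a side note, here's a quick way to confirm exactly which sidecar images a cluster is running (namespace and deployment name are assumed from a default install):

```bash
# Print the container names and images of the EBS CSI controller deployment;
# "kube-system" and "ebs-csi-controller" are assumed defaults.
kubectl -n kube-system get deployment ebs-csi-controller \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'
```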

danports commented 6 months ago

Also, is there anything I can do in the cluster (log verbosity, memory dumps, etc.) to help collect troubleshooting data if/when the issue recurs?

AndrewSirenko commented 6 months ago

Thanks for pointing out that the regression occurred between 0.1.4 & 0.2.1; we made a few significant changes in between.

And if you could bump the sidecar's verbosity up to 7, that would be helpful. Thanks a million!
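In case it helps anyone else following along, a sketch of how that verbosity bump might look, assuming a Helm install; the `sidecars.volumemodifier.logLevel` key is an assumption that may vary by chart version, with the fallback being to edit the container's `--v` flag directly:

```bash
# Option 1: assuming the Helm chart exposes a per-sidecar log level
# (the values key may differ by chart version).
helm upgrade aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  -n kube-system --reuse-values \
  --set sidecars.volumemodifier.logLevel=7

# Option 2: edit the controller deployment and raise the volumemodifier
# container's klog verbosity flag (e.g. --v=7) by hand.
kubectl -n kube-system edit deployment ebs-csi-controller
```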

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/2030#issuecomment-2393975566):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

danports commented 3 weeks ago

Just saw this issue arise again today with public.ecr.aws/ebs-csi-driver/volume-modifier-for-k8s:v0.3.0. I bumped the sidecar verbosity up to 7 as requested and will update if I see anything useful in the logs the next time this recurs.