kubernetes-sigs / aws-ebs-csi-driver

CSI driver for Amazon EBS https://aws.amazon.com/ebs/
Apache License 2.0

Weird Rpc error: code = DeadlineExceeded desc = context deadline exceeded and error listing AWS instances: RequestCanceled: request context canceled #1783

Closed zjalicflw closed 1 year ago

zjalicflw commented 1 year ago

/kind bug

What happened?

After uninstalling and installing the bitnami/kafka Helm chart on my EKS cluster a couple of times due to some errors, a new blocking error occurred. Suddenly, all pods are stuck in ContainerCreating status. Upon inspection, kubectl describe pod displays:

Warning FailedAttachVolume 10s (x6 over 29s) attachdetach-controller AttachVolume.Attach failed for volume "pvc-95a5209c-797c-49de-ae30-9def18935393" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

After this, today I tried to delete and recreate the PVCs, but a similar error occurs when recreating them:

Warning ProvisioningFailed 20m  ebs.csi.aws.com_ebs-csi-controller-7cb6bff767-8f9jj_ff3337d4-2a27-4593-b371-0c78b6b73fe0 failed to provision volume with StorageClass "gp2": rpc error: code = Internal desc = Could not create volume "pvc-3b745751-ce69-446d-a094-89f84900bdbc": could not create volume in EC2: RequestCanceled: request context canceled
caused by: context deadline exceeded
 Normal  Provisioning     6m33s (x12 over 21m) ebs.csi.aws.com_ebs-csi-controller-7cb6bff767-8f9jj_ff3337d4-2a27-4593-b371-0c78b6b73fe0 External provisioner is provisioning volume for claim "default/data-kafka-0"
 Warning ProvisioningFailed  6m23s (x11 over 21m) ebs.csi.aws.com_ebs-csi-controller-7cb6bff767-8f9jj_ff3337d4-2a27-4593-b371-0c78b6b73fe0 failed to provision volume with StorageClass "gp2": rpc error: code = DeadlineExceeded desc = context deadline exceeded
 Normal  ExternalProvisioning 100s (x83 over 21m)  persistentvolume-controller  Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

Logs from the EBS CSI driver controller pod:

E1013 12:43:23.647806       1 driver.go:124] "GRPC error" err=<
    rpc error: code = Internal desc = Could not detach volume "vol-0d61e5511a40db185" from node "i-0a7f1ad09359b3374": error listing AWS instances: RequestCanceled: request context canceled
    caused by: context canceled
 >
E1013 12:43:23.652891       1 driver.go:124] "GRPC error" err=<
    rpc error: code = Internal desc = Could not detach volume "vol-0e37dabb932ace606" from node "i-0187ea34d2b675a5c": error listing AWS instances: RequestCanceled: request context canceled
    caused by: context deadline exceeded
 >
I1013 12:43:23.664699       1 controller.go:444] "ControllerUnpublishVolume: detaching" volumeID="vol-0d61e5511a40db185" nodeID="i-0a7f1ad09359b3374"
I1013 12:43:23.667103       1 controller.go:444] "ControllerUnpublishVolume: detaching" volumeID="vol-0e37dabb932ace606" nodeID="i-0187ea34d2b675a5c"
E1013 12:43:23.774055       1 driver.go:124] "GRPC error" err=<
    rpc error: code = Internal desc = Could not detach volume "vol-0fb663d85437897ab" from node "i-05b75e1891fb38735": error listing AWS instances: RequestCanceled: request context canceled
    caused by: context canceled
 >
E1013 12:43:23.776023       1 driver.go:124] "GRPC error" err=<
    rpc error: code = Internal desc = Could not detach volume "vol-0163a5d445e993518" from node "i-0187ea34d2b675a5c": error listing AWS instances: RequestCanceled: request context canceled
    caused by: context canceled
 >

What you expected to happen?

The CSI driver should attach and detach volumes properly.

How to reproduce it (as minimally and precisely as possible)?

Not sure; this is a very specific situation.

Anything else we need to know?:

Is this some AWS quota limit? During testing I uninstalled and reinstalled the Kafka chart many times, and each time there was no problem with the PVCs; then suddenly kubectl describe pod started showing context deadline exceeded errors.

Environment

zjalicflw commented 1 year ago

https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/214

This seems similar; however, I have tried everything to solve this, and no matter what, I get the same error: context deadline exceeded.

debdutdeb commented 1 year ago

Facing this right now

zjalicflw commented 1 year ago

Hi @debdutdeb

I managed to solve my issue by reinstalling the CoreDNS, VPC CNI, and EBS CSI driver add-ons, updating them to the latest versions. After this, my Kafka pods were running.

This should be easily fixed by uninstalling all add-ons (making sure to also uninstall any that are NOT installed through the AWS add-ons console), installing them all again, and then deleting any PVCs stuck in attaching. Of course, this only works if you use dynamic provisioning; if you use static provisioning, just detach and reattach the volumes.

Carefully inspecting your PVCs, PVs, and the EBS volumes attached to your EKS cluster's instances should help you pinpoint the problem.
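A command-line sketch of the inspection and cleanup steps above, assuming EKS managed add-ons; the cluster name and claim name are illustrative, and you may need to pass an explicit --addon-version:

```shell
# Inspect claims, volumes, and attachments for anything stuck.
kubectl get pvc,pv -A
kubectl get volumeattachment

# Update the relevant EKS managed add-ons to newer versions
# (add --addon-version <version> if required).
aws eks update-addon --cluster-name my-cluster --addon-name coredns
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni
aws eks update-addon --cluster-name my-cluster --addon-name aws-ebs-csi-driver

# If a dynamically provisioned claim remains stuck attaching,
# delete it so it gets re-provisioned.
kubectl delete pvc data-kafka-0 -n default
```

These commands run against a live cluster and AWS account, so adapt the names before use.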

You can elaborate more if you need help, I will try to do my best.

Filip

j-land commented 1 year ago

We are running into the same issue in an EKS environment.

Kubernetes version: v1.24.17-eks-4f4795d
Driver version: 1.24.0 (from Helm chart version aws-ebs-csi-driver-2.24.0)

I1113 08:42:54.079730       1 csi_handler.go:251] Attaching "csi-57939a06730aa4167c1609c46f5d8a3f6196360670b974e355bf2f6cf01a746c"
I1113 08:42:54.079786       1 csi_handler.go:251] Attaching "csi-b394ecc409f06a620fbce7118bdf4db434e5f359196317f98a42cdcac85eacdb"
I1113 08:42:54.080160       1 controller.go:415] "ControllerPublishVolume: attaching" volumeID="vol-0934dc0da8301b04d" nodeID="i-0c8e24cd69c5ca516"
I1113 08:42:54.080160       1 controller.go:415] "ControllerPublishVolume: attaching" volumeID="vol-056e1e688e7a0aa8c" nodeID="i-0c8e24cd69c5ca516"
E1113 08:43:09.080470       1 driver.go:124] "GRPC error" err=<
    rpc error: code = Internal desc = Could not attach volume "vol-056e1e688e7a0aa8c" to node "i-0c8e24cd69c5ca516": error listing AWS instances: RequestCanceled: request context canceled
    caused by: context canceled
 >
E1113 08:43:09.080469       1 driver.go:124] "GRPC error" err=<
    rpc error: code = Internal desc = Could not attach volume "vol-0934dc0da8301b04d" to node "i-0c8e24cd69c5ca516": error listing AWS instances: RequestCanceled: request context canceled
    caused by: context canceled
 >  
I1113 08:43:09.087184       1 csi_handler.go:234] Error processing "csi-b394ecc409f06a620fbce7118bdf4db434e5f359196317f98a42cdcac85eacdb": failed to attach: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1113 08:43:09.089415       1 csi_handler.go:234] Error processing "csi-57939a06730aa4167c1609c46f5d8a3f6196360670b974e355bf2f6cf01a746c": failed to attach: rpc error: code = DeadlineExceeded desc = context deadline exceeded

I managed to solve my issue by reinstalling both CoreDNS plugins and VPC CNI and EBS Driver. ... This should be easily fixed by uninstalling all addons, making sure to uninstall ones that are NOT installed through AWS addons console, install them all again and then delete some PVCs if stuck on attaching. ...

These steps may be fine for one-off cases, but they aren't feasible for our production environment. I would like to work towards a more durable fix in the ebs-csi-driver application.

j-land commented 1 year ago

@zjalicflw Can you reopen this issue?

torredil commented 1 year ago

Hi @j-land, as a first step, I recommend upgrading to the latest version of the driver, which sets a more sensible default timeout value for the external attacher. See our release notes here for more information: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/CHANGELOG.md#v1250.

Beyond that, if you are still running into issues, I'd recommend enabling SDK logs via the sdkDebugLog parameter to help provide further insight into networking or auth related issues. Feel free to open a new issue if you need any help.
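For reference, a Helm upgrade with the sdkDebugLog parameter enabled might look like the sketch below; the release name and namespace are illustrative:

```shell
# Pull the latest chart and upgrade the driver with AWS SDK debug logging on.
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm repo update
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --set sdkDebugLog=true
```

After the upgrade, the controller pod logs should include the AWS SDK request/response detail useful for diagnosing networking or auth problems.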

j-land commented 1 year ago

@torredil That's helpful, I appreciate it! Hopefully upgrading does the trick, but I'll enable SDK logs to debug if not.

nookseal commented 1 month ago

Did upgrading solve the problem? @j-land