Error: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline

Shashank1306s commented 3 months ago

What steps did you take and what happened:

There are bunch of clusters where the backup of snapshots are failing with the following error:

What did you expect to happen: Cluster seems to be healthy; we are not seeing any issue related to pod crunch or Api server in the timeframe when this error occurred.

The following information will help us better understand what's going on: Found a related GitHub issues: https://github.com/helm/helm/issues/12154 Error is getting thrown at this Code Path: https://github.com/kubernetes/client-go/blob/354ed1bc9f1f48c820fdc5e84b566acfd716cf42/rest/request.go#L619

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

kubectl logs deployment/velero -n velero
velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>

Anything else you would like to add:

Environment:

Velero version (use velero version):
Velero features (use velero client config get features):
Kubernetes version (use kubectl version):
Kubernetes installer & version:
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

:+1: for "I would like to see this bug fixed as soon as possible"
:-1: for "There are more important bugs to focus on right now"

Lyndon-Li commented 3 months ago

Please share more details:

Where do you see the error "failed to get volumesnapshot..."
Please share the entire debug bundle generated by velero debug
Which Velero version are you using

Shashank1306s commented 2 months ago

Please share more details:

Where do you see the error "failed to get volumesnapshot..."

Please share the entire debug bundle generated by velero debug

Which Velero version are you using

We are on velero 1.13.

Attaching the velero log file. clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109-logs.gz

Lyndon-Li commented 2 months ago

Looks like the wait retry mechanism which hits a timeout cause the client-go generates an unexpected error (see here):

{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:259","msg":"Waiting for volumesnapshotcontents snapcontent-186c144e-fabc-4354-bea3-ea7c5e19df7f to have snapshot handle. Retrying in 5s","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:54:50Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:259","msg":"Waiting for volumesnapshotcontents snapcontent-186c144e-fabc-4354-bea3-ea7c5e19df7f to have snapshot handle. Retrying in 5s","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:54:55Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:259","msg":"Waiting for volumesnapshotcontents snapcontent-186c144e-fabc-4354-bea3-ea7c5e19df7f to have snapshot handle. Retrying in 5s","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:55:00Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:407","msg":"Deleting Volumesnapshot hrweb/velero-testpvc-nvvgd","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:426","msg":"Deleted volumesnapshot with volumesnapshotContent hrweb/velero-testpvc-nvvgd","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"debug","logSource":"/__w/1/s/velero/pkg/backup/item_backupper.go:237","msg":"Executing post hooks","name":"velero-testpvc-nvvgd","namespace":"hrweb","resource":"volumesnapshots.snapshot.storage.k8s.io","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"debug","logSource":"/__w/1/s/velero/pkg/backup/item_backupper.go:237","msg":"Executing post hooks","name":"testpvc","namespace":"hrweb","resource":"persistentvolumeclaims","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"debug","logSource":"/__w/1/s/velero/pkg/backup/item_backupper.go:237","msg":"Executing post hooks","name":"webserver-b94d78ff-c8clp","namespace":"hrweb","resource":"pods","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"info","logSource":"/__w/1/s/velero/pkg/backup/backup.go:457","msg":"1 errors encountered backup up item","name":"webserver-b94d78ff-c8clp","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","error.message":"error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=hrweb, name=velero-testpvc-nvvgd): rpc error: code = Unknown desc = failed to get volumesnapshot hrweb/velero-testpvc-nvvgd: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline","level":"error","logSource":"/__w/1/s/velero/pkg/backup/backup.go:461","msg":"Error backing up item","name":"webserver-b94d78ff-c8clp","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"info","logSource":"/__w/1/s/velero/pkg/backup/backup.go:405","msg":"Backed up 4 items out of an estimated total of 1299 (estimate will change throughout the backup)","name":"webserver-b94d78ff-c8clp","namespace":"hrweb","progress":"","resource":"pods","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"info","logSource":"/__w/1/s/velero/pkg/backup/backup.go:365","msg":"Processing item","name":"webserver-b94

Shashank1306s commented 2 months ago

Hi @Lyndon-Li , anyway we have the default timeout of 10min and the polling at an interval of 5 sec, from the logs we can see that the polling succeeded after 2-3 retries, so I don't think we hit timeout here over here.

blackpiglet commented 2 months ago

This issue may relate to #7978.

anshulahuja98 commented 2 months ago

@blackpiglet can you help explain why you think they might be correlated? I am still not clear on what is the meaning of this error message.

blackpiglet commented 2 months ago

Error: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline

IMO, this error means the client failed to get the VolumeSnapshot in the time limitation of the go-client. One possible reason is the VolumeSnaphotContent rePut caused by #7978 causing a significant load on the kube-apiserver.

curtis-baillie commented 1 month ago

Hi, any update on this issue? I seem to be dealing with the same issue on v1.14.0:

Warnings:  <error getting warnings: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline>

Errors:  <error getting errors: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline>

veerendra2 commented 1 week ago

In our case, we get hit by this issue because of https://github.com/Azure/AKS/issues/4555

blackpiglet commented 1 week ago

@curtis-baillie Could you try v1.14.1?

monotek commented 1 week ago

We already do.

vmware-tanzu / velero

Error: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline #7991