Shashank1306s opened this issue 3 months ago
Please share more details:
- Where do you see the error "failed to get volumesnapshot..."?
- Please share the entire debug bundle generated by `velero debug`.
- Which Velero version are you using?
We are on Velero 1.13.
Attaching the Velero log file: clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109-logs.gz
It looks like the wait/retry mechanism hits a timeout, which causes client-go to generate an unexpected error (see here):
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:259","msg":"Waiting for volumesnapshotcontents snapcontent-186c144e-fabc-4354-bea3-ea7c5e19df7f to have snapshot handle. Retrying in 5s","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:54:50Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:259","msg":"Waiting for volumesnapshotcontents snapcontent-186c144e-fabc-4354-bea3-ea7c5e19df7f to have snapshot handle. Retrying in 5s","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:54:55Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:259","msg":"Waiting for volumesnapshotcontents snapcontent-186c144e-fabc-4354-bea3-ea7c5e19df7f to have snapshot handle. Retrying in 5s","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:55:00Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:407","msg":"Deleting Volumesnapshot hrweb/velero-testpvc-nvvgd","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:426","msg":"Deleted volumesnapshot with volumesnapshotContent hrweb/velero-testpvc-nvvgd","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"debug","logSource":"/__w/1/s/velero/pkg/backup/item_backupper.go:237","msg":"Executing post hooks","name":"velero-testpvc-nvvgd","namespace":"hrweb","resource":"volumesnapshots.snapshot.storage.k8s.io","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"debug","logSource":"/__w/1/s/velero/pkg/backup/item_backupper.go:237","msg":"Executing post hooks","name":"testpvc","namespace":"hrweb","resource":"persistentvolumeclaims","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"debug","logSource":"/__w/1/s/velero/pkg/backup/item_backupper.go:237","msg":"Executing post hooks","name":"webserver-b94d78ff-c8clp","namespace":"hrweb","resource":"pods","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"info","logSource":"/__w/1/s/velero/pkg/backup/backup.go:457","msg":"1 errors encountered backup up item","name":"webserver-b94d78ff-c8clp","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","error.message":"error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=hrweb, name=velero-testpvc-nvvgd): rpc error: code = Unknown desc = failed to get volumesnapshot hrweb/velero-testpvc-nvvgd: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline","level":"error","logSource":"/__w/1/s/velero/pkg/backup/backup.go:461","msg":"Error backing up item","name":"webserver-b94d78ff-c8clp","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"info","logSource":"/__w/1/s/velero/pkg/backup/backup.go:405","msg":"Backed up 4 items out of an estimated total of 1299 (estimate will change throughout the backup)","name":"webserver-b94d78ff-c8clp","namespace":"hrweb","progress":"","resource":"pods","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"info","logSource":"/__w/1/s/velero/pkg/backup/backup.go:365","msg":"Processing item","name":"webserver-b94
Hi @Lyndon-Li, we have the default timeout of 10 minutes and polling at a 5-second interval. From the logs we can see that the polling succeeded after 2-3 retries, so I don't think we hit the timeout here.
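For context, the wait pattern in question looks roughly like this. This is an assumed sketch based on the log messages above, not the actual velero-plugin-for-csi source; `waitForSnapshotHandle`, `cs`, and `vscName` are illustrative names:

```go
package util

import (
	"context"
	"time"

	snapshotclientset "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForSnapshotHandle polls every 5s, for up to 10m, until the named
// VolumeSnapshotContent reports a snapshot handle (hypothetical helper).
func waitForSnapshotHandle(ctx context.Context, cs snapshotclientset.Interface, vscName string) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			vsc, err := cs.SnapshotV1().VolumeSnapshotContents().Get(ctx, vscName, metav1.GetOptions{})
			if err != nil {
				return false, nil // not ready yet; retry on the next tick
			}
			return vsc.Status != nil && vsc.Status.SnapshotHandle != nil, nil
		})
}
```

Every Get inside the loop goes through the same client-side rate limiter and is bound by the same ctx, so a call issued close to the overall deadline can fail with the rate limiter error even though every earlier poll succeeded, which would match what the logs show.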
This issue may relate to #7978.
@blackpiglet can you help explain why you think they might be correlated? I am still not clear on the meaning of this error message:
Error: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
IMO, this error means the client failed to get the VolumeSnapshot within client-go's time limit. One possible reason is the VolumeSnapshotContent re-put caused by #7978, which puts a significant load on the kube-apiserver.
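If apiserver load is the trigger, one mitigation is to give the client a bigger budget: client-go builds its default token-bucket limiter from rest.Config's QPS and Burst (which default to 5 and 10). A minimal sketch with illustrative values:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	// Defaults are QPS=5, Burst=10; under a busy apiserver the queue of
	// throttled requests grows until Wait can no longer fit a request
	// before the caller's context deadline.
	cfg.QPS = 100
	cfg.Burst = 100

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("client ready with QPS=%v Burst=%v\n", cfg.QPS, cfg.Burst)
	_ = client
}
```

Velero exposes the same knobs as `velero server` flags (`--client-qps` and `--client-burst`), which is usually more practical than a code change.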
Hi, any update on this issue? I seem to be dealing with the same issue on v1.14.0:
Warnings: <error getting warnings: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline>
Errors: <error getting errors: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline>
In our case, we get hit by this issue because of https://github.com/Azure/AKS/issues/4555
@curtis-baillie Could you try v1.14.1?
We already do.
What steps did you take and what happened:
There are a bunch of clusters where the backup of snapshots is failing with the following error:
What did you expect to happen: The cluster seems to be healthy; we are not seeing any issue related to pod resource crunch or the API server in the timeframe when this error occurred.
The following information will help us better understand what's going on: We found a related GitHub issue: https://github.com/helm/helm/issues/12154. The error is thrown at this code path: https://github.com/kubernetes/client-go/blob/354ed1bc9f1f48c820fdc5e84b566acfd716cf42/rest/request.go#L619
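The linked code path wraps the limiter's error before any HTTP request is sent, which can be reproduced against any cluster by deliberately starving the limiter. A hedged demo; the tiny QPS value and the "default" namespace are intentional illustrations, not recommended settings:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cfg.QPS = 0.1 // one request per 10s
	cfg.Burst = 1 // a single burst token

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// The first call spends the burst token; the second would have to
	// wait ~10s for a new one, which exceeds the 2s deadline, so
	// client-go fails in its throttling step before sending anything.
	_, _ = client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{})
	_, err = client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{})
	fmt.Println(err)
	// client rate limiter Wait returned an error:
	// rate: Wait(n=1) would exceed context deadline
}
```

This matches the shape of the failure in the logs: it is the deadline on the backup's context, not the apiserver itself, that the throttled call trips over.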
If you are using velero v1.7.0+:
Please use `velero debug --backup <backupname> --restore <restorename>` to generate the support bundle and attach it to this issue. For more options, please refer to `velero debug --help`.
If you are using earlier versions:
Please provide the output of the following commands (pasting long output into a GitHub gist or other pastebin is fine):
- `kubectl logs deployment/velero -n velero`
- `velero backup describe <backupname>` or `kubectl get backup/<backupname> -n velero -o yaml`
- `velero backup logs <backupname>`
- `velero restore describe <restorename>` or `kubectl get restore/<restorename> -n velero -o yaml`
- `velero restore logs <restorename>`
Anything else you would like to add:
Environment:
- Velero version (use `velero version`):
- Velero features (use `velero client config get features`):
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

Vote on this issue!
This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.