vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

AKS CSI Driver Throttling - RawError: azure cloud provider throttled for operation SnapshotCreateOrUpdate with reason "client throttled" #6209

Closed mayankagg9722 closed 1 year ago

mayankagg9722 commented 1 year ago

What steps did you take and what happened:

VolumeSnapshotContents in the cluster are failing with "RawError: azure cloud provider throttled for operation SnapshotCreateOrUpdate with reason "client throttled"", which causes backups to time out because of continuous polling for these volume snapshots. On further investigation of this cluster, we found that the disk PVCs for which these snapshot operations were triggered do not exist, so the first attempt fails with HTTP 404 Not Found. However, because the VolumeSnapshot and VolumeSnapshotContent resources are not cleaned up after the backup is marked as timed out or failed, the CSI driver keeps polling for them, and subsequent backups fail because of CSI driver throttling.

Sample log where VSC polling failed due to throttling:

CreateSnapshot for content snapcontent-1a602120-a722-4e5b-905b-232f1bee7de7 returned error: rpc error: code = Internal desc = create snapshot error: Retriable: true, RetryAfter: 252s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation SnapshotCreateOrUpdate with reason \"client throttled\"\n",


What did you expect to happen:

Once polling for the VolumeSnapshot and VolumeSnapshotContent completes, if they are still in a failed state, we should delete them from the cluster before failing or timing out the backup.

Suggestion for code path where we can add this logic: https://github.com/vmware-tanzu/velero-plugin-for-csi/blob/b4e5fbbf5b236d132640bb08a8e0c34e1d5c662d/internal/util/util.go#LL187C1-L196C3
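A minimal sketch of the proposed behavior, not the plugin's actual code: poll a fixed number of times, and if the snapshot never becomes ready, delete the VolumeSnapshot and VolumeSnapshotContent before returning the failure. The `snapshotStore` interface, `waitOrCleanup`, and `fakeStore` are hypothetical names introduced here for illustration; a real change would use the snapshot client already present in the linked `util.go`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// snapshotStore is a hypothetical stand-in for the Kubernetes client calls
// the CSI plugin would make. It is not a Velero or client-go type.
type snapshotStore interface {
	// IsReady reports whether the snapshot content for the named
	// VolumeSnapshot is ready to use.
	IsReady(name string) (bool, error)
	// Delete removes the VolumeSnapshot and its VolumeSnapshotContent.
	Delete(name string) error
}

var errSnapshotNotReady = errors.New("volume snapshot not ready before timeout")

// waitOrCleanup polls up to `attempts` times. If the snapshot is still not
// ready afterwards, it deletes the snapshot resources so the CSI driver
// stops retrying, then returns an error so the backup can be failed.
func waitOrCleanup(store snapshotStore, name string, attempts int, interval time.Duration) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		ready, err := store.IsReady(name)
		if err == nil && ready {
			return nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}

	// Polling is exhausted: clean up before reporting the failure.
	if delErr := store.Delete(name); delErr != nil {
		return fmt.Errorf("%w: cleanup of %s also failed: %v", errSnapshotNotReady, name, delErr)
	}
	if lastErr != nil {
		return fmt.Errorf("%w: %s (last error: %v)", errSnapshotNotReady, name, lastErr)
	}
	return fmt.Errorf("%w: %s", errSnapshotNotReady, name)
}

// fakeStore is an in-memory implementation used only to exercise the sketch.
type fakeStore struct {
	ready   map[string]bool
	deleted map[string]bool
}

func newFakeStore(ready map[string]bool) *fakeStore {
	return &fakeStore{ready: ready, deleted: map[string]bool{}}
}

func (f *fakeStore) IsReady(name string) (bool, error) {
	if f.deleted[name] {
		return false, errors.New("not found")
	}
	return f.ready[name], nil
}

func (f *fakeStore) Delete(name string) error {
	f.deleted[name] = true
	return nil
}

func main() {
	store := newFakeStore(map[string]bool{"snap-ok": true, "snap-stuck": false})
	fmt.Println(waitOrCleanup(store, "snap-ok", 3, 10*time.Millisecond))
	fmt.Println(waitOrCleanup(store, "snap-stuck", 3, 10*time.Millisecond))
	fmt.Println("snap-stuck deleted:", store.deleted["snap-stuck"])
}
```

The cleanup runs only on the failure path, so snapshots that became ready are left untouched, while stuck ones no longer trigger repeated SnapshotCreateOrUpdate calls on later backups.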



blackpiglet commented 1 year ago

Closing for now; this issue will be tracked in #6219.