What steps did you take and what happened:
VolumeSnapshotContents in the cluster are failing with the following error: RawError: azure cloud provider throttled for operation SnapshotCreateOrUpdate with reason "client throttled". As a result, backups time out because of continuous polling for these volume snapshots. Investigating further in this cluster, we found that the disk PVCs for which these snapshot operations were triggered do not exist, so the operations first fail with HTTP 404 Not Found. However, because the VolumeSnapshot and VolumeSnapshotContent resources are not cleaned up after the backup is marked as timed out/failed, the CSI driver keeps polling for them, and further backups fail with CSI driver throttling.
Sample log where VSC polling fails due to throttling:
CreateSnapshot for content snapcontent-1a602120-a722-4e5b-905b-232f1bee7de7 returned error: rpc error: code = Internal desc = create snapshot error: Retriable: true, RetryAfter: 252s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation SnapshotCreateOrUpdate with reason \"client throttled\"\n",
What did you expect to happen:
Once polling for the VolumeSnapshot and VolumeSnapshotContent completes, if they are still in a failed state, we should delete them from the cluster before we fail or time out the backup.
Suggestion for code path where we can add this logic: https://github.com/vmware-tanzu/velero-plugin-for-csi/blob/b4e5fbbf5b236d132640bb08a8e0c34e1d5c662d/internal/util/util.go#LL187C1-L196C3
The following information will help us better understand what's going on:
Environment:
Velero version (use velero version): v1.9
velero-plugin-for-csi: v0.3.0
Kubernetes installer & version: AKS 1.25.6
Cloud provider or hardware configuration: Azure AKS
OS (e.g. from /etc/os-release): Ubuntu 18.04.6 LTS
Vote on this issue!
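This is an invitation to the Velero community to vote on issues. You can see the project's top voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.
:+1: for "I would like to see this bug fixed as soon as possible"
:-1: for "There are more important bugs to focus on right now"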