...ends up throttling the CSI drivers installed in the cluster, because the CSI driver keeps retrying endlessly.
Could you give more background on the CSI drivers hitting their throttle limit when there are leftover legacy VSs? Although VSs accumulate, they should end up in the ReadyToUse state; do CSI drivers still poll those VSs?
https://github.com/vmware-tanzu/velero/issues/4819 https://github.com/kubernetes-csi/external-snapshotter/issues/533
The polling is controlled by the external snapshotter, which will retry relentlessly; I have verified this from the logs. Earlier even the exponential backoff was broken in the external snapshotter; it is fixed now (AFAIK), but the expectation is still that there is a function like "checkSnapshotANdUpdate" which is called periodically and, if a VS/VSC is not ready, attempts to fix it.
Do you mean this function, checkandUpdateContentStatusOperation?
If we are talking about this function, checking the not-ready VolumeSnapshotContents or VolumeSnapshots seems inevitable, but that only happens for the not-yet-ready VolumeSnapshotContents. The accumulated, already-ReadyToUse VolumeSnapshotContents should have no impact on this.
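To make the point of contention concrete, here is a minimal, self-contained Go sketch of the requeue pattern being discussed. This is not the external snapshotter's real code; the content type, the isReady-style check, and the queue wiring are illustrative assumptions. It only shows the general pattern: a sync handler that requeues an item with rate-limited backoff while it is not ready and forgets it once it is ReadyToUse.

```go
// Illustrative sketch only -- not the external-snapshotter's real code.
// It models the behaviour under discussion: a sync step that requeues a
// VolumeSnapshotContent (with backoff) only while it is not ReadyToUse.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// content is a stand-in for a VolumeSnapshotContent; readyToUse mirrors
// status.readyToUse on the real object.
type content struct {
	name       string
	readyToUse bool
}

func syncContent(queue workqueue.RateLimitingInterface, c content) {
	if c.readyToUse {
		// Ready contents are forgotten and cost nothing after this point.
		queue.Forget(c.name)
		return
	}
	// Not-ready contents are re-added with rate-limited backoff, which is
	// where repeated calls against the CSI driver would come from.
	fmt.Printf("%s not ready, requeueing\n", c.name)
	queue.AddRateLimited(c.name)
}

func main() {
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer q.ShutDown()

	syncContent(q, content{name: "vsc-ready", readyToUse: true})
	syncContent(q, content{name: "vsc-stuck", readyToUse: false})

	time.Sleep(100 * time.Millisecond)
	fmt.Println("items still queued:", q.Len())
}
```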
I can check back on the exact code pointers (it will take me some time to look at the code again), but I am confident about the behavior of the external snapshotter retrying endlessly. We have verified it multiple times from the external snapshotter logs.
Keeping that investigation aside, do you see any merit otherwise in leaving the orphan VSs in the cluster? Irrespective of that, we should still clean up the VSs which are not being used and won't be GCed from the cluster.
Those VolumeSnapshots are kept for backup deletion: they are used to delete the snapshot resources from the cloud providers. I agree with you that this is not ideal and they should be cleaned up. With the current design, if we delete those VolumeSnapshots, they would have to be restored before deleting the backup to prevent snapshots from being left on the providers forever, and I didn't find a fitting place to do that.
Those VSs are not even tracked as part of the backup on failure; the BIA returns nil. So during GC these VSs are not cleaned up. They are just orphan VSs.
My plan is to delete the VS on error at this line: https://github.com/vmware-tanzu/velero-plugin-for-csi/blob/b4e5fbbf5b236d132640bb08a8e0c34e1d5c662d/internal/backup/volumesnapshot_action.go#LL95C1-L95C1
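A minimal sketch of what that error-path cleanup could look like. The function name, signature, and call site are assumptions for illustration, not the plugin's actual code; it only assumes an external-snapshotter clientset is available in the backup action.

```go
// Illustrative sketch of the proposed error-path cleanup, not the plugin's
// actual code. The function name, the snapshotClient argument, and where it
// is called from are assumptions for illustration.
package backup

import (
	"context"

	snapshotv1api "github.com/kubernetes-csi/external-snapshotter/client/v4/apis/volumesnapshot/v1"
	snapshotterClientSet "github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned"
	"github.com/pkg/errors"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cleanupOrphanVS deletes the VolumeSnapshot when waiting on its
// VolumeSnapshotContent failed, so the backup does not leave an orphan VS
// behind for the external snapshotter to keep retrying.
func cleanupOrphanVS(ctx context.Context, snapshotClient snapshotterClientSet.Interface, vs *snapshotv1api.VolumeSnapshot, waitErr error) error {
	if waitErr == nil {
		return nil
	}
	delErr := snapshotClient.SnapshotV1().VolumeSnapshots(vs.Namespace).Delete(ctx, vs.Name, metav1.DeleteOptions{})
	if delErr != nil && !apierrors.IsNotFound(delErr) {
		return errors.Wrapf(delErr, "failed to clean up VolumeSnapshot %s/%s after wait error", vs.Namespace, vs.Name)
	}
	// Surface the original wait error so the backup still reports the failure.
	return waitErr
}
```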
If we are talking about the error case, I agree those VSs should be cleaned up in this place.
Close as completed.
What steps did you take and what happened:
As of today the CSI plugin waits for the VSC to get reconciled for 10 minutes by default. After that, the CSI plugin leaves the VS in the cluster. This leads to an accumulation of useless VSs in the cluster, which ends up throttling the CSI drivers installed in the cluster, because the CSI driver keeps retrying endlessly.
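For context, a rough sketch of the wait-then-give-up flow described above, under stated assumptions: the helper name, the snapshot clientset argument, and the exact interval/timeout values are illustrative, not the plugin's exact code. On timeout the error is returned and, today, the VS is simply left in the cluster.

```go
// Rough illustration of the reported behaviour, not the plugin's exact code:
// poll until the bound VolumeSnapshotContent reports a snapshot handle, with
// an assumed 10 minute default timeout.
package backup

import (
	"context"
	"time"

	snapshotv1api "github.com/kubernetes-csi/external-snapshotter/client/v4/apis/volumesnapshot/v1"
	snapshotterClientSet "github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

func waitForVSCHandle(ctx context.Context, snapshotClient snapshotterClientSet.Interface, vs *snapshotv1api.VolumeSnapshot) error {
	const (
		interval = 5 * time.Second
		timeout  = 10 * time.Minute // assumed default mentioned in the report
	)
	return wait.PollImmediate(interval, timeout, func() (bool, error) {
		current, err := snapshotClient.SnapshotV1().VolumeSnapshots(vs.Namespace).Get(ctx, vs.Name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if current.Status == nil || current.Status.BoundVolumeSnapshotContentName == nil {
			return false, nil
		}
		vsc, err := snapshotClient.SnapshotV1().VolumeSnapshotContents().Get(ctx, *current.Status.BoundVolumeSnapshotContentName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		// Done only when the provider-side snapshot handle has been recorded.
		return vsc.Status != nil && vsc.Status.SnapshotHandle != nil, nil
	})
}
```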
What did you expect to happen: Clean up the VS if tracking has failed in the CSI plugin.
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use
velero debug --backup <backupname> --restore <restorename>
to generate the support bundle, and attach it to this issue. For more options, please refer to velero debug --help
If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)
kubectl logs deployment/velero -n velero
velero backup describe <backupname>
or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename>
or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>
Anything else you would like to add:
Related issue #6209 and also in similar code path as #6165
Environment:
Velero version (use velero version):
Velero features (use velero client config get features):
Kubernetes version (use kubectl version):
OS (e.g. from /etc/os-release):
Vote on this issue!
This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.