vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Clean up CSI snapshots when polling of the VSC times out in the CSI plugin #6219

Closed · anshulahuja98 closed this 1 year ago

anshulahuja98 commented 1 year ago

What steps did you take and what happened:

As of today, the CSI plugin waits for the VSC to be reconciled for a default of 10 minutes. After that, the plugin leaves the VS in the cluster. This leads to an accumulation of useless VSs in the cluster, which ends up throttling the CSI drivers installed in the cluster because the CSI driver keeps retrying endlessly.

What did you expect to happen: Clean up the VS if tracking has failed in the CSI plugin.
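For illustration, a minimal sketch of the expected flow (this is not the plugin's actual code; the waitOrCleanup helper is hypothetical and the external-snapshotter v1 client API is assumed): poll the bound VSC for a snapshot handle and, if the poll times out, delete the VS instead of leaving it in the cluster.

```go
// Sketch only: wait for the VolumeSnapshotContent bound to a VolumeSnapshot to
// report a snapshot handle, and delete the VolumeSnapshot if the wait fails,
// so no orphan VS is left behind after the timeout.
package snapshotcleanup

import (
	"context"
	"fmt"
	"time"

	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v4/apis/volumesnapshot/v1"
	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitOrCleanup polls until the VSC bound to vs has a snapshot handle or the
// timeout (10 minutes by default in the plugin) elapses; on failure it deletes
// the VolumeSnapshot instead of leaving it in the cluster.
func waitOrCleanup(ctx context.Context, c snapshotclient.Interface, vs *snapshotv1.VolumeSnapshot, timeout time.Duration) error {
	err := wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		updated, getErr := c.SnapshotV1().VolumeSnapshots(vs.Namespace).Get(ctx, vs.Name, metav1.GetOptions{})
		if getErr != nil {
			return false, nil // transient; keep polling
		}
		vs = updated
		if vs.Status == nil || vs.Status.BoundVolumeSnapshotContentName == nil {
			return false, nil // VSC not bound yet
		}
		vsc, getErr := c.SnapshotV1().VolumeSnapshotContents().Get(ctx, *vs.Status.BoundVolumeSnapshotContentName, metav1.GetOptions{})
		if getErr != nil {
			return false, nil
		}
		return vsc.Status != nil && vsc.Status.SnapshotHandle != nil, nil
	})
	if err == nil {
		return nil
	}

	// Tracking failed: remove the VS so it does not accumulate and keep the
	// external snapshotter from reprocessing it forever.
	if delErr := c.SnapshotV1().VolumeSnapshots(vs.Namespace).Delete(ctx, vs.Name, metav1.DeleteOptions{}); delErr != nil {
		return fmt.Errorf("snapshot not ready and cleanup failed: %v (original error: %v)", delErr, err)
	}
	return fmt.Errorf("snapshot %s/%s not ready within %s, deleted to avoid orphaning it: %v", vs.Namespace, vs.Name, timeout, err)
}
```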

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

Anything else you would like to add:

Related issue #6209 and also in similar code path as #6165

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

blackpiglet commented 1 year ago

> ends up throttling the CSI drivers installed in the cluster. Because the CSI driver keeps on retrying endlessly.

Could you give more background on how the CSI drivers reach their throttle limit when legacy VSs are left behind? Although VSs accumulate, they should end up in the ReadyToUse state; do the CSI drivers still poll on those VSs?

anshulahuja98 commented 1 year ago

https://github.com/vmware-tanzu/velero/issues/4819 https://github.com/kubernetes-csi/external-snapshotter/issues/533

The polling is controlled by the external snapshotter, which will retry relentlessly; we have verified this from the logs. Earlier, even the exponential backoff was broken in the external snapshotter. It is fixed now (AFAIK), but the expectation is still that there is a function like "checkSnapshotAndUpdate" which is called periodically, and if a VS/VSC is not ready, it will attempt to fix it.
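To illustrate the concern (a sketch of the behavior described above only, not the external snapshotter's actual code): every not-ready VS that is left behind keeps getting re-processed on each resync, which keeps generating calls against the CSI driver.

```go
// Illustrative sketch: a periodic resync that re-checks every VolumeSnapshot
// that is not yet ReadyToUse. Orphaned, never-ready VSs are re-processed on
// every pass, which is where the repeated load on the CSI driver comes from.
package snapshotresync

import (
	"context"
	"log"
	"time"

	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func resyncLoop(ctx context.Context, c snapshotclient.Interface, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			vsList, err := c.SnapshotV1().VolumeSnapshots(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
			if err != nil {
				continue
			}
			for _, vs := range vsList.Items {
				if vs.Status != nil && vs.Status.ReadyToUse != nil && *vs.Status.ReadyToUse {
					continue // ready snapshots need no further work
				}
				// Not-ready snapshots are retried here on every resync, e.g. by
				// calling back into the CSI driver to check or create the snapshot.
				log.Printf("re-checking not-ready VolumeSnapshot %s/%s", vs.Namespace, vs.Name)
			}
		}
	}
}
```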

blackpiglet commented 1 year ago

Do you mean the function checkandUpdateContentStatusOperation? If we are talking about that function, checking the not-ready VolumeSnapshotContents or VolumeSnapshots seems inevitable, and it only happens for the not-yet-ready VolumeSnapshotContents. The accumulated VolumeSnapshotContents should have no impact on this.

anshulahuja98 commented 1 year ago

I can check back on the exact code pointers (it will take me some time to look at the code again), but I am confident about the behavior of the external snapshotter retrying endlessly. We have verified it multiple times from the external snapshotter logs.

Keeping that investigation aside, do you see any merit otherwise in leaving the orphan VSs in the cluster? Irrespective of that, we should still clean up the VSs which are not being used and won't be GCed from the cluster.

blackpiglet commented 1 year ago

Those VolumeSnapshots are kept for backup deletion: they are needed to delete the snapshot resources from the cloud providers. I agree with you that this is not an ideal approach and that they should be cleaned up. With the current design, if we delete those VolumeSnapshots, they would have to be restored before deleting the backup to prevent snapshots from being left on the providers forever, and I didn't find a fitting place to do that.

anshulahuja98 commented 1 year ago

Those VSs are not even tracked as part of the backup on failure; the BIA returns nil.

So during GC these VSs are not cleaned up. They are just orphan VSs.

anshulahuja98 commented 1 year ago

My plan is to delete the VS on error at this line: https://github.com/vmware-tanzu/velero-plugin-for-csi/blob/b4e5fbbf5b236d132640bb08a8e0c34e1d5c662d/internal/backup/volumesnapshot_action.go#LL95C1-L95C1
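Roughly, a sketch of what I have in mind (illustrative only, not the actual velero-plugin-for-csi code; the v1 BackupItemAction interface is assumed and waitForSnapshotHandle is a hypothetical stand-in for the existing poll on the VSC): on error, delete the orphan VS before returning instead of leaving it behind.

```go
// Sketch of the proposed error-path cleanup in the VolumeSnapshot backup item action.
package backup

import (
	"context"
	"fmt"

	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v4/apis/volumesnapshot/v1"
	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned"
	velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	"github.com/vmware-tanzu/velero/pkg/plugin/velero"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

type VolumeSnapshotBackupItemAction struct {
	SnapshotClient snapshotclient.Interface
}

// waitForSnapshotHandle is hypothetical: assume it polls the bound
// VolumeSnapshotContent and returns an error when the timeout elapses.
func waitForSnapshotHandle(vs *snapshotv1.VolumeSnapshot) error {
	return fmt.Errorf("timed out waiting for the VolumeSnapshotContent of %s/%s", vs.Namespace, vs.Name)
}

func (a *VolumeSnapshotBackupItemAction) Execute(item runtime.Unstructured, backup *velerov1api.Backup) (runtime.Unstructured, []velero.ResourceIdentifier, error) {
	vs := new(snapshotv1.VolumeSnapshot)
	if err := runtime.DefaultUnstructuredConverter.FromUnstructured(item.UnstructuredContent(), vs); err != nil {
		return nil, nil, err
	}

	if err := waitForSnapshotHandle(vs); err != nil {
		// Today the plugin returns here without tracking the VS, so GC never
		// deletes it. The proposed change: delete the orphan VS before returning.
		if delErr := a.SnapshotClient.SnapshotV1().VolumeSnapshots(vs.Namespace).Delete(
			context.TODO(), vs.Name, metav1.DeleteOptions{}); delErr != nil {
			return nil, nil, fmt.Errorf("cleaning up VolumeSnapshot failed: %v (original error: %v)", delErr, err)
		}
		return nil, nil, err
	}

	// On success the real plugin also returns the bound VSC as an additional
	// item so it is tracked with the backup; omitted here for brevity.
	return item, nil, nil
}
```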

blackpiglet commented 1 year ago

If we are talking about the error case, I agree those VSs should be cleaned up in this place.

Lyndon-Li commented 1 year ago

Closing as completed.