Improve CSI Snapshotting Performance

anshulahuja98 commented 1 year ago

Describe the problem/challenge you have

Currently, Velero takes CSI snapshots sequentially, with a default timeout of 10 minutes per snapshot. In failure scenarios, this can lead to the backup job taking N*10 minutes to complete, where N is the number of PVCs.

Describe the solution you'd like

The proposal is to batch CSI snapshots in groups of (N/10) at a time and track them in parallel, to avoid the delays caused by errors encountered while snapshotting. In addition, the suggestion is to make error handling pluggable, so that each platform can supply its own set of retryable/non-retryable errors, which the CSI snapshot polling mechanism can use to fail early and exit.

Anything else you would like to add:

Discussed in depth in this Slack thread: https://kubernetes.slack.com/archives/C021GPR1L3S/p1681899379315259

anshulahuja98 commented 1 year ago

CC: @Lyndon-Li, @blackpiglet, @reasonerjt

anshulahuja98 commented 1 year ago

As discussed in Slack, the plan is to wait for the BIAv2 porting of the CSI plugin. This issue can be addressed there.

reasonerjt commented 1 year ago

I think the challenge is how to handle the "post hook": we need to wait for the snapshot handle to exist before triggering the hook, so it may not be feasible to move the code that waits for the handle into the finalizing phase.

@anshulahuja98 what are your thoughts?

anshulahuja98 commented 1 year ago

Yes, exactly.

Given the limitation of having to wait for the snapshotHandle in the core flow, the reduced scope will be to aim for parallelization of that polling.

As of today, snapshotting happens in the context of a pod. So when we discover that there are, say, 3 PVCs for a pod, we snapshot and poll them one after another. This could be improved by snapshotting them in parallel and waiting in parallel for the snapshotHandle. Then we move on to the next pod and snapshot all the volumes of that pod together.

Example to illustrate: Pod A has 3 PVCs, Pod B has 2 PVCs, and Pod C has 2 PVCs. Today, in the worst case, we wait (3+2+2) * 10 mins = 70 mins. This can go down to 10 mins (Pod A, track 3 in parallel) + 10 mins (Pod B, track 2 in parallel) + 10 mins (Pod C, track 2 in parallel) = 30 mins.

If there are no hooks on the pods using the PVCs, we snapshot all of them together and then wait on them together.

A couple of community meetings ago, @sseago suggested that we could perhaps write a BIA on Pod to achieve this functionality.

I would request that we keep this open for 1.13; we'll try to reach closure on the approach. Unfortunately, I don't have full clarity on the implementation yet either.

anshulahuja98 commented 1 year ago

In the 1.13 timeline, the plan is to close on the approach in discussion with @sseago. I have already tried one draft approach (https://github.com/vmware-tanzu/velero/pull/6860). I am working with Scott to try out the Pod BIA approach; we are currently seeing some challenges there. CC: @reasonerjt, @ywk253100

Lyndon-Li commented 1 year ago

Thanks for all the effort! Let me add this issue to the 1.13 milestone, since we have started working on it.

anshulahuja98 commented 12 months ago

We have some consensus on the approach, but closing on the design doc will happen after 1.13.

The earlier plan was also to close on the approach in 1.13.

vmware-tanzu / velero

Improve CSI Snapshotting Performance #6165