[RFE] support running restic backups concurrently

vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes

https://velero.io

Apache License 2.0


[RFE] support running restic backups concurrently #1531

Open king-jam opened 5 years ago

king-jam commented 5 years ago

Describe the solution you'd like

When a single request is made to Velero to back up multiple applications/pods (for example, backing up an entire namespace), the resources within the backup job are backed up sequentially rather than in parallel. This is a problem when the list of resources contains large PVs, because the backup job takes longer than desired. If possible, the job should execute with a pool of concurrent workers.

Environment:

Velero version (use velero version): master/1.0.0
Kubernetes version (use kubectl version): master/v1.14
Kubernetes installer & version: N/A
Cloud provider or hardware configuration: Restic/Minio
OS (e.g. from /etc/os-release): Ubuntu/CentOS

skriss commented 5 years ago

Thanks for logging this @king-jam. We think we may be able to up the number of workers running for each controller in the restic daemonset and get the desired parallelism here without any material code changes - we'll definitely queue this up, as we're planning on doing a bunch of work with the restic integration over the next release or two.

king-jam commented 5 years ago

So if I'm reading the code correctly, we will have to up the number of workers, but that alone won't resolve the issue.

I believe the restic daemonset would be able to do parallel backups but the itemBackup code still gets executed sequentially so the PVs would still get processed in a synchronous sequential way.

I think the solution is to make the itemBackup code concurrent (with a configurable # of workers) AND the code for PVs concurrent. This handles both the case of multiple pods with a single PV attached to each and the case of a single pod with many PVs attached.
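To make the proposal concrete, here is a minimal Go sketch of a bounded worker pool that treats every (pod, volume) pair as one job, so both cases above go through the same pool. This is not Velero's code: `Pod`, `backupVolume`, and `backupAll` are hypothetical names, and the per-volume backup is a placeholder.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Pod is a hypothetical stand-in for a pod and the volumes to back up.
type Pod struct {
	Name    string
	Volumes []string
}

// backupVolume is a placeholder for the real per-volume restic backup.
func backupVolume(pod, volume string) string {
	return pod + "/" + volume
}

// backupAll backs up every volume of every pod with at most `workers`
// volume backups in flight, regardless of which pod a volume belongs to.
func backupAll(pods []Pod, workers int) []string {
	if workers < 1 {
		workers = 1
	}
	type job struct{ pod, volume string }
	jobs := make(chan job)
	results := make(chan string)

	// Start a fixed number of workers that drain the job channel.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- backupVolume(j.pod, j.volume)
			}
		}()
	}

	// Feed one job per (pod, volume) pair, then signal that no more are coming.
	go func() {
		for _, p := range pods {
			for _, v := range p.Volumes {
				jobs <- job{pod: p.Name, volume: v}
			}
		}
		close(jobs)
	}()

	// Close the results channel once every worker has exited.
	go func() {
		wg.Wait()
		close(results)
	}()

	var done []string
	for r := range results {
		done = append(done, r)
	}
	sort.Strings(done) // completion order is nondeterministic, so sort for stable output
	return done
}

func main() {
	pods := []Pod{
		{Name: "db", Volumes: []string{"data", "wal"}},
		{Name: "web", Volumes: []string{"uploads"}},
	}
	fmt.Println(backupAll(pods, 2))
}
```

Flattening pods and volumes into a single job queue is only one possible design; separate limits per pod and per volume would also work but need two levels of coordination.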

ac-hibbert commented 4 years ago

Are there any timelines as to when these improvements to restic will be implemented?

duyanghao commented 4 years ago

Are there any timelines as to when these improvements to restic will be implemented?

duyanghao commented 4 years ago

@skriss wrote: "Thanks for logging this @king-jam. We think we may be able to up the number of workers running for each controller in the restic daemonset and get the desired parallelism here without any material code changes - we'll definitely queue this up, as we're planning on doing a bunch of work with the restic integration over the next release or two."

@skriss How can the number of workers running for each controller in the restic daemonset be increased? Are there any arguments for this?

skriss commented 4 years ago

@duyanghao the # of workers is set at https://github.com/vmware-tanzu/velero/blob/master/pkg/cmd/cli/restic/server.go#L174 and https://github.com/vmware-tanzu/velero/blob/master/pkg/cmd/cli/restic/server.go#L191, but as @king-jam noted, this would only get us parallelism across multiple volumes within a single pod, not parallelism across pods.
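The limitation can be illustrated with a small Go simulation. It is a sketch under stated assumptions, not the actual controller code: `peakSequentialPods` is a hypothetical function in which the caller submits one pod's volumes, waits for them to finish, and only then moves on to the next pod. It reports the highest number of volume backups that were ever in flight at once.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// peakSequentialPods simulates a caller that handles pods one at a time.
// podVolumes[i] is the number of volumes attached to pod i, and `workers`
// caps how many volume backups may run at once. The return value is the
// peak number of simultaneous volume backups that was observed.
func peakSequentialPods(podVolumes []int, workers int) int {
	if workers < 1 {
		workers = 1
	}
	var mu sync.Mutex
	inFlight, peak := 0, 0
	sem := make(chan struct{}, workers) // counting semaphore for the worker limit

	for _, n := range podVolumes { // pods are processed strictly one after another
		var wg sync.WaitGroup
		for v := 0; v < n; v++ {
			wg.Add(1)
			sem <- struct{}{}
			go func() {
				defer wg.Done()
				defer func() { <-sem }()

				mu.Lock()
				inFlight++
				if inFlight > peak {
					peak = inFlight
				}
				mu.Unlock()

				time.Sleep(5 * time.Millisecond) // placeholder for the real backup work

				mu.Lock()
				inFlight--
				mu.Unlock()
			}()
		}
		wg.Wait() // the caller blocks here until this pod's volumes are done
	}
	return peak
}

func main() {
	// Four pods with one volume each: extra workers cannot help.
	fmt.Println(peakSequentialPods([]int{1, 1, 1, 1}, 4))
	// One pod with four volumes: the workers can overlap within the pod.
	fmt.Println(peakSequentialPods([]int{4}, 4))
}
```

With one volume per pod the peak is always 1, no matter how many workers are configured, because the per-pod wait serializes everything. Only the single-pod, multi-volume case can overlap.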

ThoTischner commented 4 years ago

Any news on this? I think we are having scaling issues because of the sequential restic backups.

stephbman commented 4 years ago

@skriss based on a review of some of the issues, my feeling is that this may need to be linked with #1653 - what are your thoughts?

skriss commented 4 years ago

@stephbman I think this would be more around improving the performance of a single backup, since it would involve parallelizing the operations within a Velero backup.

Re: technical design, we could consider using a worker-pod-like approach here, but I'm not sure it's actually necessary; the existing restic daemonset can probably already handle running multiple operations simultaneously, so it'd just be a matter of having Velero trigger them in parallel rather than sequentially.
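A rough Go sketch of "trigger in parallel, then wait" follows. It assumes a hypothetical `submitBackup` function that hands one pod's backup to the daemonset and returns immediately with a channel for the result. None of these names come from Velero.

```go
package main

import (
	"errors"
	"fmt"
)

// submitBackup is a hypothetical stand-in for asking the restic daemonset
// to back up one pod's volumes. It returns at once; the outcome arrives
// later on the returned channel.
func submitBackup(pod string) <-chan error {
	done := make(chan error, 1) // buffered so the sender never blocks
	go func() {
		if pod == "" {
			done <- errors.New("empty pod name")
			return
		}
		done <- nil // placeholder for a successful backup
	}()
	return done
}

// backupPods triggers the backup of every pod first and only then waits
// for the results. It returns the names of the pods whose backup failed.
func backupPods(pods []string) []string {
	type pending struct {
		pod  string
		done <-chan error
	}

	// Phase 1: fire off all requests without waiting on any of them.
	inFlight := make([]pending, 0, len(pods))
	for _, p := range pods {
		inFlight = append(inFlight, pending{pod: p, done: submitBackup(p)})
	}

	// Phase 2: collect the results in submission order.
	var failed []string
	for _, f := range inFlight {
		if err := <-f.done; err != nil {
			failed = append(failed, f.pod)
		}
	}
	return failed
}

func main() {
	failed := backupPods([]string{"db", "web", "cache"})
	fmt.Println(len(failed))
}
```

The sequential behavior described in this issue corresponds to waiting on each channel inside the first loop. Splitting the loop into two phases is the whole change in this sketch, and any concurrency limit would be left to the daemonset side.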