vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Backing up resources in parallel #2888

Open phuongatemc opened 4 years ago

phuongatemc commented 4 years ago

Currently, Velero Backup processes resources serially. In some scenarios we would like to back up resources in parallel, not only to increase performance but also to reduce the time gap between the backup times of the items. For example, consider backing up a Cassandra cluster of 20 pods (each with 1 PVC). The backup of such a cluster would take snapshots of the PVCs belonging to these pods, and for application consistency these PVCs should be snapshotted as close to each other as possible (either in parallel or in a single volume group, depending on what the storage back end supports).

So the enhancement request is to allow users to specify the resource types (Kinds) to be backed up in parallel. For example, we could add an option, say "ConcurrentResources", and users could specify ConcurrentResources: "pods". During backup, Velero would then create goroutines to back up all Pods in parallel.
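For illustration, the Backup spec might look something like the sketch below; the "concurrentResources" field is only this proposal and does not exist in Velero today.

```yaml
# Hypothetical Backup spec for this proposal; "concurrentResources" is not an
# existing Velero field.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: cassandra-backup
  namespace: velero
spec:
  includedNamespaces:
    - cassandra
  # Kinds listed here would be backed up by parallel goroutines; everything
  # else would keep the current serial behavior.
  concurrentResources:
    - pods
```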

This feature may conflict with the "OrderedResources" feature, in which Velero backs up resources of a specific Kind in a specific order. So "OrderedResources" and "ConcurrentResources" cannot specify the same Kind.

Another aspect to consider is the level of concurrency allowed. For example, if the back-end system only allows up to 10 PVC snapshots to be taken in parallel, or the backup storage device only allows 10 parallel write streams, then Backup cannot create more backup goroutines than that limit. This also raises the issue of multiple Backups running in parallel; we need to factor in the limitation above when creating goroutines.
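As a minimal sketch of how such a limit could be enforced, a buffered channel can act as a semaphore around the backup goroutines; the function and callback names below are hypothetical, not Velero code.

```go
package backup

import "sync"

// backupItemsConcurrently is an illustrative sketch only: it bounds the number
// of in-flight item backups with a buffered channel used as a semaphore.
// backupItem is a hypothetical callback, not a real Velero function.
func backupItemsConcurrently(items []string, limit int, backupItem func(string) error) {
	sem := make(chan struct{}, limit) // e.g. limit = 10 parallel snapshots or write streams
	var wg sync.WaitGroup
	for _, it := range items {
		wg.Add(1)
		sem <- struct{}{} // blocks once `limit` backups are already running
		go func(it string) {
			defer wg.Done()
			defer func() { <-sem }()
			_ = backupItem(it) // error handling omitted for brevity
		}(it)
	}
	wg.Wait()
}
```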

An alternative solution would be the VolumeGroup currently proposed in the Kubernetes Data Protection Working Group. A VolumeGroup allows grouping related PVs together (so they can be snapshotted together...).
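A sketch of what selecting such a group could look like, using tentative names; whatever API the working group eventually ships may differ.

```yaml
# Sketch only: tentative API group/version and field names; the final spec may
# differ, and the CSI driver must support group snapshots.
apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
kind: VolumeGroupSnapshot
metadata:
  name: cassandra-group-snapshot
  namespace: cassandra
spec:
  volumeGroupSnapshotClassName: csi-group-snapclass
  source:
    selector:
      matchLabels:
        app: cassandra   # all PVCs carrying this label are snapshotted together
```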

phuongatemc commented 3 years ago

We ultimately want all the PVCs belonging to the pods of the same application (in the same namespace) to be snapshotted in parallel. However, in Velero's current implementation, the backup processes a pod together with its PVC and PV before moving to the next pod, so we can make it parallel at the pod level, which should be good enough because each pod usually has 1 PVC/PV.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

zubron commented 3 years ago

This is still needed.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 years ago

Closing the stale issue.

jglick commented 2 years ago

if the back-end system only allows up to 10 PVC snapshots to be taken in parallel, or the backup storage device only allows 10 parallel write streams, then Backup cannot create more backup goroutines than that limit

Not necessarily; this can be handled simply by having the goroutine wait and retry in a provider-specific manner. https://github.com/jglick/velero-plugin-for-aws/commit/b5d7c526ec7ab806577134454902a1efb076f2cf seems to work in the case of EBS snapshots. This is needed when the number of PVs in the backup gets into the dozens.
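The pattern is roughly the following; the helper names are hypothetical and this is not the code in the commit above, just a sketch of the wait-and-retry idea.

```go
package snapshots

import (
	"context"
	"fmt"
	"time"
)

// createSnapshotWithRetry is a generic wait-and-retry sketch for a
// provider-side rate limit; createSnapshot and isThrottled are hypothetical
// hooks supplied by the caller.
func createSnapshotWithRetry(
	ctx context.Context,
	volumeID string,
	createSnapshot func(context.Context, string) (string, error),
	isThrottled func(error) bool,
) (string, error) {
	backoff := 5 * time.Second
	for {
		id, err := createSnapshot(ctx, volumeID)
		if err == nil {
			return id, nil
		}
		if !isThrottled(err) {
			return "", err // a real failure, surface it immediately
		}
		select {
		case <-time.After(backoff): // provider said "slow down": wait, then retry
		case <-ctx.Done():
			return "", fmt.Errorf("gave up snapshotting %s: %w", volumeID, ctx.Err())
		}
		if backoff < 2*time.Minute {
			backoff *= 2 // exponential backoff, capped
		}
	}
}
```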

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 years ago

Closing the stale issue.

fabiorauber commented 2 years ago

This issue is still relevant.

onewithname commented 11 months ago

This is still very relevant IMO. Is there any progress on this topic/issue?

Lyndon-Li commented 11 months ago

With 1.12, the time-consuming actions (data-related actions) in backup/restore run in parallel, i.e., CSI snapshot creation for PVCs and data movement for PVCs. There is still one legacy area we haven't touched: volumes from different pods are still processed sequentially for fs-backup. We may improve this in the future.

Resource backups/restores are not planned to go in parallel; since the resources are small, we don't foresee much performance benefit from making them concurrent.

onewithname commented 11 months ago

With 1.12, the time-consuming actions (data-related actions) in backup/restore run in parallel, i.e., CSI snapshot creation for PVCs and data movement for PVCs. There is still one legacy area we haven't touched: volumes from different pods are still processed sequentially for fs-backup. We may improve this in the future.

Resource backups/restores are not planned to go in parallel; since the resources are small, we don't foresee much performance benefit from making them concurrent.

Thanks for the update!

Some information to preface my next question. Also apologies if this is not directly related or off-topic. Please let me know if I should open it as separate issue.

In the environment I am managing we are running PowerProtect Data Manager (Dell's backup tool for Kubernetes/OpenShift). We are struggling with backup performance, and the main point of congestion I see (and the one reported by Dell support) is the metadata backup by Velero (CRDs, secrets, etc.) taking ages.

Example: we have 50 namespaces, each with 60 "backup items" and no PVCs to be backed up. Velero processes each namespace at a rate of 1 resource/second, which takes 50 minutes for everything to complete since everything is sequential. The issue gets even worse with more namespaces or higher resource counts.

Isn't this something where backing up resources in parallel would greatly improve performance?

weshayutin commented 11 months ago

With 1.12, the time-consuming actions (data-related actions) in backup/restore run in parallel, i.e., CSI snapshot creation for PVCs and data movement for PVCs. There is still one legacy area we haven't touched: volumes from different pods are still processed sequentially for fs-backup. We may improve this in the future.

Resource backups/restores are not planned to go in parallel; since the resources are small, we don't foresee much performance benefit from making them concurrent.

I would argue that concurrent backups have been a topic in recent community calls. I can say that engineers from Red Hat are certainly interested in the topic and we're working on potential proposals. Perhaps @onewithname can join a few community calls to highlight their perspective, use case and requirements.

sseago commented 11 months ago

@onewithname One thing you could do to help gauge how much of a speed-up you might be able to see with parallel resource backups: If you install two separate velero instances (in separate namespaces) and run two 25-namespace backups at the same time (one in each velero), how long does it take before both are complete? If velero threading is the bottleneck, then I'd expect completion in closer to 25 minutes than in 50, but if the APIServer is the bottleneck, then you may not see much improvement. That would help us to determine the potential value of this feature.
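Roughly something like the sketch below; the namespace lists are placeholders, and it assumes a second Velero instance has already been installed into a separate namespace (here called velero-b) alongside the default one in velero.

```sh
# Rough sketch of the experiment; flags and namespace lists are placeholders.
time (
  velero backup create test-a --namespace velero   --include-namespaces ns01,...,ns25 --wait &
  velero backup create test-b --namespace velero-b --include-namespaces ns26,...,ns50 --wait &
  wait
)
```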

Lyndon-Li commented 11 months ago

It is worth trying what @sseago mentioned to find the bottleneck first. 1 resource/second doesn't look like normal performance.

sseago commented 11 months ago

For backups, there's an APIServer List call per-resource-type, per namespace. In the test case where you have 50 namespaces and 60 items per namespace, there will be quite a few apiserver calls -- of those 60 items in a namespace how many resource types are represented? It may be that you're making 1 call for every 5 items or so, on average. 1 second per item is still pretty slow, though. I've done some test backups/restores with a large number of items in a single namespace -- 30k secrets. That's 10x as many backup items as you have (50x60, so you have only 3k items), but at least on backup there's a small fraction of the apiserver calls. Backup takes about 2 minutes. On restore, where there are 2 apiserver calls per item (Create and Patch), it takes about 50 minutes, which is about 10x faster per item than you're seeing on backup.

Does restore take as long as backup for you?

sseago commented 11 months ago

That being said, if running 2 velero instances increases your backup performance, then that suggests that backing up multiple resources at a time in a single backup would significantly improve performance for your use case. At the same time, there may be some cluster performance issues in your environment that should be sorted out, or maybe your velero pod needs more memory or CPU resources; it could be that your velero pod is CPU-limited or something similar.

onewithname commented 11 months ago

I would argue that concurrent backups have been a topic in recent community calls. I can say that engineers from Red Hat are certainly interested in the topic and we're working on potential proposals. Perhaps @onewithname can join a few community calls to highlight their perspective, use case and requirements.

I would be happy to assist if needed!

@onewithname One thing you could do to help gauge how much of a speed-up you might be able to see with parallel resource backups: If you install two separate velero instances (in separate namespaces) and run two 25-namespace backups at the same time (one in each velero), how long does it take before both are complete? If velero threading is the bottleneck, then I'd expect completion in closer to 25 minutes than in 50, but if the APIServer is the bottleneck, then you may not see much improvement. That would help us to determine the potential value of this feature.

As I have mentioned before, in this environment I am using the Dell PPDM backup solution, which is "orchestrating" and managing everything. So I don't have the flexibility of running multiple instances of Velero, as it is also managed by the tool. However, I will look into whether it would be possible to arrange the test you describe as a standalone case.

As for an API server bottleneck: I am no OpenShift expert, so I'm not really sure how to gauge that (we are running on-premises OpenShift, if that matters).

In the Velero logs I see this type of message:

I1027 03:57:07.180380 1 request.go:601] Waited for 1.045330997s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/rbac.istio.io/v1alpha1?timeout=32s
I1027 03:57:17.230221 1 request.go:601] Waited for 3.844616025s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/operators.coreos.com/v2?timeout=32s

But out of 77k log lines there are only 44 entries of "due to client-side throttling", and they appear when switching from one namespace to another. So I do not think that would be that impactful; I might be wrong, though.
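In case it matters, my understanding is that the client-side rate limits are tunable on the Velero server (the flags below appear to exist in recent Velero versions, set via the server Deployment args), so that would be the obvious knob to experiment with if throttling did turn out to matter:

```sh
# Illustrative values only; check `velero server --help` for the flags
# available in your version.
velero server --client-qps=100 --client-burst=200
```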

I've done some test backups/restores with a large number of items in a single namespace -- 30k secrets. That's 10x as many backup items as you have (50x60, so you have only 3k items), but at least on backup there's a small fraction of the apiserver calls. Backup takes about 2 minutes.

I have observed similar performance in my environment as well, where Velero backs up 3000 resources in 5 seconds, then takes 3 minutes to go from 3000 to 3100.

On restore, where there are 2 apiserver calls per item (Create and Patch), it takes about 50 minutes, which is about 10x faster per item than you're seeing on backup.

Does restore take as long as backup for you?

The restores I have performed usually take about the same time as the backup when comparing the same namespace.

It could be that your velero pod is CPU-limited or something similar.

Whenever I checked metrics on the velero pods, they never even went up to 20-30% of the provisioned resources. I was using the default values before; last night I increased them 4x but did not observe any improvement.

ihcsim commented 10 months ago

@sseago Running multiple Velero replicas isn't an option atm, because we are using OADP, which, AIUI, hard-codes the Velero replica configuration.

From the Velero logs (at least those that I examined), it doesn't look like the LIST calls to the API server are the bottleneck. The latency seems to come from the backup action. We are seeing the backup log line "Backed up M items out of an estimated total of N" at a 1-second interval, for every item of each resource kind.

In some namespaces, for every item to be backed up, we are also seeing frequent occurrences of msg="[common-backup] Error in getting route: configmaps "oadp-registry-config" not found. Assuming this is outside of OADP context.". I assume that has something to do with us not backing up images, but I'm not sure why it would be relevant to resource kinds like pods, service accounts, etc.

ihcsim commented 10 months ago

@onewithname It would be interesting to see how your API server is performing (CPU, memory, throttling logs, etc.). The Velero logs you posted show only client-side throttling; the API server could also be doing even more server-side throttling.
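For example, one rough way to look (needs cluster-admin, and metric names can vary across Kubernetes/OpenShift versions) would be something like:

```sh
# Requests rejected by API Priority and Fairness (server-side throttling):
kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total
# CPU/memory of the API server pods (OpenShift control-plane namespace shown):
kubectl top pods -n openshift-kube-apiserver
```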

sseago commented 10 months ago

@ihcsim I didn't mean multiple replicas (that doesn't work) -- I meant multiple velero installs (i.e. multiple OADP installs, in different namespaces). In any case, I was not proposing multiple installs as the solution, but as a way of getting data. If you had 2 velero/OADP installs, you could run 2 backups at once, and we could see whether parallel resource backup actually increases throughput in your particular environment with its slow per-item backup rates.

As for that "Error in getting route" message, it looks like it's coming from the oadp-1.0 imagestream plugin -- so: 1) it only runs on imagestream resources in the backup, and 2) that particular message doesn't exist in versions of OADP newer than 1.0.