openshift / openshift-velero-plugin

General Velero plugin for backup and restore of openshift workloads.
Apache License 2.0

On OpenShift namespace with more than 2000 resources, common-backup-plugin being executed on all items #83

Closed: phuongatemc closed this issue 2 years ago

phuongatemc commented 3 years ago

The OpenShift plugin is registered for the following types in Backup:

velero get plugin -n velero-ppdm | grep openshift | grep Backup

openshift.io/01-common-backup-plugin          BackupItemAction
openshift.io/02-serviceaccount-backup-plugin  BackupItemAction
openshift.io/03-pv-backup-plugin              BackupItemAction
openshift.io/04-imagestreamtag-backup-plugin  BackupItemAction
openshift.io/19-is-backup-plugin              BackupItemAction
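For context, this is roughly how such BackupItemAction plugins get wired up with the Velero plugin server. This is only a sketch, not this repository's actual code: the constructor names are placeholders, and only one registration is shown.

```go
// Sketch of BackupItemAction registration with the Velero plugin framework.
// Constructor names are placeholders; each registered action is invoked over
// gRPC once per matching item during a backup.
package main

import (
	"github.com/sirupsen/logrus"
	"github.com/vmware-tanzu/velero/pkg/plugin/framework"
)

func main() {
	framework.NewServer().
		RegisterBackupItemAction("openshift.io/01-common-backup-plugin", newCommonBackupPlugin).
		// 02-serviceaccount, 03-pv, 04-imagestreamtag, and 19-is would be
		// registered the same way.
		Serve()
}

// Placeholder constructor; the real one returns the plugin's BackupItemAction
// implementation instead of nil.
func newCommonBackupPlugin(logger logrus.FieldLogger) (interface{}, error) {
	return nil, nil
}
```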

As a consequence, common-backup-plugin is executed on almost all items in the namespace. When there are around 2000 items in a namespace, the total Velero backup time runs to minutes.

In our test, we created a namespace with 1800 Secrets and the backup took around 12 minutes to complete (our customer has more than 2000 resources in a namespace). The log shows "common-backup" being called on each item being backed up. Since each item involves a gRPC function call, the accumulated time is huge compared to a non-OpenShift cluster backing up a similar namespace: in our test, it took around 5 seconds to back up a namespace with 3600 Secrets on a plain Kubernetes cluster (not OpenShift).
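A rough back-of-the-envelope from those numbers (assuming the per-item plugin round-trips dominate the wall-clock time):

12 min / 1800 Secrets ≈ 400 ms of overhead per item (OpenShift, common plugin invoked per item)
5 s / 3600 Secrets ≈ 1.4 ms per item (plain Kubernetes, no OpenShift plugins)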

@sseago

jwmatthews commented 3 years ago

@phuongatemc thank you for filing this!

kdrodger commented 3 years ago

I would like to add my agreement and support to this as well, @jwmatthews. In the use cases and testing with the teams I've been working with, scalability to very large numbers of objects, even well beyond the numbers here, is vital. Thanks.

jwmatthews commented 3 years ago

Thank you @kdrodger for the note

sseago commented 3 years ago

Looking at the common backup plugin, we're setting up to four annotations:

1) "openshift.io/backup-server-version": Kubernetes version of the cluster being backed up
2) "openshift.io/backup-registry-hostname": host:port for the internal registry of the cluster being backed up
3) "openshift.io/skip-image-copy": boolean indicating whether to skip copying of internal images
4) "openshift.io/migration-registry": location of the backup or migration image registry
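As a reference point, here is a minimal sketch of what stamping those four annotations on every item looks like in a standard Velero BackupItemAction. This is not the plugin's actual code: the struct name, placeholder values, and the empty selector are illustrative; only the annotation keys come from the list above.

```go
// Minimal sketch of a "common" BackupItemAction. The annotation keys match the
// list above; the placeholder values and empty AppliesTo() selector are
// illustrative, not the real implementation.
package plugin

import (
	"github.com/sirupsen/logrus"
	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	"github.com/vmware-tanzu/velero/pkg/plugin/velero"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime"
)

type CommonBackupPlugin struct {
	Log logrus.FieldLogger
}

// An empty ResourceSelector means "applies to every item", which is why this
// action is invoked (over gRPC) for all ~2000 resources in the namespace.
func (p *CommonBackupPlugin) AppliesTo() (velero.ResourceSelector, error) {
	return velero.ResourceSelector{}, nil
}

func (p *CommonBackupPlugin) Execute(item runtime.Unstructured, backup *velerov1.Backup) (runtime.Unstructured, []velero.ResourceIdentifier, error) {
	metadata, err := meta.Accessor(item)
	if err != nil {
		return nil, nil, err
	}
	annotations := metadata.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	// Placeholder values; the real plugin discovers these from the cluster and
	// its configuration.
	annotations["openshift.io/backup-server-version"] = "1.23"                     // 1)
	annotations["openshift.io/backup-registry-hostname"] = "registry.example:5000" // 2)
	annotations["openshift.io/skip-image-copy"] = "false"                          // 3)
	annotations["openshift.io/migration-registry"] = "migration-registry.example"  // 4)
	metadata.SetAnnotations(annotations)

	return item, nil, nil
}
```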

So 1) seems to be added for informational purposes. From what I can tell, it's not currently used by any plugins. I think this may have been used by the plugins at one time. It may be that we can get rid of it now, or at least not require that it be added to every resource, including ones that have no need for the other annotations being set here.

2) is used for image reference swapping on restore for pods and any resource type that includes a pod template (deployments, deploymentconfigs, etc.), as well as for buildconfigs and imagestreams/imagestreamtags. 3) and 4) are only referenced by the ImageStream plugin.

One approach here would be to move 3) and 4) to the existing imagestream backup plugin, since they're not really used elsewhere. 2) is still needed by multiple resources that don't currently have a dedicated plugin (and by some that do). Once we've removed 3) and 4) from the common plugin, we could possibly update the AppliesTo() func to list all of the above-mentioned resource types (11 or so types), as sketched below. That would eliminate this plugin running for 2000 Secrets. However, it won't have as much of an effect if there are 2000 PodTemplate items.
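For illustration, a narrowed AppliesTo() (continuing the CommonBackupPlugin sketch above) might look like the following. The exact resource list is an assumption based on the types mentioned, not a decided set.

```go
// Hypothetical narrowed selector: instead of the empty ResourceSelector that
// matches everything, list only the resource types that actually consume the
// registry-hostname annotation. The exact set here is illustrative.
func (p *CommonBackupPlugin) AppliesTo() (velero.ResourceSelector, error) {
	return velero.ResourceSelector{
		IncludedResources: []string{
			"pods",
			"replicasets",
			"replicationcontrollers",
			"deployments",
			"deploymentconfigs",
			"statefulsets",
			"daemonsets",
			"jobs",
			"cronjobs",
			"buildconfigs",
			"imagestreams",
			"imagestreamtags",
		},
	}, nil
}
```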

On Restore, things are somewhat more complicated. In addition to setting annotations for the same basic four cases, there's also a section in which, for MigMigrations, we set migmigration and migplan labels on all restored resources. As a result, I'm not sure we can perform the same streamlining for the restore case -- we can split out 1)-4) above, but with a separate common plugin still adding labels to everything, it will probably not have much of an impact on performance. However, for the backup/restore use case (as opposed to migration), I'm assuming that backup performance has a greater day-to-day impact than restore performance.
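A rough sketch of the restore-side situation, continuing the same sketch package as above. The label keys and how the migration context is detected are assumptions for illustration, not the actual restore plugin code.

```go
// Hypothetical restore-side counterpart: when the restore belongs to a
// MigMigration, every restored item is labeled so the migration can be rolled
// back later, which is why this action has to run on everything.
type CommonRestorePlugin struct {
	Log logrus.FieldLogger
}

func (p *CommonRestorePlugin) AppliesTo() (velero.ResourceSelector, error) {
	return velero.ResourceSelector{}, nil // still matches every item
}

func (p *CommonRestorePlugin) Execute(input *velero.RestoreItemActionExecuteInput) (*velero.RestoreItemActionExecuteOutput, error) {
	metadata, err := meta.Accessor(input.Item)
	if err != nil {
		return nil, err
	}
	labels := metadata.GetLabels()
	if labels == nil {
		labels = map[string]string{}
	}
	// Illustrative keys and values only; the real plugin derives these from the
	// migration that drove the restore.
	labels["migration.openshift.io/migrated-by-migmigration"] = "example-migmigration-uid"
	labels["migration.openshift.io/migrated-by-migplan"] = "example-migplan-uid"
	metadata.SetLabels(labels)

	return velero.NewRestoreItemActionExecuteOutput(input.Item), nil
}
```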

dymurray commented 3 years ago

One approach here would be to move 3) and 4) to the existing imagestream backup plugin, since they're not really used elsewhere. 2) is still needed by multiple resources that don't currently have a dedicated plugin (and by some that do). Once we've removed 3) and 4) from the common plugin, we could possibly update the AppliesTo() func to list all of the above-mentioned resource types (11 or so types). That would eliminate this plugin running for 2000 Secrets. However, it won't have as much of an effect if there are 2000 PodTemplate items.

I think this is a good idea. Descoping it to as few resources as possible is a good path forward even if we can't prevent everything.

On Restore, things are somewhat more complicated. In addition to setting annotations for the same basic four cases, there's also a section in which, for MigMigrations, we set migmigration and migplan labels on all restored resources. As a result, I'm not sure we can perform the same streamlining for the restore case

I agree with this

I'm assuming that backup performance has a greater day-to-day impact than restore performance.

I think this is a good point. In a normal environment a customer will run multiple backups and hopefully never need to run a restore, but if they do it will be much less frequent, so a performance difference shouldn't be too big of a deal. Backups are run periodically, restores are not.

kdrodger commented 3 years ago

Backups are run periodically, restores are not.

Not necessarily. That is common, and your overall point is completely valid: backups are executed frequently and are the common case. But restores are when performance really matters; that's when someone's neck can be on the line after a serious problem.

Production environments use B/R for a wide variety of purposes and I've worked with environments where restores are very routine, with individual systems restored sometimes multiple times per day, on purpose.

I'm just saying, yes, optimize for backup performance but restore performance also matters, a lot.

sseago commented 3 years ago

@kdrodger The thing is, the enhancement we're talking about here -- avoiding running the common backup plugin on all resources -- is something we probably can't do for restores, because in the migration use case we need to label everything we restore to support rollback, so we have a reason to run the restore plugin on everything. The backup plugin concern mentioned in this issue is something we're looking at dealing with, but I don't think we can do the same thing on restore.

kdrodger commented 3 years ago

Understood @sseago . Just sharing my experience.

dymurray commented 3 years ago

Wanted to add that there may be a way to skip some resources that have duplicated definitions across different API groups, such as the RBAC endpoints. See: https://github.com/konveyor/oadp-operator/issues/113

openshift-bot commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kdrodger commented 2 years ago

/remove-lifecycle stale