vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Allow customizing restore order for Kubernetes controllers and their managed resources #4045

Open DanielXiao opened 3 years ago

DanielXiao commented 3 years ago

Describe the problem/challenge you have

When the restore targets include Kubernetes controllers, the following issues can occur:

  1. Velero is not aware of dependencies among Custom Resources and restores them in alphabetical order. This can surface as errors such as "invalid memory address or nil pointer dereference".
  2. A race condition can arise between Velero and a controller when both operate on the same resource. See the following errors from an Antrea restore:

time="2021-08-10T16:41:04Z" level=info msg="Attempting to restore Tier: securityops" logSource="pkg/restore/restore.go:1070" restore=velero/restore-48c089d0-03ed-4f30-8532-a2e9837bea94
time="2021-08-10T16:41:04Z" level=info msg="error restoring securityops: admission webhook \"tiervalidator.antrea.tanzu.vmware.com\" denied the request: tier securityops priority 50 overlaps with existing Tier" logSource="pkg/restore/restore.go:1133" restore=velero/restore-48c089d0-03ed-4f30-8532-a2e9837bea9

error restoring application: admission webhook \"tiervalidator.antrea.tanzu.vmware.com\" denied the request: tier application priority 250 overlaps with existing Tier"

Describe the solution you'd like

In the default restore order, controller Pods are restored before the Custom Resources they manage, so we may solve this problem by:

  1. Allowing users to define the restore order of Custom Resources per restore.
  2. Marking controller Pods/Deployments with labels, removing them from the ordered list, and appending them to the end (before any managed resources).

For an Antrea cluster, the order should be: default restore order -> Tier CR -> other Antrea CRs -> Antrea controller Pod -> Antrea controller ReplicaSet and Deployment -> Antrea MutatingWebhookConfiguration and ValidatingWebhookConfiguration.
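The ordering above can be sketched in Go as a stable partition: items that are not deferred keep their default order, and deferred items are moved to the end and sorted by a user-supplied tail order. This is only an illustration of the idea, not Velero code. The `Item` type, the `deferItems` helper, the `example.io/restore-last` label, and the object names are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Item is a hypothetical stand-in for one backed-up object.
type Item struct {
	Resource string
	Name     string
	Labels   map[string]string
}

// deferItems keeps non-deferred items in their incoming (default) order and
// appends the deferred items, stably sorted by their position in tailOrder.
// Deferred resources missing from tailOrder are placed last.
func deferItems(items []Item, deferred func(Item) bool, tailOrder []string) []Item {
	pos := make(map[string]int, len(tailOrder))
	for i, r := range tailOrder {
		pos[r] = i
	}
	rank := func(it Item) int {
		if p, ok := pos[it.Resource]; ok {
			return p
		}
		return len(tailOrder)
	}
	var head, tail []Item
	for _, it := range items {
		if deferred(it) {
			tail = append(tail, it)
		} else {
			head = append(head, it)
		}
	}
	sort.SliceStable(tail, func(i, j int) bool { return rank(tail[i]) < rank(tail[j]) })
	return append(head, tail...)
}

// exampleOrder applies deferItems to a small, made-up Antrea-like item list.
func exampleOrder() string {
	last := map[string]string{"example.io/restore-last": "true"} // hypothetical label
	items := []Item{
		{Resource: "namespaces", Name: "kube-system"},
		{Resource: "pods", Name: "antrea-controller", Labels: last},
		{Resource: "pods", Name: "web"},
		{Resource: "deployments", Name: "antrea-controller", Labels: last},
		{Resource: "clusternetworkpolicies", Name: "allow-dns"},
		{Resource: "tiers", Name: "securityops"},
		{Resource: "validatingwebhookconfigurations", Name: "antrea-validator"},
	}
	// Resources that are always deferred, regardless of labels (assumption).
	always := map[string]bool{
		"tiers": true, "clusternetworkpolicies": true,
		"mutatingwebhookconfigurations": true, "validatingwebhookconfigurations": true,
	}
	tailOrder := []string{
		"tiers", "clusternetworkpolicies", "pods", "replicasets", "deployments",
		"mutatingwebhookconfigurations", "validatingwebhookconfigurations",
	}
	isDeferred := func(it Item) bool {
		return always[it.Resource] || it.Labels["example.io/restore-last"] == "true"
	}
	ordered := deferItems(items, isDeferred, tailOrder)
	parts := make([]string, 0, len(ordered))
	for _, it := range ordered {
		parts = append(parts, it.Resource+"/"+it.Name)
	}
	return strings.Join(parts, " -> ")
}

func main() {
	fmt.Println(exampleOrder())
}
```

In this sketch the unlabelled `web` Pod stays in the default phase, while the labelled controller Pod and Deployment are restored after the Tier and the other CRs, and the webhook configuration comes last.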

Anything else you would like to add:

Many workloads today consist of controllers and operators, so both disaster recovery and migration might hit this issue.

vineetsingh5 commented 2 years ago

We are also facing the same issue with a Rancher cluster. We triggered a restore multiple times using the same backup, but the restore sometimes failed with a similar error (nil pointer dereference).

bluzarraga commented 5 months ago

Hello, we are also interested in some kind of ordering mechanism through the Velero Restore object. Given the age of this issue, is there any plan to implement this feature or something like it, @reasonerjt?

kaovilai commented 5 months ago

I'm interested in helping with this. One approach would be to add a restore priority field to the Restore CR and, if it is empty, fall back to the server's restore priorities.
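A minimal Go sketch of that fallback, assuming a per-restore priority list is added to the restore spec. The `RestoreSpec` type, its `ResourcePriorities` field, and the `effectivePriorities` helper are hypothetical names used only for illustration; they are not part of the existing Velero API.

```go
package main

import "fmt"

// RestoreSpec is a hypothetical, trimmed-down restore spec.
// ResourcePriorities is the proposed per-restore field (an assumption).
type RestoreSpec struct {
	ResourcePriorities []string
}

// effectivePriorities returns the per-restore priorities when they are set,
// and otherwise falls back to the server-level priorities.
func effectivePriorities(spec RestoreSpec, serverPriorities []string) []string {
	if len(spec.ResourcePriorities) == 0 {
		return serverPriorities
	}
	return spec.ResourcePriorities
}

func main() {
	// Illustrative server-level list, not Velero's actual default.
	server := []string{"customresourcedefinitions", "namespaces", "pods"}

	// No per-restore override: the server list is used.
	fmt.Println(effectivePriorities(RestoreSpec{}, server))

	// Per-restore override: the restore's own list wins.
	custom := RestoreSpec{ResourcePriorities: []string{"namespaces", "tiers", "pods"}}
	fmt.Println(effectivePriorities(custom, server))
}
```

Treating an empty field as "use the server setting" keeps existing Restore objects working unchanged, since they would not set the new field.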