vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0
8.61k stars 1.39k forks

Support detect k8s resource dependency during backup #7199

Open blackpiglet opened 9 months ago

blackpiglet commented 9 months ago

Describe the problem/challenge you have

It would be useful for Velero to know the dependencies among the backed-up k8s resources. If the Velero server knows them, it can detect invalid backups before running the backup process. This feature can help resolve the scenario described in PR #7045.

This feature also has benefits for

Describe the solution you'd like

The Velero server can use a DAG (Directed Acyclic Graph) as the data structure to store the backup resources.

The DAG's content should be:

Say the following string represents a DAG; the resource backup sequence should be ordered from left to right.

e > f, g > h;

The DAG should be generated from existing rules:

While generating the DAG from these rules, if a later rule violates the existing DAG resource hierarchy, fail the backup and warn the user that the rule is invalid.
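A rule "violates the existing hierarchy" when adding its edge would create a cycle, which can be checked before the edge is inserted. The sketch below assumes a minimal edge-map DAG; the type and method names (`dag`, `addRule`, `reachable`) are illustrative, not Velero APIs:

```go
package main

import "fmt"

// dag stores dependency edges from a parent resource to its children.
type dag struct {
	edges map[string][]string
}

func newDAG() *dag { return &dag{edges: map[string][]string{}} }

// addRule adds the edge parent -> child, rejecting the rule if the
// child can already reach the parent (the edge would form a cycle).
func (d *dag) addRule(parent, child string) error {
	if d.reachable(child, parent) {
		return fmt.Errorf("rule %q > %q violates the existing hierarchy", parent, child)
	}
	d.edges[parent] = append(d.edges[parent], child)
	return nil
}

// reachable reports whether there is a path from -> to.
func (d *dag) reachable(from, to string) bool {
	if from == to {
		return true
	}
	for _, c := range d.edges[from] {
		if d.reachable(c, to) {
			return true
		}
	}
	return false
}

func main() {
	d := newDAG()
	fmt.Println(d.addRule("e", "f")) // <nil>: accepted
	fmt.Println(d.addRule("f", "g")) // <nil>: accepted
	// "g > e" would close the cycle e -> f -> g -> e, so it fails.
	fmt.Println(d.addRule("g", "e") != nil) // true
}
```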

When taking the backup, start from the root node and go through its children, then traverse the children's children. If the backup reaches a resource whose parents are not yet all backed up, the Velero server should put it on hold and continue; it should then retry the on-hold resources before traversing the next layer of resources.

Anything else you would like to add:

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

blackpiglet commented 9 months ago

I gave some consideration to the BIA scenario here. BIA is different because Velero collects all the resources it cares about at the start of the backup, yet the BIAs are executed during the backup process and can return additional resources that are also included in the backup.

That means the Velero server cannot determine the whole scope of the backup until the backup finishes. That makes supporting parallel backups impossible, because the Velero server cannot detect overlaps and potential conflicts between backups.

There has also been some discussion about adding a new method to the BIA to return the additional resources during the backup resource-collection stage. I don't think that can resolve the issue.

I think the real problem BIAs cause is that the Velero server cannot know what the BIAs do. If a BIA freezes the filesystem of a pod that is not included in the backup, although IMO that shouldn't happen, it will impact parallel filesystem backups.

Unfortunately, since plugins are external binaries, it's not possible to regulate their behavior. IMO, we can only provide a guideline for how plugins should behave to make parallel backups possible.

reasonerjt commented 7 months ago

I think how to define "dependency" is a topic that may cause a lot of debate, and it becomes very complicated once custom resources are considered. As for the data structure to track the dependency, there is a design that has already been merged: https://github.com/vmware-tanzu/velero/blob/main/design/graph-manifest.md We may consider using this data structure to solve specific problems, instead of trying to introduce a generic approach that handles all resources.

Marking this as "ice-box", as we may need more concrete use cases and should handle them separately.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

blackpiglet commented 5 months ago

Not stale.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

kaovilai commented 3 months ago

unstale

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

blackpiglet commented 1 month ago

unstale