vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Allow not skipping Complete jobs in velero restores. #3519

Open daguilarv opened 3 years ago

daguilarv commented 3 years ago

Describe the problem/challenge you have

Currently we can see in the restore logs that Completed Jobs are skipped during the restore procedure.

time="2021-03-04T09:57:31Z" level=info msg="baikal-system/elasticsearch-provision-zcsj4 is complete - skipping" logSource="pkg/restore/restore.go:853" restore=baikal-infra/disaster-20210304095607 time="2021-03-04T09:57:48Z" level=info msg="baikal-system/elasticsearch-provision is complete - skipping" logSource="pkg/restore/restore.go:853" restore=baikal-infra/disaster-20210304095607

If you have job-completion checks defined in the init containers of other pods, those pods are stuck in the Init status forever.
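For context, a minimal sketch of that pattern (hypothetical names, assuming the pod's service account is allowed to read Jobs in its namespace): a Deployment whose init container blocks until a provisioning Job reports the Complete condition. If the restore never recreates that Job, the pod never leaves Init.

```yaml
# Hypothetical example: a Deployment whose pods wait for a one-shot
# provisioning Job to reach the Complete condition before the main
# container starts. If a restore never recreates that Job, the init
# container never succeeds and the pod stays in Init.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: elasticsearch-consumer          # hypothetical name
  namespace: baikal-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: elasticsearch-consumer
  template:
    metadata:
      labels:
        app: elasticsearch-consumer
    spec:
      serviceAccountName: job-waiter    # assumed to have RBAC to get/watch Jobs
      initContainers:
      - name: wait-for-provision
        image: bitnami/kubectl:latest
        command:
        - kubectl
        - wait
        - --for=condition=complete
        - job/elasticsearch-provision
        - --timeout=600s
      containers:
      - name: app
        image: nginx                    # placeholder for the real workload
```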

Describe the solution you'd like

I'd like a new flag on the restore command to force Jobs to be re-applied.



palmerabollo commented 1 year ago

Any plans to implement this feature? We currently have to remove the jobs and re-apply them manually, since they are skipped as Completed during the Velero restore.

sseago commented 1 year ago

It's been a while since I've looked into this, but I suspect the issue is that if Velero creates a job on restore that was previously completed, it will run again, which seems like the wrong thing to do. An already-complete job has no need to run again, and since the point of creating a job is to run it, it's not clear to me that creating an already-run job on restore is the right thing to do.

akshay-cognologix commented 3 months ago

Any update for this feature?

blackpiglet commented 3 months ago

@akshay-cognologix Could you give more information about why you want this function? As @sseago said, restoring completed Jobs will make them run again after the restore. Usually, this is not expected.

daguilarv commented 3 months ago


Imagine a disaster recovery procedure where the disaster is the deletion of a namespace. If some deployments have init containers waiting for job completion and you run a Velero restore, those jobs are never re-applied, so the recovery never completes.

blackpiglet commented 3 months ago

I see. There is a feature that allows restoring the status of the k8s resources. Please check whether that feature can resolve your issue. https://velero.io/docs/main/restore-reference/#restore-status-field-of-objects
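For reference, that feature is configured through the restoreStatus section of the Restore spec, roughly as in the linked docs; the sketch below uses a hypothetical backup name. Note that it controls whether the status field of restored objects is kept rather than cleared, so whether it also helps with Jobs that the restore skips as complete would still need to be verified.

```yaml
# Sketch of the restore-status feature described in the linked docs;
# the backup name is hypothetical. This keeps the status field of
# restored Jobs instead of clearing it.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-with-job-status
  namespace: velero
spec:
  backupName: my-backup        # hypothetical backup name
  restoreStatus:
    includedResources:
    - jobs
```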

sseago commented 3 months ago

@daguilarv So in that scenario, is it fine that the completed job runs again? I imagine for some workloads, running again may be fine. For others, perhaps not. Also note that if you have completed jobs that were run from a CronJob, then if we restored all of those jobs, you would end up with several copies of the same (previously completed) job running at once on restore.

daguilarv commented 3 months ago


CronJobs are not a problem: once the controller is restored, the job will eventually be executed again. One-shot Jobs used for provisioning are the issue in my case.

sseago commented 3 months ago

@daguilarv Point was if you have a cronjob, then you may have multiple associated completed jobs, and if we restore all of those when we restore a namespace, then all of those jobs will end up running again at once. Say it's something that runs twice a day and is configured to retain the last 10 completed jobs. So you could end up with 10 copies of the "every 12 hours" job running all at once. For some workloads, that may be fine. For others, it may cause problems. I guess my overall point was that if we did decide to enable this, it would have to be configurable non-default behavior and not the behavior that occurs by default, since it has the potential to break things.
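To make that scenario concrete, a hypothetical CronJob like the one below (names and schedule made up) keeps its last 10 successful Jobs around; restoring all of those completed Jobs verbatim would kick off up to 10 copies of the same workload at once.

```yaml
# Hypothetical CronJob: runs every 12 hours and retains its last 10
# successful Jobs. If a restore recreated all of the retained (completed)
# Jobs, up to 10 copies of the same workload would run again at once.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: periodic-report
spec:
  schedule: "0 */12 * * *"
  successfulJobsHistoryLimit: 10
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: report
            image: busybox
            command: ["sh", "-c", "echo generating report"]
```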

sseago commented 3 months ago

One possible solution would be to define a particular annotation, and Velero would only restore completed jobs that carry this annotation. So for the ones you need to restore and re-run, you could add this annotation to your job.
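If that proposal were adopted, usage might look roughly like the sketch below. The annotation key is made up for illustration; Velero does not define such an annotation today.

```yaml
# Hypothetical opt-in annotation on a one-shot provisioning Job. The key
# velero.io/restore-completed-job is invented for illustration only and
# is not recognized by Velero today.
apiVersion: batch/v1
kind: Job
metadata:
  name: elasticsearch-provision
  namespace: baikal-system
  annotations:
    velero.io/restore-completed-job: "true"
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: provision
        image: busybox
        command: ["sh", "-c", "echo provisioning"]
```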

daguilarv commented 3 months ago


It would be an excellent solution.

akshay-cognologix commented 3 months ago

@blackpiglet @daguilarv The reason I want this feature is that my application depends on Jobs in Kubernetes: a few services only run after those jobs complete, so service startup depends on them. Since we use Velero for DR, all services should come up after restoration. That's why I up-voted this feature. The solution suggested by @sseago is an excellent one!