vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Maintenance job needs a timeout mechanism, abnormal-state detection, and graceful shutdown #7748

Open qiuming-best opened 2 months ago

qiuming-best commented 2 months ago

Describe the problem/challenge you have

Describe the solution you'd like

Currently, the maintenance job has no timeout mechanism, so a job that gets into an abnormal state may keep running for a long time.

We need to detect when a maintenance job is in an abnormal state and let it fail early by deleting the job.
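
To make the idea concrete, here is a minimal sketch of the detection step, assuming a fixed timeout measured from the job's start time. The `jobInfo` type and the `isOverdue`, `overdueJobs`, and `demoOverdue` helpers are hypothetical names for illustration only; they are not Velero or Kubernetes APIs, and real code would read this state from the Job objects.

```go
package main

import (
	"fmt"
	"time"
)

// jobInfo is a hypothetical, simplified view of a maintenance job.
// It is not a Velero or Kubernetes type.
type jobInfo struct {
	Name      string
	StartedAt time.Time // zero value means the job has not started yet
	Finished  bool
}

// isOverdue reports whether a started, unfinished job has been
// running for longer than timeout.
func isOverdue(j jobInfo, now time.Time, timeout time.Duration) bool {
	if j.Finished || j.StartedAt.IsZero() {
		return false
	}
	return now.Sub(j.StartedAt) > timeout
}

// overdueJobs returns the names of the jobs that exceeded the timeout
// and are therefore candidates for failing early.
func overdueJobs(jobs []jobInfo, now time.Time, timeout time.Duration) []string {
	var names []string
	for _, j := range jobs {
		if isOverdue(j, now, timeout) {
			names = append(names, j.Name)
		}
	}
	return names
}

// demoOverdue evaluates a fixed sample set against a one-hour timeout.
func demoOverdue() []string {
	now := time.Date(2024, 5, 1, 12, 0, 0, 0, time.UTC)
	jobs := []jobInfo{
		{Name: "maintain-a", StartedAt: now.Add(-10 * time.Minute)},
		{Name: "maintain-b", StartedAt: now.Add(-3 * time.Hour)},
		{Name: "maintain-c", StartedAt: now.Add(-5 * time.Hour), Finished: true},
		{Name: "maintain-d"}, // never started
	}
	return overdueJobs(jobs, now, time.Hour)
}

func main() {
	fmt.Println(demoOverdue())
}
```

Note that this sketch deliberately ignores jobs that never started, since a run-time timeout says nothing about them. Kubernetes Jobs also have a built-in `spec.activeDeadlineSeconds` field that terminates a Job after a deadline, which may or may not be appropriate here.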

reasonerjt commented 2 months ago

If the job fails to start, a timeout mechanism probably won't help much; we just need to make sure the jobs don't pile up on the k8s side.
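
One way to keep jobs from piling up is to retain only the newest few and delete the rest. The sketch below shows just the selection logic, assuming jobs are ordered by creation time; `jobRecord`, `jobsToDelete`, and `demoPrune` are hypothetical names, and the actual deletion through the Kubernetes API is left out.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// jobRecord is a hypothetical, simplified view of a maintenance job.
type jobRecord struct {
	Name    string
	Created time.Time
}

// jobsToDelete returns the names of every job except the `keep`
// most recently created ones, newest first.
func jobsToDelete(jobs []jobRecord, keep int) []string {
	if keep < 0 {
		keep = 0
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := make([]jobRecord, len(jobs))
	copy(sorted, jobs)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].Created.After(sorted[j].Created)
	})
	if len(sorted) <= keep {
		return nil
	}
	var names []string
	for _, j := range sorted[keep:] {
		names = append(names, j.Name)
	}
	return names
}

// demoPrune keeps the two newest jobs out of a fixed sample of four.
func demoPrune() []string {
	base := time.Date(2024, 5, 1, 0, 0, 0, 0, time.UTC)
	jobs := []jobRecord{
		{Name: "maintain-1", Created: base},
		{Name: "maintain-3", Created: base.Add(2 * time.Hour)},
		{Name: "maintain-2", Created: base.Add(time.Hour)},
		{Name: "maintain-4", Created: base.Add(3 * time.Hour)},
	}
	return jobsToDelete(jobs, 2)
}

func main() {
	fmt.Println(demoPrune())
}
```

With this approach a job that never starts is cleaned up by the same rule as any other old job, without needing a timeout.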

If the job has started and is taking longer to finish, killing the process may be dangerous: the job is running repository maintenance, so an abrupt kill may leave the repository in an inconsistent state.
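
A graceful shutdown could address this by honoring a stop request only between units of work, never in the middle of one. The following is a rough sketch under that assumption; `step`, `runSteps`, and `demoGraceful` are hypothetical, and it presumes maintenance can be split into steps that each leave the repository consistent, which may not hold for a real repository backend.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// step is a hypothetical unit of maintenance work that must not be
// interrupted once it has begun.
type step func() error

// runSteps runs the steps in order and checks for cancellation only
// between steps. It returns how many steps completed.
func runSteps(ctx context.Context, steps []step) (int, error) {
	for i, s := range steps {
		if err := ctx.Err(); err != nil {
			return i, err // stop at a safe boundary
		}
		if err := s(); err != nil {
			return i, err
		}
	}
	return len(steps), nil
}

// demoGraceful simulates a stop request arriving during the second step.
func demoGraceful() (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	steps := []step{
		func() error { return nil },
		func() error { cancel(); return nil }, // stop requested mid-step
		func() error { return nil },           // never starts
	}
	return runSteps(ctx, steps)
}

func main() {
	done, err := demoGraceful()
	fmt.Println(done, errors.Is(err, context.Canceled))
}
```

Here the second step still runs to completion after the stop request, and only the third step is skipped, so the process exits at a boundary rather than mid-operation.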