Setting restart policy of backup job to OnFailure

charleszheng44 commented 1 year ago

Feature Request

Is your feature request related to a problem? Please describe:

We found that when we killed a backup pod, the backup job went to the failed state without retrying. This can happen during a node upgrade or when a node goes down. The reason is that the BackoffLimit is always set to 0 and the RestartPolicy of the pod is always Never (see https://github.com/pingcap/tidb-operator/blob/ec8974c534d6beeedc68a9106b82ec45fbab3d90/pkg/backup/backup/backup_manager.go#L660). Unfortunately, we can't modify these settings for now. Would it be possible to set them in our Backup CR?

Describe the feature you'd like:

Allow setting the restart policy of the backup job to RestartPolicyOnFailure
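
For illustration, the two settings in question correspond roughly to the following fields on the Job that the operator creates. This is a trimmed sketch that shows only those fields, not the manifest the operator actually generates; everything else (containers, labels, and so on) is omitted.

```yaml
# Sketch only: a fragment of a batch/v1 Job limited to the fields discussed here.
apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 0            # currently always 0, per the description above
  template:
    spec:
      restartPolicy: Never   # currently always Never; the request is to allow OnFailure
```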

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

charleszheng44 commented 1 year ago

@csuzhangxc

csuzhangxc commented 1 year ago

@WizardXiao will you implement this?

WizardXiao commented 1 year ago

OK, I will try to implement this.

charleszheng44 commented 1 year ago

@WizardXiao Can I know the ETA of this feature?

WizardXiao commented 1 year ago

v1.4.4 supports retries for snapshot backups in case of unexpected failures caused by Kubernetes Job or Pod issues. It is configured through spec.backoffRetryPolicy. Ref: https://docs.pingcap.com/tidb-in-kubernetes/stable/backup-restore-cr#general-fields
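
As a rough sketch, a Backup CR using this field might look like the fragment below. The sub-field names and values under backoffRetryPolicy are assumptions based on the linked documentation and should be checked there before use. The metadata names are placeholders, and the other fields a real Backup CR needs (storage, BR settings, and so on) are omitted.

```yaml
# Sketch only: verify field names and defaults against the linked docs.
apiVersion: pingcap.com/v1alpha1
kind: Backup
metadata:
  name: demo-backup          # placeholder name
  namespace: backup-test     # placeholder namespace
spec:
  backoffRetryPolicy:
    minRetryDuration: 300s   # assumed: minimum wait before a retry
    maxRetryTimes: 2         # assumed: maximum number of retries
    retryTimeout: 30m        # assumed: stop retrying after this long
  # storage and BR configuration omitted
```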