vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Retry backup/restore completion/finalizing status patching to unstick in-progress backups/restores #7845

Open kaovilai opened 1 month ago

kaovilai commented 1 month ago

Signed-off-by: Tiger Kaovilai tkaovila@redhat.com

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #7207

Please indicate you've done the following:
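For context, a minimal sketch of the idea behind this change, assuming a controller-runtime client; the helper name `patchStatusWithRetry` and the backoff values are illustrative, not the actual contents of `pkg/client/retry.go`:

```go
package client

import (
	"context"
	"time"

	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
	crclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// patchStatusWithRetry retries a status patch while the API server refuses
// connections, so a Backup/Restore that has finished processing is not left
// stuck InProgress by a brief API server outage.
func patchStatusWithRetry(ctx context.Context, c crclient.Client, obj crclient.Object, patch crclient.Patch) error {
	backoff := wait.Backoff{
		Duration: time.Second, // initial delay between attempts
		Factor:   2.0,         // exponential growth of the delay
		Steps:    5,           // maximum number of attempts
	}
	return retry.OnError(backoff, utilnet.IsConnectionRefused, func() error {
		return c.Status().Patch(ctx, obj, patch)
	})
}
```

Only connection-refused errors are retried here; any other error is returned immediately so ordinary failures still surface right away.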

kaovilai commented 1 month ago

TODO: fix unit tests' mock assertions.

```
--- FAIL: Test_restoreFinalizerReconciler_finishProcessing (0.33s)
    /Users/tiger/git/velero/pkg/controller/restore_finalizer_controller_test.go:581:
            Error Trace:    /Users/tiger/git/velero/pkg/controller/restore_finalizer_controller_test.go:581
                            /Users/tiger/git/velero/pkg/controller/restore_finalizer_controller_test.go:611
            Error:          Not equal:
                            expected: 2
                            actual  : 4
            Test:           Test_restoreFinalizerReconciler_finishProcessing
            Messages:       Expected number of calls (2) does not match the actual number of calls (4).
    --- FAIL: Test_restoreFinalizerReconciler_finishProcessing/restore_failed_to_patch_status,_should_retry_on_connection_refused (0.33s)
        /Users/tiger/git/velero/pkg/controller/restore_finalizer_controller_test.go:609: restoreFinalizerReconciler.finishProcessing() error = connection refused, wantErr false
        /Users/tiger/git/velero/pkg/controller/restore_finalizer_controller_test.go:612: mockClientAsserts() failed
FAIL
FAIL    github.com/vmware-tanzu/velero/pkg/controller   1.090s
FAIL
```
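For illustration only (not the Velero test code): once retries are added, a testify mock that simulates a refused connection sees extra `Patch` calls, so the expected call count in the assertion has to change. A self-contained sketch of that pattern:

```go
package controller_test

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/mock"
)

// mockPatcher stands in for the mocked client used by the real test; only the
// call-counting behavior matters for this illustration.
type mockPatcher struct{ mock.Mock }

func (m *mockPatcher) Patch() error {
	args := m.Called()
	return args.Error(0)
}

func TestPatchCallCountWithRetries(t *testing.T) {
	m := new(mockPatcher)
	// The first attempt is refused and the retry succeeds, so the code under
	// test now makes two calls where the old expectation assumed one.
	m.On("Patch").Return(errors.New("connection refused")).Once()
	m.On("Patch").Return(nil).Once()

	// Stand-in for the reconciler retrying the patch on connection refused.
	if err := m.Patch(); err != nil {
		_ = m.Patch()
	}

	// This is the kind of assertion that fails in the log above until the
	// expected count is updated to include the retries.
	m.AssertNumberOfCalls(t, "Patch", 2)
}
```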
codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 3.33333%, with 29 lines in your changes missing coverage. Please review.

Project coverage is 58.72%. Comparing base (33633d8) to head (1587186). Report is 4 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| pkg/client/retry.go | 0.00% | 20 Missing :warning: |
| pkg/util/kube/client.go | 0.00% | 6 Missing :warning: |
| pkg/controller/backup_controller.go | 0.00% | 0 Missing and 1 partial :warning: |
| pkg/controller/backup_finalizer_controller.go | 0.00% | 0 Missing and 1 partial :warning: |
| pkg/controller/restore_controller.go | 0.00% | 0 Missing and 1 partial :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7845      +/-   ##
==========================================
- Coverage   58.79%   58.72%   -0.07%
==========================================
  Files         345      345
  Lines       28764    28785      +21
==========================================
- Hits        16911    16905       -6
- Misses      10425    10451      +26
- Partials     1428     1429       +1
```


kaovilai commented 1 month ago

Received feedback to expand the retry to other status patches, and potentially to backups too. Will ask for feedback in the issue.

Missxiaoguo commented 3 weeks ago

We are working on making each k8s API client call retriable in case of internal server errors caused by temporary API outages. client-go offers the capability to implement a custom middleware transport that adds extra behavior or processing to the default HTTP transport (https://github.com/kubernetes/client-go/blob/master/transport/config.go#L68). We are using it to wrap the default HTTP request with retries for specific error types: https://github.com/openshift-kni/lifecycle-agent/pull/548/files#diff-99516c035075962960ff61611a293268a97b3a6535639e92580d0ef8de6eb8cf. Adding retry logic as middleware enhances the resilience of every API client call. Not sure if your team has considered this solution!
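A minimal sketch of the middleware-transport idea described above, assuming a client-go `rest.Config`; the `retryRoundTripper` type, the string-based connection-refused check, and the retry counts are illustrative and not taken from the linked lifecycle-agent PR:

```go
package client

import (
	"net/http"
	"strings"
	"time"

	"k8s.io/client-go/rest"
)

// retryRoundTripper wraps the default transport and replays requests that
// fail at the transport level with "connection refused".
type retryRoundTripper struct {
	delegate http.RoundTripper
	retries  int
	backoff  time.Duration
}

func (rt *retryRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := rt.delegate.RoundTrip(req)
	for attempt := 0; attempt < rt.retries && err != nil && strings.Contains(err.Error(), "connection refused"); attempt++ {
		retryReq := req.Clone(req.Context())
		if req.Body != nil {
			// A request whose body has been consumed can only be replayed if
			// GetBody can produce a fresh copy.
			if req.GetBody == nil {
				return resp, err
			}
			body, bodyErr := req.GetBody()
			if bodyErr != nil {
				return resp, err
			}
			retryReq.Body = body
		}
		time.Sleep(rt.backoff)
		resp, err = rt.delegate.RoundTrip(retryReq)
	}
	return resp, err
}

// WrapConfigWithRetries installs the retrying transport through client-go's
// WrapTransport hook, so every client built from this rest.Config inherits
// the retry behavior.
func WrapConfigWithRetries(cfg *rest.Config) *rest.Config {
	cfg.WrapTransport = func(base http.RoundTripper) http.RoundTripper {
		return &retryRoundTripper{delegate: base, retries: 3, backoff: 2 * time.Second}
	}
	return cfg
}
```

The appeal of this approach is that the retry lives below every client built from the config, so no individual controller has to wrap its own calls.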

kaovilai commented 3 weeks ago

That's cool. Will check, thanks @Missxiaoguo

sseago commented 3 weeks ago

I'm thinking that we want to be careful here. We absolutely want to retry the status transitions away from InProgress and toward terminal states, but if we retry for up to 2 minutes per restored item, a bad API server could cause a big Restore to hang for hours or days, when what we want there is to mark the error and move on.

Missxiaoguo commented 3 weeks ago

> I'm thinking that we want to be careful here. We absolutely want to retry the status transitions away from InProgress and toward terminal states, but if we retry for up to 2 minutes per restored item, a bad API server could cause a big Restore to hang for hours or days, when what we want there is to mark the error and move on.

Yeah, my understanding is that the hang depends on the duration of the API server outage, regardless of whether the restore is small or big; a big restore just means more underlying requests, based on the number of resources. But it's very reasonable for you guys to be cautious!
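A hypothetical way to express the bound sseago is asking for (the helper name and the 2-minute figure come from the discussion above, not from Velero's code): cap per-item retries with a deadline so a persistently unavailable API server yields a recorded error instead of a multi-hour stall.

```go
package restore

import (
	"context"
	"time"

	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
	crclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// patchItemWithDeadline retries an individual item patch only on connection
// refused and only until a hard per-item deadline, so the caller can log the
// error and move on to the next item instead of stalling the whole restore.
func patchItemWithDeadline(parent context.Context, c crclient.Client, obj crclient.Object, patch crclient.Patch) error {
	// The 2-minute figure is the per-item ceiling mentioned in the discussion.
	ctx, cancel := context.WithTimeout(parent, 2*time.Minute)
	defer cancel()

	backoff := wait.Backoff{Duration: time.Second, Factor: 2.0, Steps: 8, Cap: 30 * time.Second}
	return retry.OnError(backoff, utilnet.IsConnectionRefused, func() error {
		// Once the deadline passes, the patch returns a context error, which
		// is not retriable, so control returns to the caller promptly.
		return c.Patch(ctx, obj, patch)
	})
}
```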

kaovilai commented 3 weeks ago

Trying Requeue + Fail (without a velero pod restart) in another PR; we can discuss whether we still want this retry PR later.

kaovilai commented 3 weeks ago

From the community meeting: this PR is currently considered too specific and not applicable to most users, and for the users it does apply to, running backups during an API server outage is considered risky.

kaovilai commented 3 weeks ago

Requeue approach at https://github.com/vmware-tanzu/velero/pull/7863
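For context, a rough sketch of what a requeue-on-connection-refused step could look like (a hypothetical helper, not code from PR #7863):

```go
package controller

import (
	"context"
	"time"

	utilnet "k8s.io/apimachinery/pkg/util/net"
	ctrl "sigs.k8s.io/controller-runtime"
	crclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// patchOrRequeue is a hypothetical helper: when the terminal status patch
// fails because the API server refuses connections, the reconciler asks
// controller-runtime to try again later instead of leaving the object stuck
// InProgress until the velero pod restarts.
func patchOrRequeue(ctx context.Context, c crclient.Client, obj crclient.Object, patch crclient.Patch) (ctrl.Result, error) {
	if err := c.Status().Patch(ctx, obj, patch); err != nil {
		if utilnet.IsConnectionRefused(err) {
			// Requeue after a delay so the status patch is attempted again
			// on a later reconcile.
			return ctrl.Result{RequeueAfter: time.Minute}, nil
		}
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```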