Open kaovilai opened 1 month ago
Todo: fix unit tests on mock assertions.
--- FAIL: Test_restoreFinalizerReconciler_finishProcessing (0.33s)
    /Users/tiger/git/velero/pkg/controller/restore_finalizer_controller_test.go:581:
        Error Trace:    /Users/tiger/git/velero/pkg/controller/restore_finalizer_controller_test.go:581
                        /Users/tiger/git/velero/pkg/controller/restore_finalizer_controller_test.go:611
        Error:          Not equal:
                        expected: 2
                        actual  : 4
        Test:           Test_restoreFinalizerReconciler_finishProcessing
        Messages:       Expected number of calls (2) does not match the actual number of calls (4).
    --- FAIL: Test_restoreFinalizerReconciler_finishProcessing/restore_failed_to_patch_status,_should_retry_on_connection_refused (0.33s)
        /Users/tiger/git/velero/pkg/controller/restore_finalizer_controller_test.go:609: restoreFinalizerReconciler.finishProcessing() error = connection refused, wantErr false
        /Users/tiger/git/velero/pkg/controller/restore_finalizer_controller_test.go:612: mockClientAsserts() failed
FAIL
FAIL    github.com/vmware-tanzu/velero/pkg/controller   1.090s
FAIL
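The mismatch above is the usual symptom of adding retries under a test that pins an exact mock call count. A minimal, self-contained sketch of the pattern using testify's mock package — mockClient and its Patch method are illustrative stand-ins, not Velero's actual test doubles:

```go
package controller

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/mock"
)

// mockClient is a hypothetical testify mock standing in for the kube
// client the reconciler patches status through.
type mockClient struct {
	mock.Mock
}

func (m *mockClient) Patch(obj interface{}) error {
	args := m.Called(obj)
	return args.Error(0)
}

func Test_patchCallCount(t *testing.T) {
	m := new(mockClient)
	// The first three attempts fail with a transient error, then succeed.
	m.On("Patch", mock.Anything).Return(errors.New("connection refused")).Times(3)
	m.On("Patch", mock.Anything).Return(nil).Once()

	// Simulate a reconciler that now retries until the patch succeeds.
	var err error
	for i := 0; i < 4; i++ {
		if err = m.Patch("restore"); err == nil {
			break
		}
	}
	if err != nil {
		t.Fatalf("patch never succeeded: %v", err)
	}

	// An assertion still pinned to the pre-retry count of 2 fails with
	// exactly the message above; retries pushed the actual count to 4.
	m.AssertNumberOfCalls(t, "Patch", 4)
}
```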
Attention: Patch coverage is 3.33333% with 29 lines in your changes missing coverage. Please review.
Project coverage is 58.72%. Comparing base (33633d8) to head (1587186). Report is 4 commits behind head on main.
Received feedback to expand the retry to other status patches, and potentially to backups too. Will ask for feedback in the issue.
We are working on making each k8s API client call retriable in case of internal server errors caused by temporary API outages. client-go offers the ability to implement a custom middleware transport that adds extra behavior or processing to the default HTTP transport (https://github.com/kubernetes/client-go/blob/master/transport/config.go#L68). We are using it to wrap the default HTTP request with retries for specific error types: https://github.com/openshift-kni/lifecycle-agent/pull/548/files#diff-99516c035075962960ff61611a293268a97b3a6535639e92580d0ef8de6eb8cf. Adding retry logic as middleware enhances the resilience of every API client call. Not sure if your team has considered this solution!
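As a rough illustration of that idea, here is a minimal sketch of a retrying transport installed via rest.Config.WrapTransport; the retry count, backoff, and error predicate are assumptions, not the linked implementation:

```go
package transportretry

import (
	"net/http"
	"time"

	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/client-go/rest"
)

// retryRoundTripper retries requests that fail with errors we treat
// as transient (here, only connection refused).
type retryRoundTripper struct {
	delegate http.RoundTripper
	retries  int
	backoff  time.Duration
}

func (rt *retryRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	var resp *http.Response
	var err error
	for attempt := 0; attempt <= rt.retries; attempt++ {
		resp, err = rt.delegate.RoundTrip(req)
		if err == nil || !utilnet.IsConnectionRefused(err) {
			return resp, err
		}
		// NB: a production version must also re-create req.Body via
		// req.GetBody before resending a request that has a body.
		time.Sleep(rt.backoff)
	}
	return resp, err
}

// withRetries installs the middleware on a rest.Config before any
// clients are built from it, so every API call goes through it.
func withRetries(config *rest.Config) *rest.Config {
	config.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		return &retryRoundTripper{delegate: rt, retries: 3, backoff: 2 * time.Second}
	}
	return config
}
```

Because the wrapper sits below every client built from the config, it applies to all API calls uniformly — which is exactly the blanket behavior the next comments raise concerns about.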
That's cool. Will check, thanks @Missxiaoguo
I'm thinking that we want to be careful here. We absolutely want retry on the status transitions away from InProgress and towards terminal states, but if we retry for up to 2 minutes for item restore, then a bad APIServer could cause a Restore to hang for hours/days for a big restore, when what we want to see there is mark the error and move on.
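To make that two-minute bound concrete, here is a sketch of capping retries around only the terminal status patch, using client-go's retry helper; the backoff values and the patchRestoreStatus name are illustrative assumptions, not Velero's actual code:

```go
package controller

import (
	"context"
	"time"

	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// statusPatchBackoff caps total retry time at roughly two minutes:
// sleeps of 1+2+4+8+16+32+64 seconds across 7 steps.
var statusPatchBackoff = wait.Backoff{
	Duration: time.Second,
	Factor:   2.0,
	Steps:    7,
}

// patchRestoreStatus retries only the status patch that moves the
// Restore out of InProgress, so a flaky API server can delay
// completion but cannot stall per-item restore work indefinitely.
func patchRestoreStatus(ctx context.Context, kbClient client.Client, restore *velerov1.Restore, patch client.Patch) error {
	return retry.OnError(statusPatchBackoff,
		utilnet.IsConnectionRefused, // retry only transient connection errors
		func() error {
			return kbClient.Status().Patch(ctx, restore, patch)
		})
}
```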
Yeah, my understanding is that the hang depends on the duration of the API server outage regardless of whether the restore is small or big. The difference with a big restore is more underlying requests, driven by the number of resources. But it's very reasonable for you guys to be cautious!
Trying Requeue + Fail without a Velero pod restart in another PR; we can discuss whether we want this retry PR later.
From the community meeting: this PR is currently considered too specific and not applicable to most users, and for the users it does apply to, backups during an API server outage are considered risky.
Requeue approach at https://github.com/vmware-tanzu/velero/pull/7863
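For reference, here is a sketch of what the requeue alternative looks like in controller-runtime terms; the reconciler struct and patchStatus hook are hypothetical, not the code in that PR:

```go
package controller

import (
	"context"
	"time"

	utilnet "k8s.io/apimachinery/pkg/util/net"
	ctrl "sigs.k8s.io/controller-runtime"
)

// reconciler is a hypothetical stand-in; patchStatus represents the
// terminal status patch that may hit a flaky API server.
type reconciler struct {
	patchStatus func(context.Context) error
}

func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := r.patchStatus(ctx); err != nil {
		if utilnet.IsConnectionRefused(err) {
			// Requeue instead of blocking in-process: the work item
			// comes back after the delay without a pod restart, and
			// the reconciler can mark the restore failed once a
			// deadline passes.
			return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
		}
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```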
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Thank you for contributing to Velero!
Please add a summary of your change
Does your change fix a particular issue?
Fixes #7207
Please indicate you've done the following:
/kind changelog-not-required
as a comment on this pull request.