Closed pixelsoccupied closed 1 month ago
@pixelsoccupied: This pull request references Jira Issue OCPBUGS-32493, which is invalid:
Comment /jira refresh
to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
/cc @Missxiaoguo
As discussed, we are being a bit more selective with validateIBUSpec, since an error from it is most likely caused by user error.
We still requeue on known network errors (about 4 types), and those cover the error we are focusing on here, i.e. connection refused because the API server is down.
/hold
doing a bit of testing with the new change
/unhold
/jira refresh
@pixelsoccupied: This pull request references Jira Issue OCPBUGS-32493, which is valid. The bug has been moved to the POST state.
Requesting review from QA contact: /cc @yliu127
/retest
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jc-rh
@pixelsoccupied: Jira Issue OCPBUGS-32493: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-32493 has been moved to the MODIFIED state.
Background / Context
A cluster can expect to run into unexpected network issues, and it is up to the controller to eventually reach the desired state.
Issue / Requirement / Reason for change
We ran into a network outage around the time Prep was trying to Get a deployment CR for the first time. Because it was the first time, the controller needed to reach out to the API server to initialize a new informer. At that point we saw the connection refused error, since the API server was down.
Solution / Feature Overview
The controller should allow time for the API server outage to heal and keep retrying until it does.
Implementation Details
validateIBUSpec: we are being selective with the requeue, since an error here is more likely to come from the user than from the network.
Prep stage state.
Other Information
Tip
To force an API outage
(At this point you need to use crictl to check the logs.)
/cc @donpenney