Closed dkhater-redhat closed 4 days ago
@dkhater-redhat: This pull request references Jira Issue OCPBUGS-31255, which is invalid:
Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dkhater-redhat
/retest-required
/retest-required
Looking back at this, I tried modifying the writer in the daemon to add a "retry" when applying configurations. However, with that change and increased logging in the update.go sync loop, I noticed that the node degradation was occurring before the update loop. To me, this confirms that what we were seeing was not caused by the daemon. I reverted to my original approach of adding a sleep in the node controller. Not only did I see no degradations, but the MCP update time decreased by about 8x. This leads me to believe that there could be a number of issues at play here:
I do not mind playing with how we introduce this sleep, but I do believe this is causing the error at hand.
/retest-required
@dkhater-redhat: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
ci/prow/e2e-vsphere-ovn-zones | fc868bd1973d7be4646844e6b7355af2cd5b09a2 | link | false | /test e2e-vsphere-ovn-zones |
ci/prow/unit | fc868bd1973d7be4646844e6b7355af2cd5b09a2 | link | true | /test unit |
ci/prow/e2e-aws-ovn-upgrade-out-of-change | fc868bd1973d7be4646844e6b7355af2cd5b09a2 | link | false | /test e2e-aws-ovn-upgrade-out-of-change |
ci/prow/e2e-vsphere-ovn-upi | fc868bd1973d7be4646844e6b7355af2cd5b09a2 | link | false | /test e2e-vsphere-ovn-upi |
I'm going to close this PR and mark the bug as "won't do". We could not pinpoint the exact error this bug was hitting within the time-boxed period allotted for it. We are marking it "won't do" because the scenario it tests is one in which the MCO is being "abused": applying and deleting machine configs in such quick succession (as the script in the bug report does) is not a recommended use of the MCO, and we don't expect our customers to interact with the MCO in this fashion.
We are very grateful to @sergiordlr for finding this bug, and he should feel empowered to reopen it if a customer or someone internal starts to see this issue in the future. For now, we are going to close it.
@dkhater-redhat: This pull request references Jira Issue OCPBUGS-31255. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.
- What I did: Added a sleep interval to delay updates on nodes.
- How to verify it: Run the scripts found in the Jira bug report and note whether a degradation occurs. No degradation was seen.
- Description for the changelog