OCPBUGS-31255: Degraded worker pool when applying configs to master and worker pools at the same time

dkhater-redhat commented 2 weeks ago

- What I did

Added a sleep interval to delay updates on nodes.

- How to verify it

Run the scripts found in the Jira bug report and note whether a degradation occurs.

./continuous_update.sh                                                                            
MCP master updated.
Creating mc-reproducer-master
machineconfig.machineconfiguration.openshift.io/mc-reproducer-master created
MCP worker updated.
Creating mc-reproducer-worker
machineconfig.machineconfiguration.openshift.io/mc-reproducer-worker created
MCP master updated.
Deleting mc-reproducer-master
machineconfig.machineconfiguration.openshift.io "mc-reproducer-master" deleted
MCP worker updated.
Deleting mc-reproducer-worker
machineconfig.machineconfiguration.openshift.io "mc-reproducer-worker" deleted
MCP master updated.
Creating mc-reproducer-master
machineconfig.machineconfiguration.openshift.io/mc-reproducer-master created
MCP worker updated.
Creating mc-reproducer-worker
machineconfig.machineconfiguration.openshift.io/mc-reproducer-worker created
MCP master updated.
Deleting mc-reproducer-master
machineconfig.machineconfiguration.openshift.io "mc-reproducer-master" deleted
MCP worker updated.
Deleting mc-reproducer-worker
machineconfig.machineconfiguration.openshift.io "mc-reproducer-worker" deleted
...

MCO/bugs/degraded-worker-pool-loop on ☁️  (us-east-1) on ☁️  aos-serviceaccount@openshift-gce-devel.iam.gserviceaccount.com 
❯ ./watcher.sh                                                                                      
^C

❯ oc get mcp -w
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-6b6f9caa3a47918196310a162d5e82b2   True      False      False      3              3                   3                     0                      102m
worker   rendered-worker-c61033de989233f93639617ae5e8c42b   True      False      False      3              3                   3                     0                      102m
^C%

no degradation seen
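The check above was done by eye on the `oc get mcp` output. A minimal sketch of automating that check could look like the following. The function name `degraded_pools` is hypothetical, and the parsing assumes every column in the table is populated, since it splits rows on whitespace; it is not part of the scripts attached to the Jira bug.

```python
from typing import List

# Output copied from the `oc get mcp -w` transcript above.
SAMPLE = """
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-6b6f9caa3a47918196310a162d5e82b2   True      False      False      3              3                   3                     0                      102m
worker   rendered-worker-c61033de989233f93639617ae5e8c42b   True      False      False      3              3                   3                     0                      102m
"""


def degraded_pools(oc_output: str) -> List[str]:
    """Return the names of pools whose DEGRADED column reads True.

    Assumption: no column is empty, so splitting on whitespace keeps
    each value aligned with its header.
    """
    lines = [line for line in oc_output.splitlines() if line.strip()]
    if not lines:
        return []

    header = lines[0].split()
    name_idx = header.index("NAME")
    degraded_idx = header.index("DEGRADED")

    degraded: List[str] = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= degraded_idx:
            # Skip rows that are too short to contain a DEGRADED value.
            continue
        if fields[degraded_idx] == "True":
            degraded.append(fields[name_idx])
    return degraded
```

Calling `degraded_pools(SAMPLE)` on the transcript above returns an empty list, which matches the "no degradation seen" observation.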

- Description for the changelog

openshift-ci-robot commented 2 weeks ago

@dkhater-redhat: This pull request references Jira Issue OCPBUGS-31255, which is invalid:

expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

openshift-ci[bot] commented 2 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dkhater-redhat

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/machine-config-operator/blob/master/OWNERS)~~ [dkhater-redhat]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

dkhater-redhat commented 2 weeks ago

/retest-required

dkhater-redhat commented 2 weeks ago

/retest-required

dkhater-redhat commented 5 days ago

Taking another look at this: I tried modifying the writer in the daemon to add a "retry" when applying configurations. However, with that change and increased logging within the update.go sync loop, I noticed that the node degradation was occurring before the update loop. To me, this confirms that what we were seeing was not caused by the daemon. I reverted to my original method of adding a sleep within the node controller. Not only did I see no degradations, but the MCP update time decreased by about 8x. This leads me to believe that there could be a number of issues at play here:

- If the API is being rate-limited, adding a delay can reduce the frequency of requests, allowing the system to handle them without hitting rate limits.
- There might be a delay in the propagation of state changes across the cluster. Introducing a delay gives the system time to stabilize and helps make sure all nodes have the correct state before proceeding.
- Concurrent operations on multiple nodes can cause conflicts or race conditions.
- The nodes or the cluster might need some time to free up resources or complete ongoing operations before they can proceed with the updates.

I do not mind playing with how we introduce this sleep, but I do believe this is causing the error at hand.
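To illustrate the general idea of pacing updates, here is a minimal sketch of applying an update to a list of nodes with a fixed pause between consecutive nodes. It is not the change made in this PR, which lives in the Go node controller. The names `apply_with_delay`, `apply_update`, and `delay_seconds` are hypothetical, and the `sleep` parameter exists only so the pause can be swapped out when testing.

```python
import time
from typing import Callable, Iterable, List


def apply_with_delay(
    nodes: Iterable[str],
    apply_update: Callable[[str], None],
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Apply an update to each node, pausing between consecutive nodes.

    Returns the nodes in the order they were updated.
    """
    updated: List[str] = []
    for index, node in enumerate(nodes):
        if index > 0:
            # Pause before every node except the first, giving the API
            # server and cluster state time to settle between updates.
            sleep(delay_seconds)
        apply_update(node)
        updated.append(node)
    return updated
```

A fixed pause is the simplest form of pacing. If rate limiting turned out to be the real cause, a retry with backoff on the failing request would be a more targeted alternative, but that is a separate design and not something this thread confirms.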

dkhater-redhat commented 5 days ago

/retest-required

openshift-ci[bot] commented 4 days ago

@dkhater-redhat: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-vsphere-ovn-zones | fc868bd1973d7be4646844e6b7355af2cd5b09a2 | link | false | `/test e2e-vsphere-ovn-zones` |
| ci/prow/unit | fc868bd1973d7be4646844e6b7355af2cd5b09a2 | link | true | `/test unit` |
| ci/prow/e2e-aws-ovn-upgrade-out-of-change | fc868bd1973d7be4646844e6b7355af2cd5b09a2 | link | false | `/test e2e-aws-ovn-upgrade-out-of-change` |
| ci/prow/e2e-vsphere-ovn-upi | fc868bd1973d7be4646844e6b7355af2cd5b09a2 | link | false | `/test e2e-vsphere-ovn-upi` |

dkhater-redhat commented 4 days ago

I am going to close this PR and mark the bug as "won't do". We could not pinpoint the exact error this bug was hitting within the time-boxed period allotted for it. We are marking it as "won't do" because the scenario this bug tests is one in which the MCO is being "abused": reapplying and deleting machine configs in such quick succession (as the script given in the bug report does) is not a recommended use of the MCO, and we don't expect our customers to interact with the MCO in this fashion.

We are very grateful to @sergiordlr for finding this bug, and he should feel empowered to reopen it if a customer or someone internal starts to see this issue in the future. But for now, we are going to close it.

openshift-ci-robot commented 4 days ago

@dkhater-redhat: This pull request references Jira Issue OCPBUGS-31255. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.

openshift / machine-config-operator

OCPBUGS-31255: Degraded worker pool when applying configs to master and worker pools at the same time #4451