openshift / assisted-installer

Apache License 2.0

OCPBUGS-41811: (agent-based installer) let the bootstrap wait for workers before the reboot #910

Closed andfasano closed 1 month ago

andfasano commented 1 month ago

As described by the analysis in https://issues.redhat.com/browse/OCPBUGS-41811, in some cases the bootstrap node may reboot before the workers have started the joining process, thus removing the assisted-service that the workers still require. This prevents the workers from successfully joining the cluster and causes the cluster deployment to fail.

This patch introduces an explicit synchronization between the bootstrap node and the workers (only when the installation is performed via the agent-based installer), delaying the bootstrap reboot until all the workers have passed the `waiting for control plane` stage.

openshift-ci-robot commented 1 month ago

@andfasano: This pull request references Jira Issue OCPBUGS-41811, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

andfasano commented 1 month ago

/jira refresh

openshift-ci-robot commented 1 month ago

@andfasano: This pull request references Jira Issue OCPBUGS-41811, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug:
* bug is open, matching expected state (open)
* bug target version (4.18.0) matches configured target version for branch (4.18.0)
* bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @anuragthehatter

andfasano commented 1 month ago

/test ?

openshift-ci[bot] commented 1 month ago

@andfasano: The following commands are available to trigger required jobs:

The following commands are available to trigger optional jobs:

Use /test all to run the following jobs that were automatically triggered:

andfasano commented 1 month ago

/test e2e-agent-ha-dualstack
/test e2e-agent-sno-ipv6

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 43.47826% with 13 lines in your changes missing coverage. Please review.

Project coverage is 55.61%. Comparing base (2bb58e1) to head (a484397). Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/installer/installer.go | 43.47% | 8 Missing and 5 partials :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #910      +/-   ##
==========================================
- Coverage   55.70%   55.61%   -0.09%
==========================================
  Files          15       15
  Lines        3208     3231      +23
==========================================
+ Hits         1787     1797      +10
- Misses       1249     1257       +8
- Partials      172      177       +5
```

| [Files with missing lines](https://app.codecov.io/gh/openshift/assisted-installer/pull/910?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift) | Coverage Δ | |
|---|---|---|
| [src/installer/installer.go](https://app.codecov.io/gh/openshift/assisted-installer/pull/910?src=pr&el=tree&filepath=src%2Finstaller%2Finstaller.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift#diff-c3JjL2luc3RhbGxlci9pbnN0YWxsZXIuZ28=) | `68.13% <43.47%> (-1.01%)` | :arrow_down: |

andfasano commented 1 month ago

/test e2e-agent-ha-dualstack
/test e2e-agent-sno-ipv6

andfasano commented 1 month ago

/test e2e-agent-ha-dualstack
/test e2e-agent-sno-ipv6

andfasano commented 1 month ago

/test e2e-agent-ha-dualstack
/test e2e-agent-sno-ipv6

tsorya commented 1 month ago

/hold

tsorya commented 1 month ago

/unhold

tsorya commented 1 month ago

/lgtm

tsorya commented 1 month ago

/approve

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano, tsorya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/openshift/assisted-installer/blob/master/OWNERS)~~ [tsorya]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

openshift-ci[bot] commented 1 month ago

@andfasano: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/edge-e2e-metal-assisted-odf-4-16 | a4843975da7b4122cfc2dd78edc2d5802fb61bac | link | false | `/test edge-e2e-metal-assisted-odf-4-16` |
| ci/prow/edge-e2e-metal-assisted-cnv-4-16 | a4843975da7b4122cfc2dd78edc2d5802fb61bac | link | false | `/test edge-e2e-metal-assisted-cnv-4-16` |

openshift-ci-robot commented 1 month ago

@andfasano: Jira Issue OCPBUGS-41811: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-41811 has been moved to the MODIFIED state.

openshift-bot commented 1 month ago

[ART PR BUILD NOTIFIER]

Distgit: ose-agent-installer-orchestrator
This PR has been included in build ose-agent-installer-orchestrator-container-v4.18.0-202410011141.p0.g51dc014.assembly.stream.el9. All builds following this will include this PR.

openshift-bot commented 1 month ago

[ART PR BUILD NOTIFIER]

Distgit: ose-agent-installer-csr-approver
This PR has been included in build ose-agent-installer-csr-approver-container-v4.18.0-202410011141.p0.g51dc014.assembly.stream.el9. All builds following this will include this PR.