openshift / cluster-version-operator

Apache License 2.0

OTA-861: Set Upgradeable=False when there is an upgrade in progress #1080

Closed hongkailiu closed 3 days ago

hongkailiu commented 3 months ago

This PR adds an Upgradeable check that fails on Progressing=True in clusterversion.status.conditions. In other words, Upgradeable=False if an upgrade is in progress, whether at the minor level or the patch level.
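A minimal sketch of the check described above, not the CVO's actual code. The `Condition` struct, the function names, and the `UpdateInProgress` reason are all hypothetical, and the in-progress signal is assumed to be the ClusterVersion `Progressing` condition.

```go
package main

// Condition is a simplified, hypothetical stand-in for one entry in
// clusterversion.status.conditions. It is not the real API type.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// findCondition returns a pointer to the condition of the given type,
// or nil when no such condition is present.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// upgradeableCheck returns an Upgradeable=False condition while an update
// is in progress (assumed here to mean Progressing=True) and nil otherwise.
// The reason and message strings are illustrative assumptions.
func upgradeableCheck(conds []Condition) *Condition {
	progressing := findCondition(conds, "Progressing")
	if progressing == nil || progressing.Status != "True" {
		return nil
	}
	return &Condition{
		Type:    "Upgradeable",
		Status:  "False",
		Reason:  "UpdateInProgress",
		Message: "An update is already in progress.",
	}
}
```

Returning nil when nothing is in progress lets a caller combine this result with other Upgradeable checks without special-casing the idle cluster.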

In addition, this PR syncs the Upgradeable condition at shutdown time so that a value left unsaved because of the throttle still gets saved. This part could be a separate PR too.

For example, it blocks the upgrade to 4.16.1 until the ongoing upgrade 4.14.35 -> 4.15.29 completes.

It also covers the case 4.14.15 -> 4.14.35 -> 4.15.29, where the upgrade 4.14.35 -> 4.15.29 is blocked until the upgrade 4.14.15 -> 4.14.35 completes.

Note that we still allow an upgrade to 4.y+1.z'' in the middle of the upgrade 4.y.z -> 4.y+1.z', even though a direct upgrade 4.y.z -> 4.y+1.z'' might not be supported. This is because the upgrade 4.y.z -> 4.y+1.z' might be unable to complete due to a bug in 4.y+1.z' that has a fix in 4.y+1.z''. We need the retarget to land 4.y+1 on the cluster.
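One way to read the examples above is as a comparison of X.Y between the in-progress target and the newly requested target. The sketch below encodes only that reading. `majorMinor` and `retargetAllowed` are hypothetical helpers, the plain "X.Y.Z" version format is an assumption, and this is not how the CVO evaluates the Upgradeable condition.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorMinor extracts X and Y from a version string of the form "X.Y.Z".
func majorMinor(version string) (major, minor int, err error) {
	parts := strings.Split(version, ".")
	if len(parts) != 3 {
		return 0, 0, fmt.Errorf("version %q is not of the form X.Y.Z", version)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, fmt.Errorf("invalid major in %q: %w", version, err)
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, fmt.Errorf("invalid minor in %q: %w", version, err)
	}
	return major, minor, nil
}

// retargetAllowed reports whether newTarget stays on the same X.Y as the
// target of the update that is already in progress. Under the reading in
// the lead-in, a same-minor retarget is allowed and a minor bump is blocked.
func retargetAllowed(inProgressTarget, newTarget string) (bool, error) {
	curMajor, curMinor, err := majorMinor(inProgressTarget)
	if err != nil {
		return false, err
	}
	newMajor, newMinor, err := majorMinor(newTarget)
	if err != nil {
		return false, err
	}
	return curMajor == newMajor && curMinor == newMinor, nil
}
```

With these helpers, a request for 4.16.1 during 4.14.35 -> 4.15.29 is rejected, a request for 4.15.29 during 4.14.15 -> 4.14.35 is rejected, and a retarget from 4.15.29 to a later 4.15.z is accepted.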

For OTA-861, the guard on retargeting to a minor-level upgrade will be added in a follow-up PR.


Update: With the throttle on "upgradeable" disabled at shutdown, the race window between "CVO starts rolling the new version out" and "CVO gets shut down" is very short (less than one second in the test). We do not need to add back the second guard that was dropped in commit 362e9ca912edd948b93c2b0545c508cb4bf7bd84.

The acceptedRisks for the Y-then-Z upgrade in the dropped commit will be handled in another PR.

hongkailiu commented 3 months ago

This PR is replacing https://github.com/openshift/cluster-version-operator/pull/1079

hongkailiu commented 3 months ago

/test unit

DavidHurta commented 3 months ago

/cc

wking commented 3 months ago

Exercise a 4.17 -> this-pull update, so we can see Upgradeable=False while we're mid-update.

/payload-job periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-gcp-ovn-upgrade

openshift-ci[bot] commented 3 months ago

@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7d5ec080-663b-11ef-899f-2883517bb747-0

petr-muller commented 2 months ago

/uncc

David and Trevor are involved in this one, seems enough ;)

DavidHurta commented 2 months ago

/title OTA-861: inhibit the 2nd minor version upgrade

DavidHurta commented 2 months ago

/retitle OTA-861: inhibit the 2nd minor version upgrade

openshift-ci-robot commented 2 months ago

@hongkailiu: This pull request references OTA-861 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-version-operator/pull/1080):

> This PR prevents the cluster from being upgraded to x.y+2.z1 while the upgrade to x.y+1.z2 from x.y.z3 is still in progress.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-version-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

petr-muller commented 2 months ago

/test all

openshift-ci-robot commented 1 month ago

@hongkailiu: This pull request references OTA-861 which is a valid jira issue.

In response to [this](https://github.com/openshift/cluster-version-operator/pull/1080):

> This PR prevents the cluster from being upgraded to x.y+2.z1 while the upgrade to x.y+1.z2 from x.y.z3 is still in progress.
>
> Update: [Block Y stream upgrade if any upgrade is in progress](https://github.com/openshift/cluster-version-operator/pull/1080/commits/5ea5ebe0e12bf1a40ab21a31a5f6d5930766e95e)
>
> This commit extends the guard on the 2nd Y-stream upgrade, i.e., blocking a Y-stream upgrade if there is already a Y-stream upgrade in progress, to a guard on any Y-stream upgrade if there is already an upgrade in progress, regardless of Y-stream or Z-stream.
>
> For example, it covers the case 4.14.15 -> 4.14.35 -> 4.15.29, where the upgrade 4.14.35 -> 4.15.29 is blocked until the upgrade 4.14.15 -> 4.14.35 completes.
>
> Note that we still allow an upgrade to 4.y+1.z'' in the middle of the upgrade 4.y.z -> 4.y+1.z', even though a direct upgrade 4.y.z -> 4.y+1.z'' might not be supported. This is because the upgrade 4.y.z -> 4.y+1.z' might be unable to complete due to a bug in 4.y+1.z' that has a fix in 4.y+1.z''. We need the retarget to land 4.y+1 on the cluster.

openshift-ci-robot commented 1 month ago

@hongkailiu: This pull request references OTA-861 which is a valid jira issue.

In response to [this](https://github.com/openshift/cluster-version-operator/pull/1080):

> This PR adds a guard that blocks a Y-stream upgrade if there is already an upgrade in progress, regardless of Y-stream or Z-stream.
>
> For example, it blocks the upgrade to 4.16.1 until the ongoing upgrade 4.14.35 -> 4.15.29 completes.
>
> It also covers the case 4.14.15 -> 4.14.35 -> 4.15.29, where the upgrade 4.14.35 -> 4.15.29 is blocked until the upgrade 4.14.15 -> 4.14.35 completes.
>
> Note that we still allow an upgrade to 4.y+1.z'' in the middle of the upgrade 4.y.z -> 4.y+1.z', even though a direct upgrade 4.y.z -> 4.y+1.z'' might not be supported. This is because the upgrade 4.y.z -> 4.y+1.z' might be unable to complete due to a bug in 4.y+1.z' that has a fix in 4.y+1.z''. We need the retarget to land 4.y+1 on the cluster.

petr-muller commented 1 month ago

https://github.com/openshift/release/pull/57408 should fix the hypershift jobs

hongkailiu commented 1 month ago

/label tide/merge-method-squash

hongkailiu commented 1 month ago

/retest-required

hongkailiu commented 1 month ago

/test e2e-hypershift

hongkailiu commented 1 month ago

/test unit

hongkailiu commented 1 month ago

/test e2e-hypershift

hongkailiu commented 1 month ago

/test unit

openshift-ci[bot] commented 2 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/openshift/cluster-version-operator/blob/master/OWNERS)~~ [petr-muller,wking]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

petr-muller commented 2 weeks ago

/retest

evakhoni commented 4 days ago

pre-merge verified successfully in all four variants

/label qe-approved

wking commented 3 days ago

`operators should not create watch channels very often` is unrelated:

/override ci/prow/e2e-agnostic-ovn

openshift-ci[bot] commented 3 days ago

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-ovn

In response to [this](https://github.com/openshift/cluster-version-operator/pull/1080#issuecomment-2499263289):

> [`operators should not create watch channels very often`][1] is unrelated:
>
> /override ci/prow/e2e-agnostic-ovn
>
> [1]: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1080/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn/1861146112841224192

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

openshift-ci[bot] commented 3 days ago

@hongkailiu: all tests passed!

Full PR test history. Your PR dashboard.

openshift-bot commented 3 days ago

[ART PR BUILD NOTIFIER]

Distgit: cluster-version-operator

This PR has been included in build cluster-version-operator-container-v4.19.0-202411260335.p0.gb6b7345.assembly.stream.el9. All builds following this will include this PR.