openshift / assisted-service

Apache License 2.0
109 stars 209 forks source link

MGMT-10006: Wait for host to deprovision in BMAC #6666

Closed carbonin closed 4 weeks ago

carbonin commented 1 month ago

Only BMAC should be aware of the BMH and its status. When deprovisioning a node by deleting a BMH, make BMAC wait for the node to fully deprovision before annotating and deleting the agent and removing the finalizer.

This adds a new annotation for the agent resource (agent.agent-install/clean-spoke-on-delete) which is used in place of the bmac annotation to communicate to the agent controller that the node should be removed.

This gets us a bit closer to https://issues.redhat.com/browse/MGMT-10006 by removing the BMH concept from the "remove a node" flow in the agent controller. This will allow this logic to be more easily reused in the late-binding/non-bmh case.

List all the issues related to this PR

https://issues.redhat.com/browse/MGMT-10006

What environments does this code impact?

How was this code tested?

Tested manually using a dev-scripts cluster.

  1. Deployed a 6 node OCP cluster from a bare metal hub
  2. Annotated and removed a worker node BMH with bmac.agent-install.openshift.io/remove-agent-and-node-on-delete and removed the BMH
  3. Waited for successful removal

Checklist

Reviewers Checklist

openshift-ci-robot commented 1 month ago

@carbonin: This pull request references MGMT-10006 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.17.0" version, but no target version was set.

In response to [this](https://github.com/openshift/assisted-service/pull/6666): >Only BMAC should be aware of the BMH and its status. When deprovisioning a node by deleting a BMH, make BMAC wait for the node to fully deprovision before annotating and deleting the agent and removing the finalizer. > >This adds a new annotation for the agent resource (`agent.agent-install/clean-spoke-on-delete`) which is used in place of the bmac annotation to communicate to the agent controller that the node should be removed. > >This gets us a bit closer to https://issues.redhat.com/browse/MGMT-10006 by removing the BMH concept from the "remove a node" flow in the agent controller. This will allow this logic to be more easily reused in the late-binding/non-bmh case. > >## List all the issues related to this PR > >https://issues.redhat.com/browse/MGMT-10006 > >- [ ] New Feature >- [x] Enhancement >- [ ] Bug fix >- [ ] Tests >- [ ] Documentation >- [ ] CI/CD > >## What environments does this code impact? > >- [ ] Automation (CI, tools, etc) >- [ ] Cloud >- [x] Operator Managed Deployments >- [ ] None > >## How was this code tested? > >Tested manually using a dev-scripts cluster. >1. Deployed a 6 node OCP cluster from a bare metal hub >2. Annotated and removed a worker node BMH with `bmac.agent-install.openshift.io/remove-agent-and-node-on-delete` and removed the BMH >3. Waited for successful removal > > > >- [ ] assisted-test-infra environment >- [x] dev-scripts environment >- [ ] Reviewer's test appreciated >- [x] Waiting for CI to do a full test run >- [ ] Manual (Elaborate on how it was tested) >- [ ] No tests needed > >## Checklist > >- [x] Title and description added to both, commit and PR. >- [x] Relevant issues have been associated (see [CONTRIBUTING] guide) >- [x] This change does not require a documentation update (docstring, `docs`, README, etc) >- [x] Does this change include unit-tests (note that code changes require unit-tests) > >## Reviewers Checklist > >- Are the title and description (in both PR and commit) meaningful and clear? >- Is there a bug required (and linked) for this change? >- Should this PR be backported? Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fassisted-service). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
carbonin commented 1 month ago

/test ?

openshift-ci[bot] commented 1 month ago

@carbonin: The following commands are available to trigger required jobs:

The following commands are available to trigger optional jobs:

Use /test all to run the following jobs that were automatically triggered:

In response to [this](https://github.com/openshift/assisted-service/pull/6666#issuecomment-2276433656): >/test ? Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: carbonin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/assisted-service/blob/master/OWNERS)~~ [carbonin] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
carbonin commented 1 month ago

/hold

Want to be able to run the remove node job, but it's only configured as a periodic. I'll fix that then run it from here.

carbonin commented 1 month ago

Added the presubmit job in https://github.com/openshift/release/pull/55372

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 69.40%. Comparing base (6dd2882) to head (a8a9b4e). Report is 12 commits behind head on master.

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/openshift/assisted-service/pull/6666/graphs/tree.svg?width=650&height=150&src=pr&token=YOR4NSSOXQ&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift)](https://app.codecov.io/gh/openshift/assisted-service/pull/6666?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift) ```diff @@ Coverage Diff @@ ## master #6666 +/- ## ========================================== + Coverage 68.53% 69.40% +0.86% ========================================== Files 246 246 Lines 36740 37928 +1188 ========================================== + Hits 25179 26323 +1144 - Misses 9318 9337 +19 - Partials 2243 2268 +25 ``` | [Files](https://app.codecov.io/gh/openshift/assisted-service/pull/6666?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift) | Coverage Δ | | |---|---|---| | [...nternal/controller/controllers/agent\_controller.go](https://app.codecov.io/gh/openshift/assisted-service/pull/6666?src=pr&el=tree&filepath=internal%2Fcontroller%2Fcontrollers%2Fagent_controller.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift#diff-aW50ZXJuYWwvY29udHJvbGxlci9jb250cm9sbGVycy9hZ2VudF9jb250cm9sbGVyLmdv) | `74.66% <100.00%> (-0.12%)` | :arrow_down: | | [...nal/controller/controllers/bmh\_agent\_controller.go](https://app.codecov.io/gh/openshift/assisted-service/pull/6666?src=pr&el=tree&filepath=internal%2Fcontroller%2Fcontrollers%2Fbmh_agent_controller.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift#diff-aW50ZXJuYWwvY29udHJvbGxlci9jb250cm9sbGVycy9ibWhfYWdlbnRfY29udHJvbGxlci5nbw==) | `79.29% <100.00%> (+2.67%)` | :arrow_up: | ... and [9 files with indirect coverage changes](https://app.codecov.io/gh/openshift/assisted-service/pull/6666/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift)
carbonin commented 1 month ago

/test ?

openshift-ci[bot] commented 1 month ago

@carbonin: The following commands are available to trigger required jobs:

The following commands are available to trigger optional jobs:

Use /test all to run the following jobs that were automatically triggered:

In response to [this](https://github.com/openshift/assisted-service/pull/6666#issuecomment-2277857360): >/test ? Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
carbonin commented 1 month ago

/test edge-e2e-ai-operator-ztp-remove-node

carbonin commented 1 month ago

/unhold

carbonin commented 1 month ago

@eranco74 can you take a look at this one?

openshift-ci-robot commented 4 weeks ago

/retest-required

Remaining retests: 0 against base HEAD 61c279a23a9720d34a452befad7a6c8c3db11b80 and 2 for PR HEAD a8a9b4e31cfdb7c4ca61aaeacf371092467e0644 in total

gamli75 commented 4 weeks ago

/retest-required

gamli75 commented 4 weeks ago

/test edge-e2e-ai-operator-ztp

carbonin commented 4 weeks ago

Failed during finalizing.

Events:

  {
    "cluster_id": "2d5a017d-dac0-4fb6-bee4-4a6a65665015",
    "event_time": "2024-08-15T12:55:31.616Z",
    "message": "Updated status of the cluster to finalizing",
    "name": "cluster_status_updated",
    "severity": "info"
  },
  {
    "cluster_id": "2d5a017d-dac0-4fb6-bee4-4a6a65665015",
    "event_time": "2024-08-15T12:56:01.518Z",
    "message": "Updated finalizing stage of the cluster to 'Waiting for cluster operators'",
    "name": "cluster_finalizing_stage_updated",
    "severity": "info"
  },
  {
    "cluster_id": "2d5a017d-dac0-4fb6-bee4-4a6a65665015",
    "event_time": "2024-08-15T12:56:01.530Z",
    "message": "Operator console status: progressing message: ",
    "name": "cluster_operator_status",
    "severity": "info"
  },
  {
    "cluster_id": "2d5a017d-dac0-4fb6-bee4-4a6a65665015",
    "event_time": "2024-08-15T12:56:01.539Z",
    "message": "Operator cvo status: failed message: ",
    "name": "cluster_operator_status",
    "severity": "info"
  },

Controller logs show CVO failure then connection refused from the API server:

time="2024-08-15T13:02:01Z" level=info msg="CVO status conditions: [{Type:RetrievedUpdates Status:False LastTransitionTime:2024-08-15 12:48:05 +0000 UTC Reason:VersionNotFound Message:Unable to retrieve available updates: currently reconciling cluster version 4.17.0-0.nightly-2024-08-13-031847 not found in the \"stable-4.17\" channel} {Type:ImplicitlyEnabledCapabilities Status:False LastTransitionTime:2024-08-15 12:48:05 +0000 UTC Reason:AsExpected Message:Capabilities match configured spec} {Type:ReleaseAccepted Status:True LastTransitionTime:2024-08-15 12:48:05 +0000 UTC Reason:PayloadLoaded Message:Payload loaded version=\"4.17.0-0.nightly-2024-08-13-031847\" image=\"registry.build03.ci.openshift.org/ci-op-nlt6g4kr/release@sha256:609cf659f5182ccd0e85987cde6fa234b68bdf6dab8729860d7413445d0e5dda\" architecture=\"amd64\"} {Type:Available Status:False LastTransitionTime:2024-08-15 12:48:05 +0000 UTC Reason: Message:} {Type:Failing Status:True LastTransitionTime:2024-08-15 13:00:03 +0000 UTC Reason:MultipleErrors Message:Multiple errors are preventing progress:\n* Cluster operators authentication, etcd, image-registry, ingress, kube-apiserver, kube-controller-manager, monitoring, openshift-apiserver, openshift-controller-manager, openshift-samples, operator-lifecycle-manager-packageserver are not available\n* Could not update imagestream \"openshift/driver-toolkit\" (612 of 900): the server is down or not responding\n* Could not update oauthclient \"console\" (546 of 900): the server does not recognize this resource, check extension API servers\n* Could not update role \"openshift-console-operator/prometheus-k8s\" (818 of 900): resource may have been deleted\n* Could not update role \"openshift-console/prometheus-k8s\" (822 of 900): resource may have been deleted} {Type:Progressing Status:True LastTransitionTime:2024-08-15 12:48:05 +0000 UTC Reason:MultipleErrors Message:Unable to apply 4.17.0-0.nightly-2024-08-13-031847: an unknown error has occurred: MultipleErrors}]" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.ClusterVersionHandler.GetStatus file="/go/src/github.com/openshift/assisted-installer/src/assisted_installer_controller/operator_handler.go:128"
time="2024-08-15T13:03:01Z" level=error msg="Failed to check if console is enabled" func="github.com/openshift/assisted-installer/src/assisted_installer_controller.(*controller).waitingForClusterOperators.func1" file="/go/src/github.com/openshift/assisted-installer/src/assisted_installer_controller/assisted_installer_controller.go:1033" error="Get \"https://localhost:6443/apis/config.openshift.io/v1/clusterversions/version\": dial tcp [::1]:6443: connect: connection refused"
time="2024-08-15T13:04:01Z" level=info msg="Start uploading controller logs (intermediate snapshot)" func="github.com/openshift/assisted-installer/src/assisted_installer_controller.(*controller).UploadLogs" file="/go/src/github.com/openshift/assisted-installer/src/assisted_installer_controller/assisted_installer_controller.go:1464"
time="2024-08-15T13:04:01Z" level=info msg="Uploading logs for assisted-installer-controller-lzzk7 in assisted-installer" func=github.com/openshift/assisted-installer/src/common.UploadPodLogs file="/go/src/github.com/openshift/assisted-installer/src/common/common.go:136"
time="2024-08-15T13:04:01Z" level=error msg="Failed to get logs from kube-api, reading from file" func=github.com/openshift/assisted-installer/src/common.GetControllerPodLogs file="/go/src/github.com/openshift/assisted-installer/src/common/common.go:112" error="Get \"https://localhost:6443/api/v1/namespaces/assisted-installer/pods/assisted-installer-controller-lzzk7/log\": dial tcp [::1]:6443: connect: connection refused"
carbonin commented 4 weeks ago

/retest-required

openshift-ci[bot] commented 4 weeks ago

@carbonin: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).