OCPBUGS-1327: IBMCloud: Replace stuck machine

cjschaef commented 1 year ago

During initial cluster bring up, some component can prevent the IBM Cloud VSI from reaching the MCS, causing it to get stuck when the DHCP lease expires. Attempt to identify these cases, and attempt a single replacement of the machine to mitigate this issue.
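For readers skimming the thread, here is a minimal sketch of the "replace a stuck machine at most once" idea described above. It is written in Python for brevity and is not the provider's code: the `Machine` fields, the `should_replace` helper, and the 30-minute `STUCK_AFTER` threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: how long a machine may exist without joining the
# cluster before this sketch treats it as stuck. Not taken from the PR.
STUCK_AFTER = timedelta(minutes=30)


@dataclass
class Machine:
    name: str
    created_at: datetime
    has_node: bool               # True once the instance has joined as a node
    replaced_once: bool = False  # True if a replacement was already attempted


def should_replace(machine: Machine, now: datetime) -> bool:
    """Return True only for a stuck machine that has not been replaced yet."""
    if machine.has_node:
        # The machine reached the cluster, so it is not stuck.
        return False
    if machine.replaced_once:
        # Only a single replacement is ever attempted.
        return False
    return now - machine.created_at >= STUCK_AFTER


if __name__ == "__main__":
    now = datetime(2022, 11, 3, 18, 0, tzinfo=timezone.utc)
    stuck = Machine("worker-0", now - timedelta(minutes=45), has_node=False)
    print(should_replace(stuck, now))
```

The `replaced_once` guard is what keeps the mitigation to a single attempt: a machine that is still stuck after its replacement is left for an operator to investigate rather than being recreated in a loop.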

openshift-ci-robot commented 1 year ago

@cjschaef: This pull request references Jira Issue OCPBUGS-1327, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

* bug is open, matching expected state (open)
* bug target version (4.12.0) matches configured target version for branch (4.12.0)
* bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @MayXuQQ

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/machine-api-provider-ibmcloud/pull/10):

> During initial cluster bring up, some component can prevent the
> IBM Cloud VSI from reaching the MCS, causing it to get stuck when
> the DHCP lease expires. Attempt to identify these cases, and attempt
> a single replacement of the machine to mitigate this issue.
>
> Related: https://issues.redhat.com/browse/OCPBUGS-1327

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

MayXuQQ commented 1 year ago

Tested with pre-merge build registry.build05.ci.openshift.org/ci-ln-rif5icb/release:latest, using these settings:

* networkType: OVNKubernetes and OpenShiftSDN
* fips: true and false
* publish: Internal and External
* region: eu-de, br-sao, us-south, us-east

Got the following errors; the reruns were OK. Passed: 4, failed: 2.

networkType: OVNKubernetes, region eu-de:

failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "network" stage: failed to create cluster: failed to apply Terraform: exit status 1
Error: Action is not authorized.
  with module.cis.data.ibm_cis_domain.base_domain[0],
  on cis/main.tf line 5, in data "ibm_cis_domain" "base_domain":
  5: data "ibm_cis_domain" "base_domain" {

fips: true + networkType: OVNKubernetes, region us-south:

failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
Error: image r006-ffebee31-07ff-4639-bad3-5b4ec77ade81 is in an invalid state
  with ibm_is_instance.bootstrap_node,
  on main.tf line 11, in resource "ibm_is_instance" "bootstrap_node":
  11: resource "ibm_is_instance" "bootstrap_node" {

MayXuQQ commented 1 year ago

/label qe-approved

cjschaef commented 1 year ago

e2e-ibmcloud failure

 level=error msg=Error: [ERROR] Error retrieving service offering: Get "https://globalcatalog.cloud.ibm.com/api/v1/?include=%2A&q=cloud-object-storage": dial tcp [2606:4700::6812:7062]:443: connect: network is unreachable
level=error
level=error msg=  with ibm_resource_instance.cos,
level=error msg=  on main.tf line 22, in resource "ibm_resource_instance" "cos":
level=error msg=  22: resource "ibm_resource_instance" "cos" {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "network" stage: failed to create cluster: failed to apply Terraform: exit status 1
level=error
level=error msg=Error: [ERROR] Error retrieving service offering: Get "https://globalcatalog.cloud.ibm.com/api/v1/?include=%2A&q=cloud-object-storage": dial tcp [2606:4700::6812:7062]:443: connect: network is unreachable
level=error
level=error msg=  with ibm_resource_instance.cos,
level=error msg=  on main.tf line 22, in resource "ibm_resource_instance" "cos":
level=error msg=  22: resource "ibm_resource_instance" "cos" {

This looks like an IBM COS issue, unrelated to these changes. I'll check IBM Cloud status for COS, but expect we'll need to retrigger once COS is healthy, if it is down.

cjschaef commented 1 year ago

/retest

cjschaef commented 1 year ago

The current e2e-ibmcloud run is now in OCP Conformance testing; installation was successful:

time="2022-11-03T17:56:12Z" level=info msg="Install complete!"
time="2022-11-03T17:56:12Z" level=info msg="To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/tmp/installer/auth/kubeconfig'"
time="2022-11-03T17:56:12Z" level=info msg="Access the OpenShift web-console here: https://console-openshift-console.apps.ci-op-i4kbf6l8-6383f.ci-ibmcloud.devcluster.openshift.com"
time="2022-11-03T17:56:12Z" level=info msg="Login to the console with user: \"kubeadmin\", and password: REDACTED
time="2022-11-03T17:56:12Z" level=debug msg="Time elapsed per stage:"
time="2022-11-03T17:56:12Z" level=debug msg="           network: 8m9s"
time="2022-11-03T17:56:12Z" level=debug msg="         bootstrap: 4m22s"
time="2022-11-03T17:56:12Z" level=debug msg="            master: 11m28s"
time="2022-11-03T17:56:12Z" level=debug msg="Bootstrap Complete: 48s"
time="2022-11-03T17:56:12Z" level=debug msg=" Bootstrap Destroy: 4m53s"
time="2022-11-03T17:56:12Z" level=debug msg=" Cluster Operators: 12m45s"
time="2022-11-03T17:56:12Z" level=info msg="Time elapsed: 43m27s"

I will review machine-controller logs, but CI may not duplicate the OCPBUGS-1327 failure. I can provide log-bundles for test clusters with these code changes for additional confirmation on the fix, as necessary.

cjschaef commented 1 year ago

I don't see that a MachineReplacement was required in CI, which is okay; it's just unfortunate the failure wasn't replicated in this CI run. https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-provider-ibmcloud/10/pull-ci-openshift-machine-api-provider-ibmcloud-main-e2e-ibmcloud/1588214891066429440/artifacts/e2e-ibmcloud/gather-extra/artifacts/machines.json Perhaps we'll see it in the next build, when I update the PR per the review comments above.

elmiko commented 1 year ago

also worth saying, thank you for adding all the comments. it made reviewing this PR much easier :bow:

cjschaef commented 1 year ago

LOL, it was the only way I could get through writing this complex beast.

MayXuQQ commented 1 year ago

Tested 4.12.0-0.ci.test-2022-11-04-014628-ci-ln-pwvc30b-latest with the commit merging #10 (4bdfe803): pass.

cjschaef commented 1 year ago

e2e-ibmcloud failure

time="2022-11-03T22:25:48Z" level=info msg="Cluster operator insights Disabled is False with AsExpected: "
time="2022-11-03T22:25:48Z" level=info msg="Cluster operator insights SCAAvailable is False with Forbidden: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 403: {\"code\":\"ACCT-MGMT-11\",\"href\":\"/api/accounts_mgmt/v1/errors/11\",\"id\":\"11\",\"kind\":\"Error\",\"operation_id\":\"cb887a6b-f58f-485c-a5f0-2478677f7adf\",\"reason\":\"Account with ID 2DUeKzzTD9ngfsQ6YgkzdJn1jA4 denied access to perform create on Certificate with HTTP call POST /api/accounts_mgmt/v1/certificates\"}"

ingress                                    4.13.0-0.ci.test-2022-11-03-210153-ci-op-xiq2lqdi-latest   True        False         True       55m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)

I believe this ties back to slow IBM Cloud LB updates/configurations. Given time, the LB would likely be healthy and the ingress canary checks would pass, allowing the remaining operators to become healthy.

Re-kicking test

/retest

cjschaef commented 1 year ago

e2e-ibmcloud had some unexpected OCP Conformance failures, retrying

/retests

cjschaef commented 1 year ago

oops, not retests

/retest

openshift-ci[bot] commented 1 year ago

@cjschaef: all tests passed!

Full PR test history. Your PR dashboard.


JoelSpeed commented 1 year ago

/approve

openshift-ci[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/machine-api-provider-ibmcloud/blob/main/OWNERS)~~ [JoelSpeed]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

openshift-ci-robot commented 1 year ago

@cjschaef: All pull requests linked via external trackers have merged:

openshift/machine-api-provider-ibmcloud#10

Jira Issue OCPBUGS-1327 has been moved to the MODIFIED state.


cjschaef commented 1 year ago

/cherry-pick release-4.12

cjschaef commented 1 year ago

/cherrypick release-4.12

openshift-cherrypick-robot commented 1 year ago

@cjschaef: new pull request created: #11

In response to [this](https://github.com/openshift/machine-api-provider-ibmcloud/pull/10#issuecomment-1303962324): >/cherry-pick release-4.12

openshift-cherrypick-robot commented 1 year ago

@cjschaef: new pull request could not be created: failed to create pull request against openshift/machine-api-provider-ibmcloud#release-4.12 from head openshift-cherrypick-robot:cherry-pick-10-to-release-4.12: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for openshift-cherrypick-robot:cherry-pick-10-to-release-4.12."}],"documentation_url":"https://docs.github.com/rest/reference/pulls#create-a-pull-request"}

In response to [this](https://github.com/openshift/machine-api-provider-ibmcloud/pull/10#issuecomment-1303962703): >/cherrypick release-4.12
