Closed cjschaef closed 1 year ago
@cjschaef: This pull request references Jira Issue OCPBUGS-1327, which is valid. The bug has been moved to the POST state.
Requesting review from QA contact: /cc @MayXuQQ
The bug has been updated to refer to the pull request using the external bug tracker.
Tried with pre-merge build registry.build05.ci.openshift.org/ci-ln-rif5icb/release:latest. Tested with:
networkType: OVNKubernetes and OpenShiftSDN
fips: true and false
publish: Internal and External
region: eu-de, br-sao, us-south, us-east
Got the following error; rerun is OK. Pass: 4, Failed: 2.
/label qe-approved
e2e-ibmcloud failure
level=error msg=Error: [ERROR] Error retrieving service offering: Get "https://globalcatalog.cloud.ibm.com/api/v1/?include=%2A&q=cloud-object-storage": dial tcp [2606:4700::6812:7062]:443: connect: network is unreachable
level=error
level=error msg= with ibm_resource_instance.cos,
level=error msg= on main.tf line 22, in resource "ibm_resource_instance" "cos":
level=error msg= 22: resource "ibm_resource_instance" "cos" {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "network" stage: failed to create cluster: failed to apply Terraform: exit status 1
level=error
level=error msg=Error: [ERROR] Error retrieving service offering: Get "https://globalcatalog.cloud.ibm.com/api/v1/?include=%2A&q=cloud-object-storage": dial tcp [2606:4700::6812:7062]:443: connect: network is unreachable
level=error
level=error msg= with ibm_resource_instance.cos,
level=error msg= on main.tf line 22, in resource "ibm_resource_instance" "cos":
level=error msg= 22: resource "ibm_resource_instance" "cos" {
This looks like an IBM COS issue, unrelated to these changes. I'll check IBM Cloud status for COS, but expect we'll need to retrigger once COS is healthy, if it is down.
/retest
The current e2e-ibmcloud run is in OCP Conformance testing now; installation was successful.
time="2022-11-03T17:56:12Z" level=info msg="Install complete!"
time="2022-11-03T17:56:12Z" level=info msg="To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/tmp/installer/auth/kubeconfig'"
time="2022-11-03T17:56:12Z" level=info msg="Access the OpenShift web-console here: https://console-openshift-console.apps.ci-op-i4kbf6l8-6383f.ci-ibmcloud.devcluster.openshift.com"
time="2022-11-03T17:56:12Z" level=info msg="Login to the console with user: \"kubeadmin\", and password: REDACTED
time="2022-11-03T17:56:12Z" level=debug msg="Time elapsed per stage:"
time="2022-11-03T17:56:12Z" level=debug msg=" network: 8m9s"
time="2022-11-03T17:56:12Z" level=debug msg=" bootstrap: 4m22s"
time="2022-11-03T17:56:12Z" level=debug msg=" master: 11m28s"
time="2022-11-03T17:56:12Z" level=debug msg="Bootstrap Complete: 48s"
time="2022-11-03T17:56:12Z" level=debug msg=" Bootstrap Destroy: 4m53s"
time="2022-11-03T17:56:12Z" level=debug msg=" Cluster Operators: 12m45s"
time="2022-11-03T17:56:12Z" level=info msg="Time elapsed: 43m27s"
I will review machine-controller logs, but CI may not duplicate the OCPBUGS-1327 failure. I can provide log-bundles for test clusters with these code changes for additional confirmation on the fix, as necessary.
I don't see that a MachineReplacement was required in this CI run, which is okay; it's just unfortunate the failure wasn't replicated.
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-provider-ibmcloud/10/pull-ci-openshift-machine-api-provider-ibmcloud-main-e2e-ibmcloud/1588214891066429440/artifacts/e2e-ibmcloud/gather-extra/artifacts/machines.json
Perhaps we'll see it in the next build, when I update the PR per the review comments above.
Also worth saying: thank you for adding all the comments. It made reviewing this PR much easier :bow:
LOL, it was the only way I could get through writing this complex beast.
Tested 4.12.0-0.ci.test-2022-11-04-014628-ci-ln-pwvc30b-latest with the commit merged (#10, 4bdfe803): pass.
e2e-ibmcloud failure
time="2022-11-03T22:25:48Z" level=info msg="Cluster operator insights Disabled is False with AsExpected: "
time="2022-11-03T22:25:48Z" level=info msg="Cluster operator insights SCAAvailable is False with Forbidden: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 403: {\"code\":\"ACCT-MGMT-11\",\"href\":\"/api/accounts_mgmt/v1/errors/11\",\"id\":\"11\",\"kind\":\"Error\",\"operation_id\":\"cb887a6b-f58f-485c-a5f0-2478677f7adf\",\"reason\":\"Account with ID 2DUeKzzTD9ngfsQ6YgkzdJn1jA4 denied access to perform create on Certificate with HTTP call POST /api/accounts_mgmt/v1/certificates\"}"
ingress 4.13.0-0.ci.test-2022-11-03-210153-ci-op-xiq2lqdi-latest True False True 55m The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
I believe this ties back to slow IBM Cloud LB updates/configurations. Given time, the LB would likely be healthy and the ingress canary checks would pass, allowing the remaining operators to become healthy.
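The "given time" reasoning above amounts to polling a health check until a deadline. A minimal sketch of that pattern follows; `waitWithTimeout` and the `check` callback are hypothetical names for illustration and are not taken from the installer or the ingress operator.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitWithTimeout polls check every interval until it returns true or the
// timeout elapses. Hypothetical helper; the real operators use their own
// reconcile loops.
func waitWithTimeout(timeout, interval time.Duration, check func() bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		if check() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // deadline reached before the check passed
		case <-ticker.C:
			// next polling round
		}
	}
}

func main() {
	attempts := 0
	// Stand-in for "is the LB healthy yet?": succeeds on the third poll.
	lbHealthy := func() bool {
		attempts++
		return attempts >= 3
	}
	err := waitWithTimeout(time.Second, 10*time.Millisecond, lbHealthy)
	fmt.Println("error:", err, "attempts:", attempts)
}
```

If the load balancer needs longer than the CI timeout allows, the check never passes in time and the run fails even though the cluster would eventually become healthy.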
Re-kicking test
/retest
e2e-ibmcloud had some unexpected OCP Conformance failures, retrying
/retests
oops, not retests
/retest
@cjschaef: all tests passed!
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: JoelSpeed
@cjschaef: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-1327 has been moved to the MODIFIED state.
/cherry-pick release-4.12
/cherrypick release-4.12
@cjschaef: new pull request created: #11
@cjschaef: new pull request could not be created: failed to create pull request against openshift/machine-api-provider-ibmcloud#release-4.12 from head openshift-cherrypick-robot:cherry-pick-10-to-release-4.12: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for openshift-cherrypick-robot:cherry-pick-10-to-release-4.12."}],"documentation_url":"https://docs.github.com/rest/reference/pulls#create-a-pull-request"}
During initial cluster bring-up, some component can prevent the IBM Cloud VSI from reaching the MCS, causing the VSI to get stuck when its DHCP lease expires. Attempt to identify these cases and perform a single replacement of the machine to mitigate the issue.
Related: https://issues.redhat.com/browse/OCPBUGS-1327
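
The "single replacement" idea in the description can be sketched roughly as below. This is not the code from this PR: `machineState`, `replacedAnnotation`, and the age-based "stuck" heuristic are assumptions made for illustration. The only point shown is that a replacement is attempted at most once per machine.

```go
package main

import (
	"fmt"
	"time"
)

// replacedAnnotation is a hypothetical marker recording that a replacement
// was already attempted, so it is never attempted a second time.
const replacedAnnotation = "example.openshift.io/replacement-attempted"

// machineState is a simplified, hypothetical view of a machine. The real
// provider works with Machine API objects and IBM Cloud VSI data.
type machineState struct {
	Annotations map[string]string
	HasNode     bool // true once the instance has joined the cluster as a node
}

// shouldReplace reports whether a machine looks stuck (assumed here to mean
// "older than stuckAfter without a node") and has not been replaced before.
func shouldReplace(m machineState, age, stuckAfter time.Duration) bool {
	if m.HasNode {
		return false
	}
	if _, attempted := m.Annotations[replacedAnnotation]; attempted {
		return false
	}
	return age >= stuckAfter
}

// markReplaced records that the single allowed replacement was attempted.
func markReplaced(m *machineState) {
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	m.Annotations[replacedAnnotation] = "true"
}

func main() {
	m := machineState{}
	fmt.Println(shouldReplace(m, 30*time.Minute, 20*time.Minute))
	markReplaced(&m)
	fmt.Println(shouldReplace(m, 30*time.Minute, 20*time.Minute))
}
```

Recording the attempt on the machine itself, rather than in controller memory, would let the limit survive a controller restart; whether the PR does this is not stated in the thread.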