openshift / machine-api-provider-ibmcloud


OCPCLOUD-2264: IBMCloud: Add boot volume key support #27

Closed. cjschaef closed this 10 months ago

cjschaef commented 11 months ago

Added a boot volume and encryption key field to the IBMCloudMachineProviderSpec, to allow machines to specify a boot volume encryption key. Also added support for specifying the boot volume encryption key during machine creation.

Related: https://issues.redhat.com//browse/OCPCLOUD-2263
Related: https://issues.redhat.com//browse/OCPCLOUD-2264
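
For illustration, a minimal sketch of how such fields could be modeled on the provider spec. The type and field names below are assumptions based on the description above, not the merged openshift/api definitions:

```go
// Sketch only: names are assumptions based on the PR description,
// not the merged openshift/api types.
package v1

// IBMCloudMachineBootVolume describes the boot volume attached to a machine.
type IBMCloudMachineBootVolume struct {
	// EncryptionKey identifies the customer-managed key used to encrypt
	// the boot volume. When empty, provider-managed encryption is used.
	EncryptionKey string `json:"encryptionKey,omitempty"`
}

// IBMCloudMachineProviderSpec (excerpt) gains a BootVolume field so a
// machine can request an encrypted boot volume at creation time.
type IBMCloudMachineProviderSpec struct {
	// ...existing fields elided...

	// BootVolume holds boot volume options, including the encryption key.
	BootVolume IBMCloudMachineBootVolume `json:"bootVolume,omitempty"`
}
```

Presumably the key itself would be referenced from a MachineSet's providerSpec by its IBM Cloud CRN.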

cjschaef commented 11 months ago

/retest

cjschaef commented 11 months ago

These test failures appear to be consistent across multiple components' CI tests, and aren't related to these PR changes. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-provider-ibmcloud/27/pull-ci-openshift-machine-api-provider-ibmcloud-main-e2e-ibmcloud/1717541136786001920

This PR should be ready for review.

cjschaef commented 11 months ago

/retitle OCPCLOUD-2263: IBMCloud: Add boot volume key to config spec

openshift-ci-robot commented 11 months ago

@cjschaef: This pull request references OCPCLOUD-2263 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to [this](https://github.com/openshift/machine-api-provider-ibmcloud/pull/27):

> Added a boot volume and encryption key field to the IBMCloudMachineProviderSpec, to allow machines to specify a boot volume encryption key.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

cjschaef commented 11 months ago

/retest-required

cjschaef commented 11 months ago

/retest

openshift-ci-robot commented 11 months ago

@cjschaef: This pull request references OCPCLOUD-2263 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to [this](https://github.com/openshift/machine-api-provider-ibmcloud/pull/27):

> Added a boot volume and encryption key field to the IBMCloudMachineProviderSpec, to allow machines to specify a boot volume encryption key. And added support to specify boot volume encryption key during machine creation.
>
> Related: https://issues.redhat.com//browse/OCPCLOUD-2263
> Related: https://issues.redhat.com//browse/OCPCLOUD-2264

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

cjschaef commented 11 months ago

/retitle OCPCLOUD-2264: IBMCloud: Add boot volume key support

openshift-ci-robot commented 11 months ago

@cjschaef: This pull request references OCPCLOUD-2264 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to [this](https://github.com/openshift/machine-api-provider-ibmcloud/pull/27):

> Added a boot volume and encryption key field to the IBMCloudMachineProviderSpec, to allow machines to specify a boot volume encryption key. And added support to specify boot volume encryption key during machine creation.
>
> Related: https://issues.redhat.com//browse/OCPCLOUD-2263
> Related: https://issues.redhat.com//browse/OCPCLOUD-2264

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

elmiko commented 11 months ago

/hold

the changes look ok to me, @cjschaef what's going on with the failed ibm test?

openshift-ci[bot] commented 11 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/machine-api-provider-ibmcloud/blob/main/OWNERS)~~ [elmiko]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

cjschaef commented 11 months ago

These results look similar to other test flakes I saw two weeks or so ago, so hopefully things are more stable now.

`the server is currently unable to handle the request`

/retest e2e-ibmcloud

openshift-ci[bot] commented 11 months ago

@cjschaef: The /retest command does not accept any targets. The following commands are available to trigger required jobs:

The following commands are available to trigger optional jobs:

Use /test all to run all jobs.

In response to [this](https://github.com/openshift/machine-api-provider-ibmcloud/pull/27#issuecomment-1819703441):

> These results look similar to other test flakes I saw two weeks or so ago, so hopefully things are more stable now
> `the server is currently unable to handle the request`
>
> /retest e2e-ibmcloud

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

cjschaef commented 11 months ago

/test e2e-ibmcloud

jeffnowicki commented 11 months ago

/test e2e-ibmcloud

cjschaef commented 11 months ago

Looks like there have been some leaked VPC resources; I'll have to investigate that further:

`Creating a new VPC will put the user over quota. Allocated: 20, Requested: 1, Quota: 20`

JoelSpeed commented 11 months ago

/lgtm

cjschaef commented 10 months ago

I'll try to fix and get updated ASAP.

jeffnowicki commented 10 months ago

/test e2e-ibmcloud

cjschaef commented 10 months ago

I think we have another issue with the MAPI release version popping up in the e2e-ibmcloud test, preventing MAPI from running. MAPI logs:

panic: semver: Parse(0.0.0-0565398): Numeric PreRelease version must not contain leading zeroes "0565398"

elmiko commented 10 months ago

@cjschaef i have a feeling that we are hitting a weird corner case in the build based on the use of `git describe` in the makefile for the `VERSION` variable. i created some cards to address this, see https://issues.redhat.com/browse/OCPCLOUD-2227 for more details.
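
For illustration, a minimal reproduction of that corner case, assuming the blang/semver library (which the panic text suggests): an abbreviated git SHA that happens to be all digits with a leading zero is rejected as a numeric pre-release identifier, while a SHA carrying a non-digit prefix parses fine.

```go
package main

import (
	"fmt"

	"github.com/blang/semver/v4"
)

func main() {
	// "0565398" is an abbreviated git SHA made up entirely of digits, so the
	// parser treats it as a numeric pre-release identifier; numeric
	// identifiers must not have leading zeroes, so parsing fails.
	if _, err := semver.Parse("0.0.0-0565398"); err != nil {
		fmt.Println("parse failed:", err)
	}

	// The same SHA with a non-digit prefix is an alphanumeric identifier,
	// which the semver spec allows, so this parses successfully.
	v, err := semver.Parse("0.0.0-g0565398")
	fmt.Println(v, err)
}
```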

jeffnowicki commented 10 months ago

@elmiko are we able to merge this PR despite the issue you noted?

elmiko commented 10 months ago

@jeffnowicki it depends on whether this issue is just a flake, but it should be fairly quick to fix the makefile.

cjschaef commented 10 months ago

Hopefully, https://github.com/openshift/machine-api-provider-ibmcloud/pull/29 will resolve the version panics. Will rebase after that merges.

cjschaef commented 10 months ago

29 merged, rebasing.

cjschaef commented 10 months ago

e2e-ibmcloud got past the earlier version flake. The OCP Conformance results look unique; going to re-run since I am pretty sure they are unrelated to these MAPI changes.

/retest

jeffnowicki commented 10 months ago

/retest

cjschaef commented 10 months ago

While most look like flakes or commonly failing tests (the same as in other repos) compared with the last run, I am a little less sure about `events should not repeat pathologically for ns/openshift-oauth-apiserver`. Going to retest and also compare results with e2e-ibmcloud from another repo.

/retest

cjschaef commented 10 months ago

A test name overlap occurred:

level=error msg=Error: An A, AAAA, or CNAME record with that host already exists. For more details, refer to <https://developers.IBM.com/dns/manage-dns-records/troubleshooting/records-with-same-name/>

Going to try again.

/retest

elmiko commented 10 months ago

wonder if this is another clash? i'm not familiar with these error messages.

cjschaef commented 10 months ago

Let me take a look at the CI account to see if I can find out more info.

cjschaef commented 10 months ago

Looks like DNS Records have leaked in the CI. I'll see about cleaning them up, will have to follow up to determine what is allowing that to happen.

cjschaef commented 10 months ago

I completed some cleanup of DNS Records from prior to today (will check tomorrow's), but I think the majority likely leaked due to a bug in my cleanup automation (used to clean up CI failures during IPI deployments, likely from Infrastructure).

I have a fix to resolve that internally, but hopefully now the chances of a duplicate infraID will be low (around 25 records remaining from today).

/retest

cjschaef commented 10 months ago

Since I see the same initial failure

level=error msg=Error: An A, AAAA, or CNAME record with that host already exists. For more details, refer to <https://developers.ibm.com/dns/manage-dns-records/troubleshooting/records-with-same-name/>.
level=error
level=error msg=  with module.cis.ibm_cis_dns_record.kubernetes_api_internal[0],

which is followed by a second attempt, which fails because the cleanup of the first likely didn't complete fast enough

level=error msg=Error: BucketAlreadyExists: The requested bucket name is not available. The bucket namespace is shared by all users of the system. Please select a different name and try again.
level=error msg=    status code: 409, request id: 26e62cbd-30fe-4ab5-a4c5-849b554e28d8, host id: 
level=error
level=error msg=  with module.image.ibm_cos_bucket.images,

These have nothing to do with MAPI. I suspect that IBM Cloud CIS is having intermittent issues, as I do not see an existing DNS Record related to this failure, and the artifacts appear to be for install attempt 2, so I don't have many more details on what error occurred. I can retrigger, hoping CIS works, as I don't see any notifications for CIS currently. Tomorrow I can try running some local testing to confirm whether CIS is behaving normally.

/retest

cjschaef commented 10 months ago

I think these latest results look more like what I'd expect, with the monitor/poller failures being common and the other flakes (disruption tests) popping up on this round.

elmiko commented 10 months ago

thanks for the confirmation @cjschaef , i think the latest results look more like a flake as well. i'm happy to label this, but i'd like to run the tests again to see if we can get a good result.

/lgtm
/test e2e-ibmcloud

cjschaef commented 10 months ago

Is this (from the machine-controller logs) due to an issue with the image build? Something we may need to update in the Dockerfile (base image)?

/machine-controller-manager: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /machine-controller-manager)
/machine-controller-manager: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /machine-controller-manager)

Update

Rebasing, as changes were made to the Dockerfile: https://github.com/openshift/machine-api-provider-ibmcloud/commit/f7acd33fd76f28a6eadeac4025e8a1151036aa72

cjschaef commented 10 months ago

Same result, will have to wait for images to be fixed

/machine-controller-manager: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /machine-controller-manager)
/machine-controller-manager: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /machine-controller-manager)

cjschaef commented 10 months ago

/test e2e-ibmcloud

cjschaef commented 10 months ago

/retest

cjschaef commented 10 months ago

Still waiting on fix to base image

/machine-controller-manager: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /machine-controller-manager)
/machine-controller-manager: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /machine-controller-manager)

cjschaef commented 10 months ago

/retest

jeffnowicki commented 10 months ago

/retest

jeffnowicki commented 10 months ago

/retest

jeffnowicki commented 10 months ago

@elmiko @JoelSpeed can we progress and merge this PR despite the test failure (tied to the base image issue; the fix PR is up, but it's uncertain how soon it will merge)?

We need this PR merged so that Chris can rebase the installer PR.

elmiko commented 10 months ago

@jeffnowicki it seems like we are past the image failures, but it looks like there is some quota or permission problem on the ibm infra. is there anything to be concerned about there?

cjschaef commented 10 months ago

Something may be up with IBM Cloud COS, going to check. Other failures in the build are because it is retrying to create the COS instance/bucket, but it already exists. So, it sounds like something happened or is down with that service, or perhaps IAM, etc. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-provider-ibmcloud/27/pull-ci-openshift-machine-api-provider-ibmcloud-main-e2e-ibmcloud/1731667523503394816

level=error msg=Error: AccessDenied: Access Denied
level=error msg=    status code: 403, request id: ca5fcbba-27fe-4da6-885a-02cc01c6300a, host id: 
level=error
level=error msg=  with module.image.ibm_cos_bucket.images,
level=error msg=  on image/main.tf line 10, in resource "ibm_cos_bucket" "images":
level=error msg=  10: resource "ibm_cos_bucket" "images" {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failure applying terraform for "network" stage: error applying Terraform configs: failed to apply Terraform: exit status 1
level=error
level=error msg=Error: AccessDenied: Access Denied
level=error msg=    status code: 403, request id: ca5fcbba-27fe-4da6-885a-02cc01c6300a, host id: 
level=error
level=error msg=  with module.image.ibm_cos_bucket.images,
level=error msg=  on image/main.tf line 10, in resource "ibm_cos_bucket" "images":
level=error msg=  10: resource "ibm_cos_bucket" "images" {

Will see if it was a small blip or an outage we need to wait on.

Update

Things appear to be ongoing, I will have to monitor and wait to see when they resolve.

cjschaef commented 10 months ago

Things may be better now, will retrigger.

/retest

cjschaef commented 10 months ago

storage                                    4.15.0-0.ci.test-2023-12-05-184424-ci-op-7zl3z8fc-latest   False       True          False      64m     IBMVPCBlockCSIDriverOperatorCRAvailable: IBMBlockDriverControllerServiceControllerAvailable: Waiting for Deployment...
openshift-cluster-csi-drivers                      ibm-vpc-block-csi-controller-79f7bc8c49-74chn                        0/6     CrashLoopBackOff   73 (3m43s ago)   64m     10.128.2.7     ci-op-7zl3z8fc-6383f-ndcz7-worker-1-t8lnb   <none>           <none>
openshift-cluster-csi-drivers                      ibm-vpc-block-csi-driver-operator-87965ffc4-5gwvc                    1/1     Running            1 (57m ago)      64m     10.129.0.8     ci-op-7zl3z8fc-6383f-ndcz7-master-1         <none>           <none>
openshift-cluster-csi-drivers                      ibm-vpc-block-csi-node-6d6wz                                         0/3     CrashLoopBackOff   37 (2m8s ago)    52m     10.129.2.6     ci-op-7zl3z8fc-6383f-ndcz7-worker-2-txmtf   <none>           <none>
openshift-cluster-csi-drivers                      ibm-vpc-block-csi-node-nz72r                                         0/3     CrashLoopBackOff   37 (2m45s ago)   52m     10.128.2.4     ci-op-7zl3z8fc-6383f-ndcz7-worker-1-t8lnb   <none>           <none>
openshift-cluster-csi-drivers                      ibm-vpc-block-csi-node-rsvkp                                         0/3     CrashLoopBackOff   37 (2m25s ago)   52m     10.131.0.5     ci-op-7zl3z8fc-6383f-ndcz7-worker-3-54hgx   <none>           <none>
openshift-cluster-csi-drivers                      ibm-vpc-block-csi-node-t8hdj                                         0/3     CrashLoopBackOff   46 (74s ago)     64m     10.130.0.41    ci-op-7zl3z8fc-6383f-ndcz7-master-2         <none>           <none>
openshift-cluster-csi-drivers                      ibm-vpc-block-csi-node-vh5pz                                         0/3     CrashLoopBackOff   46 (61s ago)     64m     10.128.0.13    ci-op-7zl3z8fc-6383f-ndcz7-master-0         <none>           <none>
openshift-cluster-csi-drivers                      ibm-vpc-block-csi-node-vvphj                                         0/3     CrashLoopBackOff   44 (11s ago)     64m     10.129.0.13    ci-op-7zl3z8fc-6383f-ndcz7-master-1         <none>           <none>

Looks like a storage container is hitting the same image error (the csi-driver container in the ibm-vpc-block-csi-controller pod):

/bin/ibm-vpc-block-csi-driver: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /bin/ibm-vpc-block-csi-driver)
/bin/ibm-vpc-block-csi-driver: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /bin/ibm-vpc-block-csi-driver)

cjschaef commented 10 months ago

I can retry to see if the Storage image failure was a blip or will need attention too, though I don't know what will be required at this time.

/retest