trustification / trustify

Apache License 2.0
10 stars 19 forks source link

Latest release deployment is broken after OCP changes #928

Closed jcrossley3 closed 4 weeks ago

jcrossley3 commented 4 weeks ago

I don't know for sure if @ctron 's recent configuration changes to our OCP cluster are the cause of the error, but the deploy job in the latest-release workflow hasn't succeeded ever since. It fails with the following error:

+ operator-sdk run bundle ghcr.io/trustification/trustify-operator-bundle:latest --namespace trustify-latest --timeout 15m
time="2024-10-17T10:31:35Z" level=info msg="Creating a File-Based Catalog of the bundle \"ghcr.io/trustification/trustify-operator-bundle:latest\""
time="2024-10-17T10:31:35Z" level=info msg="Generated a valid File-Based Catalog"
time="2024-10-17T10:31:39Z" level=info msg="Created registry pod: ghcr-io-trustification-trustify-operator-bundle-latest"
time="2024-10-17T10:31:39Z" level=info msg="Created CatalogSource: trustify-operator-catalog"
time="2024-10-17T10:31:39Z" level=info msg="OperatorGroup \"operator-sdk-og\" created"
time="2024-10-17T10:31:39Z" level=info msg="Created Subscription: trustify-operator-v1-0-0-snapshot-sub"
time="2024-10-17T10:46:27Z" level=fatal msg="Failed to run bundle: install plan is not available for the subscription trustify-operator-v1-0-0-snapshot-sub: Get \"[https://api.cluster.trustification.rocks:6443/apis/operators.coreos.com/v1alpha1/namespaces/trustify-latest/subscriptions/trustify-operator-v1-0-0-snapshot-sub\](https://api.cluster.trustification.rocks:6443/apis/operators.coreos.com/v1alpha1/namespaces/trustify-latest/subscriptions/trustify-operator-v1-0-0-snapshot-sub/)": context deadline exceeded\n"

I tried updating to the latest operator-sdk release, but it didn't help.

I'm hoping @carlosthe19916 might offer some guidance on how to diagnose the problem.

carlosthe19916 commented 4 weeks ago

@jcrossley3 I'll investigate tomorrow. Thanks for sharing the log. I'll see what changed recently for it to fail now

ctron commented 4 weeks ago

The cluster was converted into a multi-arch cluster. I am not sure what that means internally. Some nodes are now arm64 nodes. There's some information about multi-arch clusters here: https://docs.openshift.com/container-platform/4.16/post_installation_configuration/configuring-multi-arch-compute-machines/multi-architecture-compute-managing.html

For trustification (v1) I had to amend the Helm charts in a way that they only deploy pods on amd64, as we don't build and publish multi-arch images for v1.

jcrossley3 commented 4 weeks ago

Is it easy enough to revert the multi-arch conversion? If the deploy job then works, at least we could confirm the correlation.

ctron commented 4 weeks ago

IIRC it was impossible.

ctron commented 4 weeks ago

Yea, not possible:

Migration from a multi-architecture payload to a single-architecture payload is not supported. Once a cluster has transitioned to using a multi-architecture payload, it can no longer accept a single-architecture update payload.

carlosthe19916 commented 4 weeks ago

https://github.com/trustification/trustify/actions/workflows/latest-release.yaml is working again. I made the operator to create arm64 images for the bundle and apparently it solved the problem. This is the PR that made the arm64 possible for the bundles in the operator repo https://github.com/trustification/trustify-operator/pull/36

jcrossley3 commented 4 weeks ago

I'd love to know how you figured that out, but I'm afraid I may lack the expertise required to understand it.

carlosthe19916 commented 4 weeks ago

I'd love to know how you figured that out, but I'm afraid I may lack the expertise required to understand it.

I think it was just being lucky. As soon as I started reading this issue today, @ctron already mentioned the "multi-arch" change in the OCP cluster (which to me would be almost impossible to track as I know almost nothing about that OCP instance).

jcrossley3 commented 4 weeks ago

I think you're being humble, but as they say in the casino, luck counts!