nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Operators in test and prod clusters are getting hung up when attempting version upgrades #556

Closed dystewart closed 1 month ago

dystewart commented 2 months ago

In 537 we saw that a few operators were out of sync with updates. On closer investigation, the operator subscription statuses share a similar error message that looks like this:

constraints not satisfiable: subscription amq-broker-rhel8 exists, clusterserviceversion amq-broker-operator.v7.11.5-opr-1 exists and is not referenced by a subscription

This looks like a bug, as the operators seem to be falling off the upgrade path despite coming from a valid CSV.
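For triage, the subscription and the orphaned CSV can be pulled out of that message programmatically. The following is a minimal sketch, assuming the message always follows the wording quoted above; the pattern and the helper name `find_orphaned_csvs` are hypothetical and not part of OLM or `oc`.

```python
import re
from typing import List, Tuple

# Assumption: the pattern is derived only from the single message quoted in
# this issue; other OLM resolution messages may be worded differently.
ORPHANED_CSV = re.compile(
    r"subscription (?P<subscription>\S+) exists, "
    r"clusterserviceversion (?P<csv>\S+) exists and is not referenced by a subscription"
)


def find_orphaned_csvs(message: str) -> List[Tuple[str, str]]:
    """Return (subscription, csv) pairs found in a resolution error message."""
    return [(m["subscription"], m["csv"]) for m in ORPHANED_CSV.finditer(message)]


message = (
    "constraints not satisfiable: subscription amq-broker-rhel8 exists, "
    "clusterserviceversion amq-broker-operator.v7.11.5-opr-1 exists and is "
    "not referenced by a subscription"
)
print(find_orphaned_csvs(message))
```

A message that does not match the pattern yields an empty list, so the helper can be run over every subscription status without special-casing healthy operators.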

Going to look at OCP issues to see if anything has come up regarding this.

In the test cluster, uninstalling and reinstalling the operator is a simple fix, but we may not always want to uninstall the operator (thus destroying all its resources and CRDs).

dystewart commented 2 months ago

Also need to find out if this is happening in the obs and infra clusters.

joachimweyl commented 1 month ago

@dystewart what sprint are you hoping to work on this in, and do you have an estimate of how much effort this issue will take?

tssala23 commented 1 month ago

There is a similar issue in the Red Hat Customer Portal; however, it is marked "solution unverified". There's a workaround but no permanent solution. This issue is also from this time last year, so it is a little dated.

tssala23 commented 1 month ago

Here's another similar issue from December of last year. This one has a resolution which links to a much newer post:

Refer to Operator cannot be upgraded with the error "Cannot update: CatalogSource was removed" while the CatalogSource exists in OpenShift 4, and check if the Subscription contains a wrong startingCSV field. Or, in other cases, restarting the catalog-operator pod in the openshift-operator-lifecycle-manager namespace as shown below will resolve the problem.

tssala23 commented 1 month ago

Using the command from the diagnostic steps of the issue in my previous comment, it appears that the obs and infra clusters have the same issue.
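The same check can be repeated per cluster with a small script. This is a rough sketch and not the command from the Knowledgebase article: it assumes the input is the JSON list produced by something like `oc get subscriptions.operators.coreos.com -A -o json`, and that affected Subscriptions carry a status condition of type `ResolutionFailed`. Both the helper name `failed_subscriptions` and the sample data are hypothetical.

```python
import json


def failed_subscriptions(subs_json: str):
    """Return (namespace, name, message) for Subscriptions that report a
    resolution failure in their status conditions."""
    failures = []
    for item in json.loads(subs_json).get("items", []):
        meta = item.get("metadata", {})
        for cond in item.get("status", {}).get("conditions", []):
            # Assumption: OLM flags unsatisfiable constraints with this
            # condition type and a status of "True".
            if cond.get("type") == "ResolutionFailed" and cond.get("status") == "True":
                failures.append(
                    (meta.get("namespace"), meta.get("name"), cond.get("message", ""))
                )
    return failures


# Hypothetical sample shaped like the assumed `oc ... -o json` output.
sample = json.dumps({
    "items": [
        {
            "metadata": {"namespace": "ns-a", "name": "amq-broker-rhel8"},
            "status": {"conditions": [
                {"type": "ResolutionFailed", "status": "True",
                 "message": "constraints not satisfiable: ..."}
            ]},
        },
        {
            "metadata": {"namespace": "ns-b", "name": "healthy-operator"},
            "status": {"conditions": [
                {"type": "CatalogSourcesUnhealthy", "status": "False"}
            ]},
        },
    ]
})

for namespace, name, msg in failed_subscriptions(sample):
    print(f"{namespace}/{name}: {msg}")
```

Running it against the output saved from each cluster (prod, test, obs, infra) gives a quick side-by-side list of which subscriptions are stuck.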

tssala23 commented 1 month ago

The problem that we're having may also be operator-specific, as I found this issue related to the AMQ operator: https://access.redhat.com/solutions/7056241

tssala23 commented 1 month ago

The two clusters were having issues with different operators.

Prod, AMQ-Broker operator: the issue was with the deployment. After deleting the deployment and the CSV, the operator was able to upgrade successfully.

Test, NFD operator: the issue was with the CRs. A solution was found in the Red Hat Knowledgebase: https://access.redhat.com/solutions/7057312. The operator has upgraded successfully.