grafana / grafana-operator

An operator for Grafana that installs and manages Grafana instances, Dashboards and Datasources through Kubernetes/OpenShift CRs
https://grafana.github.io/grafana-operator/
Apache License 2.0

[Bug] Unable to upgrade from v5.6.0 on OpenShift #1438

Closed: melledouwsma closed this issue 6 months ago

melledouwsma commented 8 months ago

Describe the bug: There's no upgrade path available for OpenShift clusters currently running v5.6.0 of the operator. The change added in #1405 to skip version v5.6.1 is now causing an issue with the OLM update graph. Because v5.6.1 is the only version that replaces: v5.6.0, and that version should now be skipped, OLM has no upgrade path and shows the operator as AtLatestKnown instead of UpgradePending.

Version: v5.6.0

To Reproduce:

  1. Create a new namespace, for example grafana-upgrade-example
  2. Prepare the new namespace by creating an OperatorGroup:
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: example-operator-group
  namespace: grafana-upgrade-example
spec:
  targetNamespaces:
  - grafana-upgrade-example
  upgradeStrategy: Default
  3. Create a Subscription for v5.6.0 (with manual installPlanApproval to show the behavior):
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: grafana-operator
  namespace: grafana-upgrade-example
spec:
  channel: v5
  installPlanApproval: Manual
  name: grafana-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: grafana-operator.v5.6.0

  4. Approve the InstallPlan that OLM created and wait for the operator to be installed

  5. Check the Subscription to see that there are no new versions available:
bash-4.4 ~ $ oc get subscriptions grafana-operator -n grafana-upgrade-example -o json | jq ".status.installedCSV"
"grafana-operator.v5.6.0"
bash-4.4 ~ $ oc get subscriptions grafana-operator -n grafana-upgrade-example -o json | jq ".status.state"
"AtLatestKnown"
  6. Restarting the pods in the openshift-marketplace namespace makes no difference

Expected behavior: I would expect OLM to present v5.6.3 as an upgrade for this installed operator. When you repeat this example with startingCSV: grafana-operator.v5.6.1, you'll see exactly that behavior.

Suspect component/Location where the bug might be occurring: This is probably caused by https://github.com/grafana/grafana-operator/blob/04b8181d5ea2ccb4299fa934f1150cc127e0a5f5/bundle/manifests/grafana-operator.clusterserviceversion.yaml#L434C1-L436C30, where v5.6.1 is set to be skipped and there is no alternative upgrade path from v5.6.0. This document has more info on skipping updates.

Screenshots: [image: installed-grafana-operator]


NissesSenap commented 8 months ago

@melledouwsma this is a known issue that we can't solve due to limitations of OLM. For more info see: https://github.com/grafana/grafana-operator/issues/1399

To work around this, either:

If you have the operator from the community-operators catalog, you can try restarting (deleting) the pods of the openshift-marketplace/community-operators deployment. That usually does the trick for me.

or else:

Delete the old operator deployment.

melledouwsma commented 8 months ago

Thank you. I am aware of #1399 and I created this new issue because this is a related but different issue. In #1399 the operator is upgraded from v5.6.0 to v5.6.1 and the upgrade fails because of the added labels on the operator deployment. In that case, removing the operator deployment is indeed a perfect workaround.

This issue is for installations that still have v5.6.0 and have not upgraded to v5.6.1. Those installations no longer have an upgrade path to v5.6.3. Restarting the community-operators pod from the marketplace or removing the old operator deployment makes no difference. The operator catalog no longer contains a new version that replaces v5.6.0, so OLM will continue to report that v5.6.0 is the latest version. The situation is specific to clusters that currently have v5.6.0 installed; there is a valid upgrade path from v5.6.1 onwards. This issue is also described in this comment by @ginokok1996.

This issue can be solved by drafting a new release that replaces v5.6.0 in the community-operators catalog or, as a workaround, by removing the CSV and Subscription and then recreating the Subscription with any version above v5.6.0.

ginokok1996 commented 8 months ago

Indeed, I encountered the same issue.

Restarting the community operator pods or the OLM pods doesn't work. The community operator pods are restarted at certain intervals anyway to stay up to date.

There just doesn't seem to be an upgrade path from v5.6.0 to any newer version. You would now have to remove the operator and install at least version v5.6.3 for it to function normally again.

NissesSenap commented 8 months ago

As explained in the other issue, there is nothing we can do from the maintainers' point of view. We are not allowed to make any updates to existing versions published through the community or Red Hat providers. We followed the workaround suggested by the Red Hat maintainers and added a skip flag to the newer versions in OLM, but it seems like it doesn't work.

Furthermore, we have created an upstream issue about this: https://github.com/operator-framework/operator-lifecycle-manager/issues/3176. I'm not a Red Hat employee and I'm not an OCP customer, so I have no way of asking Red Hat to prioritize this issue. But I would love for you to reach out to your sales representative, point to this issue, and ask the OLM maintainers to come up with a solution.

So instead of doing an uninstall, removing the CSV and Subscription sounds like the best way forward.

melledouwsma commented 8 months ago

Hi @NissesSenap, thanks for explaining. The fix suggested by the maintainers of community-operators did sort of work: it marked v5.6.1 as a release that should be skipped. The broken upgrade path is caused by the replaces: grafana-operator.v5.6.2 in the same CSV. When skipping a release, that replaces: is usually filled with the version before the skipped one. For example, when releasing 5.6.2, you'd mark 5.6.1 as skipped and 5.6.2 as the direct replacement of 5.6.0.

It has been a while since I worked with OLM in this much detail, but it should be possible to submit a new release that sorts this out and restores the upgrade path for v5.6.0. I'll look into that and create a Pull Request, but I'd like to run some local tests first to make sure it doesn't create new issues.
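To make the interaction between replaces: and skips: described above concrete, here is a minimal Python sketch of a simplified upgrade-graph model. It is not OLM's actual resolver. The rule that a release named in a skips: list is never offered as an upgrade target is taken from the behavior reported in this thread, and the channel contents are an assumed reconstruction: only the v5.6.0 to v5.6.1 edge and the replaces:/skips: pair on the newest CSV are described here, and treating that CSV as v5.6.3 is an assumption. The names Entry and upgrade_candidates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    """One release in a channel, reduced to the fields that shape the graph."""
    name: str
    replaces: Optional[str] = None
    skips: Tuple[str, ...] = ()


def upgrade_candidates(installed: str, channel: List[Entry]) -> List[str]:
    """Return the releases this simplified model would offer for `installed`."""
    # Assumption taken from this thread: a release named in any skips:
    # list is never offered as an upgrade target.
    skipped = {name for entry in channel for name in entry.skips}
    return [
        entry.name
        for entry in channel
        if entry.name not in skipped
        and (entry.replaces == installed or installed in entry.skips)
    ]


# Assumed channel contents (see the lead-in for what is actually stated).
channel = [
    Entry("grafana-operator.v5.6.0"),
    Entry("grafana-operator.v5.6.1", replaces="grafana-operator.v5.6.0"),
    Entry("grafana-operator.v5.6.2", replaces="grafana-operator.v5.6.1"),
    Entry(
        "grafana-operator.v5.6.3",
        replaces="grafana-operator.v5.6.2",
        skips=("grafana-operator.v5.6.1",),
    ),
]

stranded = upgrade_candidates("grafana-operator.v5.6.0", channel)
from_561 = upgrade_candidates("grafana-operator.v5.6.1", channel)
print("from v5.6.0:", stranded)
print("from v5.6.1:", from_561)
```

In this model the only release that replaces v5.6.0 is itself skipped, so the candidate list for v5.6.0 is empty, while v5.6.1 still has successors. The sketch only lists candidates; it does not model which one OLM would pick.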

NissesSenap commented 8 months ago

Hi @melledouwsma, if there is a good way to sort this out, that would be great, and we would be eternally grateful. To have something concrete to talk about, it's probably easiest if you create a PR in https://github.com/k8s-operatorhub/community-operators/tree/main/operators/grafana-operator. You can just tag me in the PR and link it in this issue, and I will take a look.

Just remember that you can't update any existing releases. Life would be much easier if that were possible, but apparently it's not an option.

I will reopen this issue so we can discuss this more easily.

NissesSenap commented 8 months ago

Hey @melledouwsma, have you had any time to look into this?

melledouwsma commented 8 months ago

Hey @NissesSenap, I ran some tests last week by creating a local CatalogSource and experimenting with the different options for telling OLM about a new release. My first attempt fixed 5.6.0 but unfortunately broke the upgrade path for the more recent versions. I have some more ideas and some time later this week, so expect a new update in a couple of days.

melledouwsma commented 8 months ago

As mentioned before, the metadata becomes immutable once released and we cannot change it. The operator uses "replaces" mode, where the upgrade graph is created by explicitly specifying one older release in a replaces: attribute on the new release.

The upgrade path is only broken for clusters that are still on v5.6.0. This can be restored by creating a new release with the following attributes in the CSV:

  replaces: grafana-operator.v5.6.0
  skips:
    - grafana-operator.v5.6.1
    - grafana-operator.v5.6.2
    - grafana-operator.v5.6.3
    - grafana-operator.v5.7.0
  version: 5.7.1

This marks the new release as an upgrade for v5.6.0 while still allowing all other versions to upgrade. I did some tests by locally building a CatalogSource and trying the different versions on an OpenShift cluster. The cluster reports "Upgrade available" for all installed versions, including 5.6.0.

However, because v5.6.1 through v5.7.0 are in the skips: block, the only available upgrade is directly to v5.7.1. For example, the upgrade from v5.6.2 to v5.6.3 is no longer offered by OLM; the cluster will only offer an upgrade to v5.7.1. This only affects the upgrade path. New installations with a Subscription that contains, for example, startingCSV: grafana-operator.v5.6.3 are still possible.
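The effect of the skips: block proposed above can be sketched with a small Python model. This is a simplified illustration, not OLM's resolver: it assumes that a release named in any skips: list is never offered as an upgrade target, which matches the behavior observed in the tests described here. Only the v5.7.1 entry comes from the proposed CSV; the other entries are an assumed reconstruction of the existing channel, and the helper name offered is hypothetical.

```python
# Simplified model, not OLM's resolver.
# Each release maps to (replaces, skips).
P = "grafana-operator."

channel = {
    # Assumed reconstruction of the existing channel.
    P + "v5.6.0": (None, []),
    P + "v5.6.1": (P + "v5.6.0", []),
    P + "v5.6.2": (P + "v5.6.1", []),
    P + "v5.6.3": (P + "v5.6.2", [P + "v5.6.1"]),
    P + "v5.7.0": (P + "v5.6.3", []),
    # Proposed release, taken from the CSV attributes above.
    P + "v5.7.1": (
        P + "v5.6.0",
        [P + "v5.6.1", P + "v5.6.2", P + "v5.6.3", P + "v5.7.0"],
    ),
}


def offered(installed, channel):
    """List the releases this model would offer as upgrades for `installed`."""
    # Assumption: anything named in a skips: list is never an upgrade target.
    skipped = {name for _, skips in channel.values() for name in skips}
    return [
        name
        for name, (replaces, skips) in channel.items()
        if name not in skipped and (replaces == installed or installed in skips)
    ]


offers = {name: offered(name, channel) for name in channel}
for name, targets in offers.items():
    print(name, "->", targets)
```

Under these assumptions every installed version, including v5.6.0, is offered exactly one target, v5.7.1, and intermediate hops such as v5.6.2 to v5.6.3 drop out of the graph.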

This is something to think about, I suppose. If you'd like to continue this route, I'm happy to produce a PR containing the changes to the CSV for a future new release.

NissesSenap commented 8 months ago

Hi @melledouwsma, thanks a lot for your work on this! From the operator's point of view, going directly from 5.6.0 to the latest release is not an issue.

Could you create a PR with this change in our repo? Then I can cut a new release in OLM. I'm at KubeCon next week, so I won't be able to do it then, but I can also ask one of the other maintainers to fix it.
