grafana / grafana-operator

An operator for Grafana that installs and manages Grafana instances, Dashboards and Datasources through Kubernetes/OpenShift CRs
https://grafana.github.io/grafana-operator/
Apache License 2.0

[Bug] Alert Rule Group failed to be applied for 1 out of 1 instances: folder with uid not found #1618

Closed: zamanh closed this issue 1 month ago

zamanh commented 1 month ago

Describe the bug

Version v5.10.0

To Reproduce
Steps to reproduce the behavior:

  1. Create some alert rules (GrafanaAlertRuleGroup resources)
  2. Deploy them

Expected behavior
All declared alert rules should appear in the Grafana UI.

Suspect component/Location where the bug might be occurring
Unknown

Screenshots
[two screenshots attached]

Runtime (please complete the following information):

theSuess commented 1 month ago

Can you verify that a folder with this UID exists in the Grafana instance? If so, does restarting the operator pod fix the issue?

zamanh commented 1 month ago

> Can you verify that a folder with this UID exists in the Grafana instance? If so, does restarting the operator pod fix the issue?

The folder exists: [two screenshots attached]

However, the alert still fails to apply with the same error described in the bug report.

I have tried restarting the operator, but there appears to be no difference.

zamanh commented 1 month ago

From what my team members and I can tell, the operator does not handle the case where the alerts are put into a folder that has the same name as its namespace. This is the UID of the folder (which is named after the namespace): [screenshot]

This is the error on one of the GrafanaAlertRuleGroups: [screenshot]

theSuess commented 1 month ago

Yeah, this is an interesting edge case. This is what happens:

  1. The folder is created through the dashboard resource and gets a random UID in the process
  2. The folder resource sees this folder and doesn't create a new one
  3. The alert rule group references the folder resource and tries to get its UID. As the folder resource did not create the folder, the UID does not match and you get the resulting error.

There is a way to resolve this in the upcoming update:

  1. Create a folder resource for the namespace
  2. Reference the folder in the dashboard using folderRef (this is not yet released)
  3. Reference the same folder in the alert rule group

This way, the folder resource has full control over the UID, and matching will work as expected.
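For illustration, here is a rough sketch of that layout. The resource names, instance selector labels, and the example rule are placeholders, and folderRef on the dashboard requires the upcoming release:

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaFolder
metadata:
  name: my-namespace
  namespace: my-namespace
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  # Folder title shown in Grafana; the folder resource now owns this folder's UID
  title: my-namespace
---
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: example-dashboard
  namespace: my-namespace
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  # Reference the GrafanaFolder CR by name instead of creating a folder by title (new in the upcoming release)
  folderRef: my-namespace
  json: |
    {
      "title": "Example dashboard",
      "panels": []
    }
---
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaAlertRuleGroup
metadata:
  name: example-alerts
  namespace: my-namespace
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  # Same folder reference, so the UID lookup resolves against the folder created above
  folderRef: my-namespace
  interval: 5m
  rules:
    - title: Example rule   # placeholder rule, replace with your own
      uid: example-rule
      condition: A
      for: 5m
      noDataState: NoData
      execErrState: Error
      data:
        - refId: A
          datasourceUid: __expr__
          relativeTimeRange:
            from: 600
            to: 0
          model:
            type: math
            expression: "1 > 0"
            refId: A
```

The key point is that both the dashboard and the alert rule group point at the same GrafanaFolder resource, so the UID assigned by the folder resource is the one everything resolves against.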

We'll release a new version soon - would love to hear back once you've updated and tried the new approach.

somebody-nobody commented 1 month ago

I'm also experiencing similar behaviour. If you delete the ARG (kubectl delete grafanaalertrulegroup) and apply it again, it is able to find the folder's UID and is applied successfully. Looking forward to a fix.

theSuess commented 1 month ago

The fix for this has been released in v5.11 - see my previous comment on how to resolve this in the new version

Baarsgaard commented 1 month ago

Hi @theSuess, I love that this was released; it flew under my radar while I was working with the 5.10 version of the chart and Operator image. I tried to upgrade the operator image and chart from 5.10 to 5.11, but the operator did not replace the CRDs in the cluster.

In the end I had to uninstall all of my Grafana resources and have the Operator reinstall the CRDs before I could make use of the new folderRef field. Is there a better way for me to go about this in the future in case you do another update to an existing CRD (without changing the API version)?

theSuess commented 1 month ago

Sadly, Helm does not support CRD updates. The way to upgrade CRD definitions when using Helm is documented here: https://grafana.github.io/grafana-operator/docs/installation/helm/#upgrading

In short, we release CRD manifests alongside every release which you can apply directly

kubectl apply --server-side --force-conflicts -f https://github.com/grafana/grafana-operator/releases/download/v5.11.0/crds.yaml

Baarsgaard commented 1 month ago

Thank you! I was under the impression the CRDs were created by the Operator, which caused a fair amount of confusion on my end. Flux has now been updated with the relevant configs to support CRD updates: https://github.com/fluxcd/flux2/issues/3953#issuecomment-1578953884
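For anyone else using Flux, a rough sketch of the relevant HelmRelease settings (the release, namespace, chart, and source names below are placeholders; the important fields are install.crds and upgrade.crds, which is what the linked issue comment describes):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: grafana-operator
  namespace: grafana-operator
spec:
  interval: 10m
  chart:
    spec:
      chart: grafana-operator
      version: "v5.11.0"
      sourceRef:
        kind: HelmRepository
        name: grafana          # placeholder source pointing at the operator chart
  install:
    crds: CreateReplace        # install CRDs on first install
  upgrade:
    crds: CreateReplace        # replace CRDs on chart upgrades instead of skipping them
```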