grafana / grafana-operator

An operator for Grafana that installs and manages Grafana instances, Dashboards and Datasources through Kubernetes/OpenShift CRs
https://grafana.github.io/grafana-operator/
Apache License 2.0

[Bug] "NoMatchingFolder" when GrafanaAlertRuleGroup created in the same helm/kubectl apply as the Grafana resource #1578

Closed by matthew-s-walker 2 months ago

matthew-s-walker commented 2 months ago

Describe the bug
We're trying to set up GrafanaAlertRuleGroup resources in the same Helm chart that creates the Grafana resource. When we install the chart, the GrafanaAlertRuleGroup goes into a "NoMatchingFolder" state until the alert resource is deleted and recreated. Setting "resyncPeriod" to low values does not get the resource out of this state in any reasonable time.

The issue is not reproducible when running against the "make start-kind" and "make run" setup (which runs a newer Kubernetes version than ours), but it does reproduce when pointing a local operator at our EKS cluster. I imagine the difference in behaviour could be due to latency, or to the fact that we have to modify the hosts file to point "grafana-service.testnamespace" at 127.0.0.1 and run kubectl port-forward so the operator can connect to Grafana.

Version
v5.9.1

To Reproduce
Steps to reproduce the behavior:

  1. kubectl create namespace testnamespace
  2. Apply the following manifest with kubectl apply -f:
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  namespace: testnamespace
  labels:
    dashboards: "grafana"
spec:
  config:
    log:
      mode: "console"
    auth:
      disable_login_form: "false"
    security:
      admin_user: root
      admin_password: secret

---

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaFolder
metadata:
  name: test-folder
  namespace: testnamespace
spec:
  resyncPeriod: 5s
  instanceSelector:
    matchLabels:
      dashboards: "grafana"

---

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaAlertRuleGroup
metadata:
  name: grafanaalertrulegroup-sample
  namespace: testnamespace
spec:
  resyncPeriod: 5s
  folderRef: test-folder
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  interval: 5m
  rules:
    - condition: B
      data:
        - datasourceUid: grafanacloud-demoinfra-prom
          model:
            datasource:
                type: prometheus
                uid: grafanacloud-demoinfra-prom
            editorMode: code
            expr: weather_temp_c{}
            instant: true
            intervalMs: 1000
            legendFormat: __auto
            maxDataPoints: 43200
            range: false
            refId: A
          refId: A
          relativeTimeRange:
            from: 600
        - datasourceUid: __expr__
          model:
            conditions:
                - evaluator:
                    params:
                        - 0
                    type: lt
                  operator:
                    type: and
                  query:
                    params:
                        - C
                  reducer:
                    params: []
                    type: last
                  type: query
            datasource:
                type: __expr__
                uid: __expr__
            expression: A
            intervalMs: 1000
            maxDataPoints: 43200
            refId: B
            type: threshold
          refId: B
          relativeTimeRange:
            from: 600
      execErrState: Error
      for: 5m0s
      noDataState: NoData
      title: Temperature below zero
      uid: 4843de5c-4f8a-4af0-9509-23526a04faf8
  3. See that the GrafanaAlertRuleGroup goes into the NoMatchingFolder state

Expected behavior
Either immediately or after a resyncPeriod, the rule group should be properly reconciled.

Suspect component/Location where the bug might be occurring
grafanaalertrulegroup_controller.go can be modified to retry finding the folder. This fixes the issue, but I'm not confident it is the best approach:

diff --git a/controllers/grafanaalertrulegroup_controller.go b/controllers/grafanaalertrulegroup_controller.go
index 9db2c0b0..1f12b60c 100644
--- a/controllers/grafanaalertrulegroup_controller.go
+++ b/controllers/grafanaalertrulegroup_controller.go
@@ -363,17 +363,26 @@ func (r *GrafanaAlertRuleGroupReconciler) GetFolderUID(ctx context.Context, grou
                return group.Spec.FolderUID
        }
        var folder grafanav1beta1.GrafanaFolder
-       err := r.Client.Get(ctx, client.ObjectKey{
-               Namespace: group.Namespace,
-               Name:      group.Spec.FolderRef,
-       }, &folder)
-       if err != nil {
-               if kuberr.IsNotFound(err) {
-                       setNoMatchingFolder(&group.Status.Conditions, group.Generation, "NotFound", fmt.Sprintf("Folder with name %s not found in namespace %s", group.Spec.FolderRef, group.Namespace))
-                       return ""
+       const maxRetries = 5
+       const sleepDuration = 5 * time.Second
+
+       retryCount := 0
+       var err error
+       for retryCount < maxRetries {
+               err = r.Client.Get(ctx, client.ObjectKey{
+                       Namespace: group.Namespace,
+                       Name:      group.Spec.FolderRef,
+               }, &folder)
+               if err == nil {
+                       return string(folder.UID)
                }
-               setNoMatchingFolder(&group.Status.Conditions, group.Generation, "ErrFetchingFolder", fmt.Sprintf("Failed to fetch folder: %s", err.Error()))
+               retryCount++
+               time.Sleep(sleepDuration)
+       }
+       if kuberr.IsNotFound(err) {
+               setNoMatchingFolder(&group.Status.Conditions, group.Generation, "NotFound", fmt.Sprintf("Folder with name %s not found in namespace %s", group.Spec.FolderRef, group.Namespace))
                return ""
        }
-       return string(folder.UID)
+       setNoMatchingFolder(&group.Status.Conditions, group.Generation, "ErrFetchingFolder", fmt.Sprintf("Failed to fetch folder: %s", err.Error()))
+       return ""
 }
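
For comparison, a non-blocking alternative in the usual controller-runtime style would be to propagate the lookup error and requeue, rather than sleeping inside GetFolderUID. The sketch below is illustrative only and is not the fix that landed in #1584; it reuses the reconciler type, setNoMatchingFolder, and the kuberr/client aliases from the diff above, while resolveFolderUID is a hypothetical helper name and the 30 second requeue delay is an arbitrary choice.

// Sketch only: assumes the types and helpers referenced in the diff above.
// resolveFolderUID returns the lookup error instead of swallowing it, so
// Reconcile can decide to requeue while the GrafanaFolder CR is still pending.
func (r *GrafanaAlertRuleGroupReconciler) resolveFolderUID(ctx context.Context, group *grafanav1beta1.GrafanaAlertRuleGroup) (string, error) {
        if group.Spec.FolderUID != "" {
                return group.Spec.FolderUID, nil
        }
        var folder grafanav1beta1.GrafanaFolder
        if err := r.Client.Get(ctx, client.ObjectKey{
                Namespace: group.Namespace,
                Name:      group.Spec.FolderRef,
        }, &folder); err != nil {
                return "", err
        }
        return string(folder.UID), nil
}

// Inside Reconcile, a folder that does not exist yet becomes a requeue rather
// than a terminal NoMatchingFolder condition:
//
//      uid, err := r.resolveFolderUID(ctx, &group)
//      if err != nil {
//              if kuberr.IsNotFound(err) {
//                      setNoMatchingFolder(&group.Status.Conditions, group.Generation, "NotFound",
//                              fmt.Sprintf("Folder with name %s not found in namespace %s", group.Spec.FolderRef, group.Namespace))
//                      return ctrl.Result{RequeueAfter: 30 * time.Second}, nil // retry without blocking a worker
//              }
//              return ctrl.Result{}, err // other errors get the workqueue's exponential backoff
//      }

This keeps reconcile workers free instead of blocking each one for up to 25 seconds per rule group, and the alert rule group recovers automatically once the folder has been reconciled.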

Screenshots
[screenshot attached to the original issue]


theSuess commented 2 months ago

Thanks for pointing this out! It seems to be an issue with error propagation. We'll implement a fix for this soon.

msvechla commented 2 months ago

We are running into the same issue; thanks for raising it here. If any support is required to resolve this, please let me know!

theSuess commented 2 months ago

Fixed in #1584 - will cut a release soon

msvechla commented 2 months ago

@theSuess is there an estimate for when the next release will happen? We are still blocked by this issue.

Thanks a lot!

theSuess commented 2 months ago

Hey, sorry for the late reply. I was at an offsite and did not have time to do the release. We're looking into finishing up nested folder support (only missing #1600 for the MVP) and will probably cut a release after that.