[Bug] growth in alert_rule_version table

yaskinny commented 1 month ago

Describe the Bug

The operator appears to cause unexpected growth in the alert_rule_version table. I haven't investigated the root cause in depth, but the table grows even when no alert rules are updated. For example, with the re-evaluation interval for alerts set to 10 minutes, 500 new records are added to the table every 10 minutes. I haven't diffed the records to add more context, but the growth in row count is obvious.

Additionally, when I delete a grafanaalertrule Custom Resource (CR) from the cluster, a large number of records are removed from this table. How many depends on how long the rule has existed, since multiple records are added for that specific grafanaalertrule every 10 minutes. After I stopped the operator, the table stopped growing.

I haven't updated to the latest version yet because I haven't found any mention of this issue in the release notes or in the repository's issue tracker.

Version

v5.9.1

To Reproduce

Create alerts.
Set the evaluation interval to X minutes.
Check the count of records in the alert_rule_version table after each interval.

(I'm using PostgreSQL 16 as the database.)
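For anyone trying to reproduce this, the row-count check can be scripted. The following is a minimal sketch, not something taken from the operator or from Grafana: it assumes the alert_rule_version table has a `rule_uid` column, and the helper names `versions_per_rule` and `growth` are made up for illustration. It works with any Python DB-API connection (for example one opened with a PostgreSQL driver); the demo uses an in-memory SQLite table only so that it runs without a Grafana database.

```python
import sqlite3

# Assumption: alert_rule_version has a rule_uid column identifying the rule.
COUNT_SQL = """
SELECT rule_uid, COUNT(*) AS versions
FROM alert_rule_version
GROUP BY rule_uid
ORDER BY versions DESC, rule_uid
"""


def versions_per_rule(conn):
    """Return [(rule_uid, version_count), ...] for a DB-API connection."""
    cur = conn.cursor()
    cur.execute(COUNT_SQL)
    rows = [(uid, int(count)) for uid, count in cur.fetchall()]
    cur.close()
    return rows


def growth(before, after):
    """Return {rule_uid: rows_added} for rules whose count changed."""
    old = dict(before)
    return {
        uid: count - old.get(uid, 0)
        for uid, count in after
        if count != old.get(uid, 0)
    }


def demo():
    """Simulate one sync interval against a stand-in SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE alert_rule_version "
        "(id INTEGER PRIMARY KEY, rule_uid TEXT, version INTEGER)"
    )
    conn.executemany(
        "INSERT INTO alert_rule_version (rule_uid, version) VALUES (?, ?)",
        [("a", 1), ("a", 2), ("b", 1)],
    )
    before = versions_per_rule(conn)
    # Pretend one interval passed and rule "a" gained a version.
    conn.execute(
        "INSERT INTO alert_rule_version (rule_uid, version) VALUES ('a', 3)"
    )
    after = versions_per_rule(conn)
    conn.close()
    return before, after, growth(before, after)
```

Taking one snapshot before an interval and one after it, then passing both to `growth`, shows which rules gained rows and how many.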

pb82 commented 1 month ago

@yaskinny Could this be versioning applied by Grafana (like it does with dashboards)? In this case, it's not an Operator issue. Or does this not happen when not using the Grafana Operator?

yaskinny commented 1 month ago

@pb82 I'm not sure whether this is the operator's fault or Grafana's.

I haven't had time to dig deeper and find the root cause yet, but the obvious thing is that as soon as I stop the operator, the table stops growing.

I suspect that the operator sends Grafana a field in the alert that makes Grafana think the alert has been updated and is newer, which causes a new record in the table. I'm not sure what that field is or where it should be handled. Either the operator should derive that field from the rule state instead of sending something random each time, or Grafana is not comparing the field correctly.
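One way to test this hypothesis would be to export two consecutive versions of the same rule and diff them field by field. The sketch below is only an illustration: `diff_paths` is a hypothetical helper, and the two payloads are made-up values, not real records from the table. It assumes both versions have already been loaded as nested dicts and lists (for example from JSON).

```python
_MISSING = object()  # sentinel for keys present on only one side


def diff_paths(old, new, prefix=""):
    """Yield (path, old_value, new_value) for every leaf that differs."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(set(old) | set(new)):
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from diff_paths(
                old.get(key, _MISSING), new.get(key, _MISSING), path
            )
    elif isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        for index, (left, right) in enumerate(zip(old, new)):
            yield from diff_paths(left, right, f"{prefix}[{index}]")
    elif old != new:
        yield (prefix, old, new)


# Hypothetical payloads: the differing values are invented for the example.
version_1 = {
    "title": "RabbitmqMemoryLimitMetrics",
    "for": "5m0s",
    "data": [{"refId": "A", "model": {"intervalMs": 25000}}],
}
version_2 = {
    "title": "RabbitmqMemoryLimitMetrics",
    "for": "5m",
    "data": [{"refId": "A", "model": {"intervalMs": 1000}}],
}

for path, old_value, new_value in diff_paths(version_1, version_2):
    print(f"{path}: {old_value!r} -> {new_value!r}")
```

If two consecutive versions of an untouched rule produce no differences at all, that would instead suggest the comparison on the Grafana side creates a new version regardless of content.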

If I get time, I'll investigate more and share the results with you.

Here is a sample alert that I'm using:

  - annotations:
      description: Rabbitmq node {{ index $labels "instance" }} has {{ $values.A.value |
        humanize }}
      summary: Rabbitmq Memory Limit
    condition: A
    execErrState: KeepLast
    for: 5m0s
    labels:
      severity: critical
      team: sre
    noDataState: OK
    title: RabbitmqMemoryLimitMetrics
    uid: ec0c9410a4c0af1ccf58cb23249de30d4addbb5b
    data:
    - datasourceUid: t-metrics
      model:
        datasource:
          type: prometheus
          uid: t-metrics
        editorMode: code
        expr: >-
          1 - (rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes ) < 0.25
        instant: true
        intervalMs: 25000
        legendFormat: __auto
        maxDataPoints: 43200
        range: false
        refId: A
      refId: A
      relativeTimeRange:
        from: 600

theSuess commented 3 weeks ago

I'll try to reproduce the issue this week. If this is the case for all alerts, it should definitely be fixed soon.

DrDJIng commented 3 weeks ago

I don't think this is an issue with the operator, but with Grafana itself. We have had similar issues with this table when using sidecar provisioning:

Grafana Issue

I never had the time to deep dive into the Grafana code to find the issue, but my gut says there are some logic issues when comparing alerts to their old versions, causing unbounded growth.

We are seeing the same table growth now that we've switched to Operator, though much, much slower growth.

Our admittedly bad solution is an automated truncation on the table itself.
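As a gentler alternative to truncating the whole table, a cleanup job could keep only the newest few versions per rule. This is a different technique from the truncation described above, sketched under assumptions: it presumes the table has `id`, `rule_uid` and `version` columns, `prune_versions` is a hypothetical helper name, and whether Grafana tolerates rows being removed this way is untested here. It should work with any DB-API connection whose database supports window functions; try it on a copy of the database first.

```python
import sqlite3


def prune_versions(conn, keep=10):
    """Delete all but the newest `keep` versions of each rule.

    Assumes alert_rule_version has id, rule_uid and version columns.
    Returns the number of rows deleted.
    """
    keep = int(keep)  # validated integer, safe to embed in the SQL text
    if keep < 1:
        raise ValueError("keep must be at least 1")
    sql = f"""
    DELETE FROM alert_rule_version
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY rule_uid ORDER BY version DESC
                   ) AS rn
            FROM alert_rule_version
        ) AS ranked
        WHERE rn > {keep}
    )
    """
    cur = conn.cursor()
    cur.execute(sql)
    deleted = cur.rowcount
    cur.close()
    conn.commit()
    return deleted


def demo():
    """Run the pruning against a stand-in in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE alert_rule_version "
        "(id INTEGER PRIMARY KEY, rule_uid TEXT, version INTEGER)"
    )
    rows = [("a", v) for v in range(1, 6)] + [("b", 1), ("b", 2)]
    conn.executemany(
        "INSERT INTO alert_rule_version (rule_uid, version) VALUES (?, ?)", rows
    )
    deleted = prune_versions(conn, keep=3)
    remaining = conn.execute(
        "SELECT rule_uid, version FROM alert_rule_version "
        "ORDER BY rule_uid, version"
    ).fetchall()
    conn.close()
    return deleted, remaining
```

Keeping a bounded history per rule preserves recent versions while still capping the table size, which a full truncation does not.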

redisded commented 6 hours ago

Hello, just to say we have the exact same problem here, using Argo CD and grafana-operator. I've commented on the Grafana issue, as this seems more related to Grafana itself than to the operator, but I can provide more information on our installation or perform further tests if that would help reproduce the issue.

grafana / grafana-operator

[Bug] growth in alert_rule_version table #1639