yaskinny opened this issue 1 month ago
@yaskinny Could this be versioning applied by Grafana (as it does with dashboards)? In that case, it's not an Operator issue. Or does this not happen when not using the Grafana Operator?
@pb82 I'm not sure whether this issue is the operator's fault or Grafana's.
I haven't had time to dig deeper and find the root cause yet, but one thing is obvious: as soon as I stop the operator, the table stops growing.
My suspicion is that there's a field in the alerts the operator sends to Grafana, and that field makes Grafana think the alert has been updated, so a new record is written to the table. I'm not sure what that field is or where it should be handled (maybe it's the operator, which should derive that field from the rule state rather than something random each time, or maybe it's Grafana, which isn't comparing a field correctly).
If I get time, I'll investigate more and share the results with you.
Here's a sample alert that I'm using:
```yaml
- annotations:
    description: Rabbitmq node {{ index $labels "instance" }} has {{ $values.A.value |
      humanize }}
    summary: Rabbitmq Memory Limit
  condition: A
  execErrState: KeepLast
  for: 5m0s
  labels:
    severity: critical
    team: sre
  noDataState: OK
  title: RabbitmqMemoryLimitMetrics
  uid: ec0c9410a4c0af1ccf58cb23249de30d4addbb5b
  data:
    - datasourceUid: t-metrics
      model:
        datasource:
          type: prometheus
          uid: t-metrics
        editorMode: code
        expr: >-
          1 - (rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes) < 0.25
        instant: true
        intervalMs: 25000
        legendFormat: __auto
        maxDataPoints: 43200
        range: false
        refId: A
      refId: A
      relativeTimeRange:
        from: 600
```
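In case it helps with digging in, one quick way to see which field actually changes between consecutive versions is to compare the two newest rows for a single rule directly in the database. This is just a sketch assuming direct access to Grafana's Postgres database; the `rule_uid` value is the uid from the sample above, and the column names come from the `alert_rule_version` schema, which may differ between Grafana versions:

```sql
-- Fetch the two most recent versions of one rule and diff them by eye
-- to spot the field that changes on every reconcile.
SELECT *
FROM alert_rule_version
WHERE rule_uid = 'ec0c9410a4c0af1ccf58cb23249de30d4addbb5b'
ORDER BY id DESC
LIMIT 2;
```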
I'll try to reproduce the issue this week. If this is the case for all alerts, it should definitely be fixed soon.
I don't think this is an issue with the operator, but with Grafana itself. We have had similar issues with this table when using sidecar provisioning.
I never had the time to dive deep into the Grafana code to find the cause, but my gut says there's a logic issue in how alerts are compared to their previous versions, causing unbounded growth.
We are seeing the same table growth now that we've switched to the operator, though much slower.
Our admittedly crude workaround is an automated truncation of the table itself, as sketched below.
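A minimal sketch of such a cleanup, assuming direct access to Grafana's Postgres database and that the `alert_rule_version` schema matches recent Grafana versions (the retention count of 10 is arbitrary, and the `rule_org_id`/`rule_uid` column names should be verified against your Grafana version before running this):

```sql
-- Keep only the 10 most recent versions per rule and delete the rest.
-- Run on a schedule (e.g. a cron job) against Grafana's database.
DELETE FROM alert_rule_version
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY rule_org_id, rule_uid
                   ORDER BY id DESC
               ) AS rn
        FROM alert_rule_version
    ) AS ranked
    WHERE rn > 10
);
```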
Hello, just to say we have the exact same problem here, using argocd and grafana-operator. I've commented on the Grafana issue, as this seems more related to Grafana itself than to the operator, but I can provide more information about our installation or run further tests if that helps reproduce the issue.
Describe the Bug
The operator appears to cause unexpected growth in the `alert_rule_version` table. I haven't investigated the root cause deeply, but the size of this table increases even without any updates. For example, I have set the re-evaluation interval for alerts to 10 minutes, and every 10 minutes 500 new records are added to the table. I didn't check the diff between records to add more context, but the growth in the number of records is obvious. Additionally, when I delete a `grafanaalertrule` Custom Resource (CR) from the cluster, a large number of records are removed from this table, depending on how long the rule has existed, since multiple records are added for that specific `grafanaalertrule` every 10 minutes. After stopping the operator, the growth in the table ceased.
I haven't updated to the latest version yet because I haven't found any mention of this issue in the release notes or in the repository's issue tracker.
Version
v5.9.1
To Reproduce
(I'm using PG 16 for the database.)
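A simple way to observe the growth is to re-run a per-rule count after each evaluation interval (same assumption of direct Postgres access; column names may vary by Grafana version):

```sql
-- Count versions per rule; re-running this every few minutes shows the
-- per-rule record count climbing even though no rule was updated.
SELECT rule_uid, COUNT(*) AS version_count
FROM alert_rule_version
GROUP BY rule_uid
ORDER BY version_count DESC
LIMIT 20;
```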