kiwix / operations

Kiwix Kubernetes Cluster
http://charts.k8s.kiwix.org/

Grafana: Intermittent DatasourceError on a Loki-based alert #124

Open benoit74 opened 9 months ago

benoit74 commented 9 months ago

See https://grafana.com/orgs/kiwixorg/tickets/106950 for details

benoit74 commented 9 months ago

Grafana ticket content:

We have an alert which has been in place for many days and is suddenly raising a DatasourceError in a very intermittent fashion. Each time, the alert is cleared 5 minutes after the initial error. The alert is based on Loki.

We had 3 occurrences yesterday (27 Sept). It had not happened before.

You will find below the details about those occurrences as received in Slack. Times indicated below are UTC.

We are on a Pro account, with everything managed by Grafana (Grafana, Loki, Prometheus). Logs and metrics come from a k8s cluster with the agent deployed in flow mode.

Could you provide guidance about how to solve this issue?

At 10:31:35 AM

[FIRING:1] DatasourceError GrafanaCloud (grafanacloud-logs A HTTP 500)
**Firing**
Value: [no value]
Labels:
- alertname = DatasourceError
- datasource_uid = grafanacloud-logs
- grafana_folder = GrafanaCloud
- ref_id = A
- rulename = HTTP 500
Annotations:
- Error = [sse.dataQueryError] failed to execute query [A]: Get "https://logs-prod-012.grafana.net/loki/api/v1/query?direction=backward&query=count_over_time%28%7Bpod%21~%22nginx.%2B%22%2C+pod%3D~%22.%2B%22%7D+%7C%3D+%60HTTP%2F1.1%22+500%60+%7C+keep+cluster%2Cnamespace%2Cpod+%5B5m%5D%29&time=1695810660000000000": EOF
- summary = HTTP 500 error detected
Source: https://kiwixorg.grafana.net/alerting/grafana/c3f54e61-48bb-44ce-a0cf-116cc820ec32/view?orgId=1
Silence: https://kiwixorg.grafana.net/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DDatasourceError&matcher=datasource_uid%3Dgrafanacloud-logs&matcher=grafana_folder%3DGrafanaCloud&matcher=ref_id%3DA&matcher=rulename%3DHTTP+500&orgId=1

At 2:03:35 PM

[FIRING:1] DatasourceError GrafanaCloud (grafanacloud-logs A HTTP 500)
**Firing**
Value: [no value]
Labels:
- alertname = DatasourceError
- datasource_uid = grafanacloud-logs
- grafana_folder = GrafanaCloud
- ref_id = A
- rulename = HTTP 500
Annotations:
- Error = [sse.dataQueryError] failed to execute query [A]: Get "https://logs-prod-012.grafana.net/loki/api/v1/query?direction=backward&query=count_over_time%28%7Bpod%21~%22nginx.%2B%22%2C+pod%3D~%22.%2B%22%7D+%7C%3D+%60HTTP%2F1.1%22+500%60+%7C+keep+cluster%2Cnamespace%2Cpod+%5B5m%5D%29&time=1695823380000000000": EOF
- summary = HTTP 500 error detected
Source: https://kiwixorg.grafana.net/alerting/grafana/c3f54e61-48bb-44ce-a0cf-116cc820ec32/view?orgId=1
Silence: https://kiwixorg.grafana.net/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DDatasourceError&matcher=datasource_uid%3Dgrafanacloud-logs&matcher=grafana_folder%3DGrafanaCloud&matcher=ref_id%3DA&matcher=rulename%3DHTTP+500&orgId=1

At 7:29:35 PM

**Firing**
Value: [no value]
Labels:
- alertname = DatasourceError
- datasource_uid = grafanacloud-logs
- grafana_folder = GrafanaCloud
- ref_id = A
- rulename = HTTP 500
Annotations:
- Error = [sse.dataQueryError] failed to execute query [A]: Get "https://logs-prod-012.grafana.net/loki/api/v1/query?direction=backward&query=count_over_time%28%7Bpod%21~%22nginx.%2B%22%2C+pod%3D~%22.%2B%22%7D+%7C%3D+%60HTTP%2F1.1%22+500%60+%7C+keep+cluster%2Cnamespace%2Cpod+%5B5m%5D%29&time=1695842940000000000": EOF
- summary = HTTP 500 error detected
Source: https://kiwixorg.grafana.net/alerting/grafana/c3f54e61-48bb-44ce-a0cf-116cc820ec32/view?orgId=1
Silence: https://kiwixorg.grafana.net/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DDatasourceError&matcher=datasource_uid%3Dgrafanacloud-logs&matcher=grafana_folder%3DGrafanaCloud&matcher=ref_id%3DA&matcher=rulename%3DHTTP+500&orgId=1
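The `Error` annotations above embed the failing Loki request as a URL-encoded string, which makes the query and its evaluation time hard to read. The following is a minimal sketch, using only the Python standard library, that decodes those URLs into the LogQL expression and a UTC timestamp. The names `BASE_URL`, `TIMES_NS` and `decode_alert_url` are illustrative and not part of any Grafana tooling; the URL and the three `time` values are copied from the alerts above.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# Request URL copied from the alert annotations; only the `time` value differs
# between the three occurrences.
BASE_URL = (
    "https://logs-prod-012.grafana.net/loki/api/v1/query?direction=backward"
    "&query=count_over_time%28%7Bpod%21~%22nginx.%2B%22%2C+pod%3D~%22.%2B%22%7D"
    "+%7C%3D+%60HTTP%2F1.1%22+500%60+%7C+keep+cluster%2Cnamespace%2Cpod+%5B5m%5D%29"
    "&time="
)
# `time` values (nanoseconds since the Unix epoch) from the three alerts.
TIMES_NS = ["1695810660000000000", "1695823380000000000", "1695842940000000000"]


def decode_alert_url(url):
    """Return (LogQL query, evaluation time in UTC) for a Loki query URL."""
    params = parse_qs(urlsplit(url).query)  # percent-decodes and turns '+' into ' '
    query = params["query"][0]
    time_ns = int(params["time"][0])
    evaluated_at = datetime.fromtimestamp(time_ns // 1_000_000_000, tz=timezone.utc)
    return query, evaluated_at


if __name__ == "__main__":
    for t in TIMES_NS:
        query, evaluated_at = decode_alert_url(BASE_URL + t)
        print(evaluated_at.isoformat(), query)
```

Decoded this way, all three requests carry the same expression, `count_over_time({pod!~"nginx.+", pod=~".+"} |= `HTTP/1.1" 500` | keep cluster,namespace,pod [5m])`, evaluated at 10:31:00, 14:03:00 and 19:29:00 UTC on 27 Sept. That is consistent with the Slack notifications, which arrived 35 seconds after each evaluation.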