
Alerting: Fail to query history state events from a Loki-backed instance #95335

Open kitos9112 opened 1 week ago

kitos9112 commented 1 week ago

What happened?

I have enabled the following feature flags/toggles in my Grafana instance to enhance the alert state history capabilities.

# Feature toggles in grafana.ini
[feature_toggles]
alertStateHistoryLokiSecondary = True
alertStateHistoryLokiPrimary = True
alertStateHistoryLokiOnly = True
alertingListViewV2 = True
alertingCentralAlertHistory = True
alertingNoNormalState = False

[unified_alerting.state_history]
enabled = True
backend = loki
loki_remote_url = http://127.0.254.1:3100
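
As a quick sanity check, the configured loki_remote_url can be probed directly from the Grafana host; a minimal sketch using Loki's standard readiness endpoint:

```bash
# Probe the Loki endpoint Grafana is configured to use; /ready returns
# HTTP 200 once Loki is able to serve queries.
curl -sS -o /dev/null -w 'HTTP %{http_code}\n' http://127.0.254.1:3100/ready
```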

This is my Loki configuration, which runs alongside Grafana on the same host. I use systemd as the process/service supervisor:

# cat /etc/loki/config.yml
target: all
auth_enabled: False
ballast_bytes: 0
server:
  http_listen_address: 127.0.254.1
  http_listen_port: 3100

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

ruler:
  storage:
    type: local
    local:
      directory: /var/lib/grafana/loki/rules
  rule_path: /var/lib/grafana/loki/rules_tmp
  ring:
    kvstore:
      store: inmemory
  enable_api: true
  enable_alertmanager_v2: false
  alertmanager_url: http://localhost:9093

schema_config:
  configs:
  - from: 2020-10-24
    store: tsdb
    object_store: filesystem
    schema: v13
    index:
      prefix: index_
      period: 24h

compactor:
  working_directory: /var/lib/grafana/loki/compactor
  compaction_interval: 30m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  delete_request_store: filesystem
  compactor_ring:
    kvstore:
      store: inmemory

limits_config:
  retention_period: 180d
  reject_old_samples: false
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
  split_queries_by_interval: 24h
  max_query_parallelism: 256
  ingestion_rate_mb: 250
  ingestion_burst_size_mb: 1000
  per_stream_rate_limit: 50MB
  per_stream_rate_limit_burst: 200MB

analytics:
  reporting_enabled: false

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/grafana/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/grafana/loki/chunks
      rules_directory: /var/lib/grafana/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
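
To confirm that alert state history is actually landing in this Loki instance, it can be queried directly. A rough sketch: the labels call shows which streams exist, and the {from="state-history"} selector is an assumption about how Grafana labels these entries, so adjust it to whatever the labels call returns.

```bash
#!/usr/bin/env bash
# Inspect the Loki instance for Grafana's alert state history stream.
LOKI=http://127.0.254.1:3100

# Which label names exist at all?
curl -sS "$LOKI/loki/api/v1/labels" | jq .

# Pull a few recent entries from the assumed state-history stream
# (start/end are Unix epoch nanoseconds).
START="$(date -d '1 hour ago' +%s)000000000"
END="$(date +%s)000000000"
curl -sS -G "$LOKI/loki/api/v1/query_range" \
  --data-urlencode 'query={from="state-history"}' \
  --data-urlencode "start=$START" \
  --data-urlencode "end=$END" \
  --data-urlencode 'limit=10' | jq '.data.result | length'
```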

From time to time, after both the local Grafana server and the Loki instance have been running for a while (maybe a few days), I find that I can no longer retrieve the alert state history.

In the GUI:

Screenshot

And in the Grafana server logs:

Oct 24 10:54:18 grafana.internal.net grafana[90001]: logger=context userId=2 orgId=1 uname=admin t=2024-10-24T10:54:18.851403158Z level=error msg="Request Completed" method=GET path=/api/v1/rules/history status=500 remote_addr=10.124.50.1 time_ms=3 duration=3.887213ms size=67 referer="https://grafana.internal.net:3000/alerting/history?from=now-1h&to=now" handler=/api/v1/rules/history status_source=server
Oct 24 10:54:18 grafana.internal.net grafana[90001]: logger=context userId=2 orgId=1 uname=admin t=2024-10-24T10:54:18.998141Z level=error msg="ruleUID is required to query annotations" error="ruleUID is required to query annotations" remote_addr=10.124.50.1 traceID=
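
The same failure can be reproduced outside the UI by calling the endpoint from the log line above directly. A sketch, assuming basic auth as the admin user and that the handler accepts from/to as Unix timestamps plus an optional ruleUID (adjust if it expects milliseconds):

```bash
#!/usr/bin/env bash
# Call the state history endpoint that returns 500 in the UI.
GRAFANA=https://grafana.internal.net:3000
FROM="$(date -d '1 hour ago' +%s)"   # assumed: epoch seconds
TO="$(date +%s)"

# -k in case the internal certificate is self-signed.
curl -sSk -u admin -G "$GRAFANA/api/v1/rules/history" \
  --data-urlencode "from=$FROM" \
  --data-urlencode "to=$TO" \
  -w '\nHTTP %{http_code}\n'
```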

What did you expect to happen?

Grafana should be able to query all alert state history changes from Loki without issues.

Did this work before?

No. I might have some misconfiguration on one end, but I cannot pinpoint it.

How do we reproduce it?

  1. Set up a local Loki instance on a Linux host
  2. Set the feature toggles and unified_alerting.state_history settings as per my grafana.ini above, then restart Grafana
  3. Make sure everything works
  4. Wait for some time (a polling sketch to catch the moment it breaks follows this list)
  5. Querying the alert state history should now fail
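
To narrow down the "wait for some time" step, something like the following can poll the history endpoint and record when it first starts failing; a rough sketch with the same assumptions about auth and parameters as above, plus a hypothetical GRAFANA_PASSWORD environment variable:

```bash
#!/usr/bin/env bash
# Poll the state history API every 5 minutes and log the HTTP status,
# so the moment it flips from 200 to 500 is captured with a timestamp.
GRAFANA=https://grafana.internal.net:3000
while true; do
  code="$(curl -sSk -o /dev/null -w '%{http_code}' \
    -u "admin:${GRAFANA_PASSWORD}" -G "$GRAFANA/api/v1/rules/history" \
    --data-urlencode "from=$(date -d '1 hour ago' +%s)" \
    --data-urlencode "to=$(date +%s)")"
  echo "$(date -Is) /api/v1/rules/history -> HTTP $code"
  sleep 300
done
```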

Is the bug inside a dashboard panel?

No response

Environment (with versions)?

Grafana:
OS:
Browser:

Grafana platform?

A package manager (APT, YUM, BREW, etc.)

Datasource(s)?

No response

yuri-tceretian commented 1 week ago

According to this line:

Oct 24 10:54:18 grafana.internal.net grafana[90001]: logger=context userId=2 orgId=1 uname=admin t=2024-10-24T10:54:18.998141Z level=error msg="ruleUID is required to query annotations" error="ruleUID is required to query annotations" remote_addr=10.124.50.1 traceID=

you seem to be using the annotation backend: https://github.com/grafana/grafana/blob/acb051b3141da5ff668a370e6c2989ee056f16ce/pkg/services/ngalert/state/historian/annotation.go#L112-L116

What version of Grafana do you use?

kitos9112 commented 1 week ago

> according to this line
>
> Oct 24 10:54:18 grafana.internal.net grafana[90001]: logger=context userId=2 orgId=1 uname=admin t=2024-10-24T10:54:18.998141Z level=error msg="ruleUID is required to query annotations" error="ruleUID is required to query annotations" remote_addr=10.124.50.1 traceID=
>
> you seem to use an annotation backend https://github.com/grafana/grafana/blob/acb051b3141da5ff668a370e6c2989ee056f16ce/pkg/services/ngalert/state/historian/annotation.go#L112-L116
>
> What version of Grafana do you use?

I managed to replicate it in both v11.2.2 and v11.3.0.

yuri-tceretian commented 1 week ago

Unfortunately, I can't reproduce the problem locally. I run Grafana with the following docker-compose file:

docker-compose.yaml

```yaml
version: "3.4"
services:
  grafana:
    image: grafana/grafana:11.3.0
    ports:
      - "3001:3000"
    environment:
      GF_LOG_LEVEL: debug
      GF_LOG_MODE: "console"
      GF_FEATURE_TOGGLES_ENABLE: alertStateHistoryLokiSecondary,alertStateHistoryLokiPrimary,alertStateHistoryLokiOnly,alertingListViewV2,alertingCentralAlertHistory,alertingNoNormalState
      GF_UNIFIED_ALERTING_STATE_HISTORY_ENABLED: True
      GF_UNIFIED_ALERTING_STATE_HISTORY_BACKEND: loki
      GF_UNIFIED_ALERTING_STATE_HISTORY_LOKI_REMOTE_URL: http://127.0.254.1:3100
```
  1. I would appreciate it if you could provide exact reproduction steps so I can try again on my side.

> Wait for some time

I am not sure that time is the factor, unless something in your environment changes the Grafana configuration: Grafana does not have any fallback mechanism that would, for example, switch to the annotation backend when Loki is not available.

Anyway, I do not want to rule this factor out. Roughly how long do you wait until the issue starts occurring?

  1. To troubleshoot it further, can you check your logs for two messages?

    Forcing Annotation backend due to state history feature toggles

    and

    Coercing Loki to a secondary backend due to state history feature toggles
  2. Try enabling debug logs, and when the issue starts happening, check for messages that contain logger=ngalert.state.historian. Every such message should have the context value backend: if Grafana runs in Loki mode the value will be loki, otherwise it will be "annotations".
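
For reference, something along these lines should surface both checks from journald; a sketch that assumes the package install uses the usual grafana-server unit name:

```bash
#!/usr/bin/env bash
# 1) Which backend did Grafana select at startup?
journalctl -u grafana-server --since today \
  | grep -E 'Forcing Annotation backend|Coercing Loki to a secondary backend'

# 2) With debug logging enabled, which backend serves each state-history call?
journalctl -u grafana-server --since '1 hour ago' \
  | grep 'logger=ngalert.state.historian' \
  | grep -o 'backend=[a-z]*' | sort | uniq -c
```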

kitos9112 commented 1 week ago

@yuri-tceretian

It's a difficult one to reproduce because so far I have been unable to determine how long both daemons need to be running before it happens.

Also, when this occurs I see blackouts in the alert state history, which hints to me that Grafana also stops writing to Loki. After restarting Grafana, it works again.

I'll enable debug mode on both daemons and report back my findings.
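
In the meantime, when the next blackout happens I can check whether new entries are still reaching Loki at all; a rough sketch, with the same assumption about the {from="state-history"} label selector as in the earlier Loki sketch:

```bash
#!/usr/bin/env bash
# Count state-history entries Loki ingested over the last 15 minutes.
# A non-zero result during a blackout would suggest writes still work and only
# reads are broken; zero would suggest Grafana has stopped shipping entries.
LOKI=http://127.0.254.1:3100
curl -sS -G "$LOKI/loki/api/v1/query_range" \
  --data-urlencode 'query=count_over_time({from="state-history"}[15m])' \
  --data-urlencode "start=$(date -d '15 minutes ago' +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000" \
  | jq '.data.result'
```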