
Ruler evaluation producing status 500 errors and inconsistent alerts #14441

Closed: LukoJy3D closed this issue 2 weeks ago

LukoJy3D commented 1 month ago

Describe the bug
When alerts are deployed to Loki, the ruler produces errors with status=500, and the alert results are completely inconsistent with what the same query returns when run manually in Grafana.

"log": "level=info ts=2024-10-09T19:46:01.141151674Z caller=engine.go:248 component=ruler evaluation_mode=local org_id=fake msg=\"executing query\" type=instant query=\"(sum(count_over_time({namespace=\\\"ai\\\"} |= \\\"check completed\\\"[10m])) < 30)\" query_hash=3077565986"

"log": "level=info ts=2024-10-09T19:46:01.228229623Z caller=metrics.go:217 component=ruler evaluation_mode=local org_id=fake latency=fast query=\"(sum(count_over_time({namespace=\\\"ai\\\"} |= \\\"check completed\\\"[10m])) < 30)\" query_hash=3077565986 query_type=metric range_type=instant length=0s start_delta=694.286852ms end_delta=694.286992ms step=0s duration=87.000739ms status=500 limit=0 returned_lines=0 throughput=4.1GB total_bytes=357MB total_bytes_structured_metadata=715kB lines_per_second=847349 total_lines=73720 post_filter_lines=48 total_entries=1 store_chunks_download_time=0s queue_time=0s splits=0 shards=0 query_referenced_structured_metadata=false pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=155.03µs cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s cache_result_query_length_served=0s ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=73 ingester_requests=8 ingester_chunk_head_bytes=2.5MB ingester_chunk_compressed_bytes=41MB ingester_chunk_decompressed_bytes=354MB ingester_post_filter_lines=48 congestion_control_latency=0s index_total_chunks=0 index_post_bloom_filter_chunks=0 index_bloom_filter_ratio=0.00 index_shard_resolver_duration=0s disable_pipeline_wrappers=false"

Ingesters do not seem to have unusually high latency (screenshot attached).

Alerts keep going into the pending state and sometimes even firing, even though the same query, when run manually, never goes below the threshold (screenshot attached).

To Reproduce
Steps to reproduce the behavior:

  1. Run the Loki Helm chart 6.16.0 in SSD mode.
  2. Deploy the following alert rule to Loki using https://github.com/grafana/cortex-rules-action:
     groups:
       - name: AiErrorAlerts
         rules:
           - alert: LowCheckCount
             expr: sum(count_over_time({namespace="ai"} |= `check completed` [10m])) < 30
             for: 10m
             labels:
               route: alerts-ai
               severity: medium
               source: loki
             annotations:
               summary: Low check completion rate.
               details: Checks completed per 10 minutes - `{{ $value }}`. Expected rate - `30`.
  3. Ruler config:
     rulerConfig:
       alertmanager_url: "https://<mimir_am>:8080/alertmanager/"
       enable_alertmanager_v2: true
       enable_sharding: true
       remote_write:
         enabled: true
         clients:
           local:
             url: "http://haproxy.prometheus.svc.cluster.local:8080/api/v1/push"
       storage:
         gcs:
           bucket_name: hostinger-loki-ruler
       wal:
         dir: /var/loki/wal

Expected behavior
Alerts are evaluated properly without randomly going into a pending/firing state.

Environment:

yapishu commented 3 weeks ago

I'm also experiencing frustrating issues with the ruler in the 6.18.0 chart (app version 3.2.0). The querier seems to evaluate queries coming from the ruler correctly (e.g. it prints the number of lines evaluated, matches, throughput, etc. in the debug logs), but it always logs status 500, even for queries that don't touch the logs at all (like vector(1) > 0), and alerting rules never seem to trigger, which I presume means the queries aren't returning results. This only applies to queries that come from the ruler, and only to alerting rules: if I create a recording rule with the same query, it works fine, and identical queries sent through the frontend (i.e. via curl) work perfectly. This has been driving me crazy for weeks.
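To illustrate the asymmetry, a minimal rule group along these lines (the group and rule names here are made up; only the vector(1) expressions come from my testing) behaves the same way for me: the alerting rule is evaluated with status 500 and never fires, while the recording rule works:

    groups:
      - name: ruler-smoke-test
        rules:
          # Alerting rule: evaluated by the ruler, logs status=500, never fires
          - alert: RulerSmokeTest
            expr: vector(1) > 0
            for: 1m
          # Recording rule with the same trivial expression works fine
          - record: ruler:smoke_test:vector_one
            expr: vector(1)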

OP seemed to indicate that switching from local to remote evaluation fixed his problem, but it did not for me. I switched my deployment to distributed mode and spent a couple of days playing with settings, trying to figure out where the breakdown was, but the querier never stopped printing 500 status codes, even when it seemed to be working normally. There was a single point in time when my queries started populating alerts (some combination of local rule evaluation and distributed mode), and I tried to preserve those settings in a SimpleScalable deployment, but it stopped working immediately.

gyoza commented 3 weeks ago

While migrating from loki-distributed to this chart (6.18.0, app version 3.2.0) in distributed mode, I have also seen status=500 on ruler alert evaluations with the following config.

rulerConfig:
    alertmanager_url: http://${ALERTMANAGER_URL}
    external_url: https://${EXTERNAL_URL}
    enable_alertmanager_v2: true
    storage:
      local:
        directory: /etc/loki/rules
      type: local

For the record, running the same query with logcli against the new Loki 3.2.0 gateway produces the desired results.

LukoJy3D commented 2 weeks ago

Most of my problems disappeared after changing the ruler evaluation mode to remote.

evaluation:
  mode: remote
  query_frontend:
    address: loki-read.loki.svc.cluster.local:9095
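For context, evaluation is part of the ruler configuration block, so in my Helm values it ends up roughly as below. This is only a sketch: the placement under rulerConfig mirrors the config from the issue description, the other ruler settings are elided, and the query frontend address will depend on your deployment.

    rulerConfig:
      # ... alertmanager_url, remote_write, storage, wal, etc. as shown earlier ...
      evaluation:
        mode: remote          # evaluate rule queries via the query frontend instead of locally in the ruler
        query_frontend:
          address: loki-read.loki.svc.cluster.local:9095   # gRPC address of the read path / query frontend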

Sadly, I needed to dig through the docs quite a lot to find this, as you would expect it to be covered under the ruler/alerting docs. That's why I opened a PR linked to this issue, so that this kind of information is more easily accessible in the alerting docs.

@JStickler I would appreciate it if you could review it 🙇

JStickler commented 2 weeks ago

@LukoJy3D I have taken a look at your PR; the problem is that I'm still learning about alerts and the ruler. I'm not sure how the switch from cortextool to lokitool in Loki 3.1 affects that particular action, but since you're on a very recent version of the Loki Helm chart, I will take your word for it that the cortex-rules-action works with your Loki version.

gyoza commented 2 weeks ago

For the record, I am running in distributed mode on the latest Helm chart (6.18.0), and coincidentally I came across this fix today and made the same change.

The ruler is now running and no longer returning status=500 on alerts, but the query frontend reports total_lines=0 for queries where source=ruler, while logcli gets total_lines=92220 running the same query as the alert directly against the gateway.

Any idea how to debug this? Nothing seems to be complaining about anything in any of the logs in question.

LukoJy3D commented 2 weeks ago

> @LukoJy3D I have taken a look at your PR; the problem is that I'm still learning about alerts and the ruler. I'm not sure how the switch from cortextool to lokitool in Loki 3.1 affects that particular action, but since you're on a very recent version of the Loki Helm chart, I will take your word for it that the cortex-rules-action works with your Loki version.

The Cortex CI action was just something we found very useful for deploying rules; it blended very nicely with our existing Prometheus alert deployment to Mimir using mimirtool in a similar CI fashion, so I just wanted to share it with others. It doesn't necessarily need to be part of that PR.

> The ruler is now running and no longer returning status=500 on alerts, but the query frontend reports total_lines=0 for queries where source=ruler, while logcli gets total_lines=92220 running the same query as the alert directly against the gateway.

Is there any chance your alerts are misconfigured and not matching any results? total_lines=0 can also show up when queries get sharded and individual subqueries return no data. Overall, I see total_lines=0 in my Loki logs as well, yet it does not seem to signal that something is broken.
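One way to narrow it down (just a sketch: the gateway address is a placeholder, and I'm reusing the expression from my alert above as an example) is to run the rule expression as an instant metric query with logcli, since that is what the ruler actually executes, and compare it with the range query you ran against the gateway:

    # Instant query, mirroring the ruler's evaluation (replace the address and expression with yours)
    logcli instant-query --addr=http://<loki-gateway> \
      'sum(count_over_time({namespace="ai"} |= `check completed` [10m]))'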

gyoza commented 2 weeks ago

> Is there any chance your alerts are misconfigured and not matching any results? total_lines=0 can also show up when queries get sharded and individual subqueries return no data. Overall, I see total_lines=0 in my Loki logs as well, yet it does not seem to signal that something is broken.

They are the same query but maybe I am missing something! I will focus some effort here! Thanks Luko!