grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loki ruler returns incorrect results after an outage of Kubernetes Cluster #7046

Open · zyxbest opened this issue 2 years ago

zyxbest commented 2 years ago

Describe the bug: After an outage of the Kubernetes cluster, the recording rule run by the Loki ruler returns incorrect query results, different from the ones shown on the Grafana "Explore" page.

[screenshots]

To Reproduce: Steps to reproduce the behavior:

  1. A working distributed Loki cluster v2.5.0

  2. Configure a simple recording rule as an example:

    rules:
      - record: spotfire_log_count_1d
        expr: |
          count_over_time({job="wptiminglog"} |= "Open Analysis" [1d])

    The ruler runs this query against Loki and sends the result to Prometheus via remote write (a trimmed sketch of that remote-write config is shown right after this list).

  3. Run the raw query directly in Loki.

  4. Query the new recorded metric in Prometheus.

  5. Compare the results of steps 3 and 4: the raw query returns correct results, while the recorded metric from the ruler returns values as a series of integers 1, 2, 3, 4, 5, ...
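
As mentioned in step 2, the recorded samples reach Prometheus through the ruler's remote-write client. This is a trimmed sketch of the relevant part of my ruler config (the URL is a placeholder here; the full config is posted later in this thread):

    ruler:
      wal:
        dir: /var/loki                      # recorded samples are buffered in a WAL before remote write
      storage:
        type: local
        local:
          directory: /etc/loki/rules/fake   # rule files are mounted here
      remote_write:
        enabled: true
        client:
          url: <prometheus-or-vm-remote-write-endpoint>   # placeholder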

Other actions I have tried that did not help:

  1. Disabling Memcached
  2. Restarting the whole Loki cluster
  3. Upgrading the Loki ruler from v2.5.0 to v2.6.0

Expected behavior: The recorded metric in Prometheus should be the same as the result in Loki; the value should be a larger count, not 1, 2, 3, 4, 5, ...

Environment:

Screenshots, Promtail config, or terminal output

Log

level=info ts=2022-09-05T04:17:49.368788806Z caller=metrics.go:133 component=ruler org_id=..data latency=fast query="sum by(template,tenant_id)(count_over_time({job=\"wptiminglog\"} |= \"Open Analysis\" | regexp \"^([^;]*);(?P<hostname>[^;]*);(?P<timeStamp>[^;]*);(?P<timestamputc>[^;]*);(?P<endtime>[^;]*);(?P<duration>[^;]*);(?P<sessionid>[^;]*);([^;]*);([^;]*);(?P<operation>[^;]*);([^;]*);(?P<template>[^;]*)\"[10d]))" query_type=metric range_type=instant length=0s step=0s duration=1.861323ms status=200 limit=0 returned_lines=0 throughput=5.0MB total_bytes=9.4kB total_entries=3 queue_time=0s subqueries=1
ts=2022-09-05T04:18:01.689854307Z caller=spanlogger.go:80 level=info msg="index list cache built" duration=12.466904095s
level=info ts=2022-09-05T04:18:01.820402413Z caller=index_set.go:422 table-name=loki_index_19240 file-name=loki-loki-distributed-ingester-1-1662010967895582488-1662350400.gz msg="downloaded file" total_time=126.974762ms
level=info ts=2022-09-05T04:18:01.820553115Z caller=index_set.go:422 table-name=loki_index_19240 file-name=loki-loki-distributed-ingester-0-1662010999371274027-1662350400.gz msg="downloaded file" total_time=127.160465ms
level=info ts=2022-09-05T04:18:01.845580023Z caller=table_manager.go:252 msg="query readiness setup completed" duration=1.7µs distinct_users_len=0
level=info ts=2022-09-05T04:18:49.336852385Z caller=metrics.go:133 component=ruler org_id=..data latency=fast query="count_over_time({job=\"wptiminglog\"} |= \"Open Analysis\"[1d])" query_type=metric range_type=instant length=0s step=0s duration=1.399917ms status=200 limit=0 returned_lines=0 throughput=7.0MB total_bytes=9.8kB total_entries=4 queue_time=0s subqueries=1
level=info ts=2022-09-05T04:18:49.337558094Z caller=manager.go:202 storage=registry manager=tenant-wal msg="dynamically updated instance" instance=..data
level=info ts=2022-09-05T04:18:49.365422536Z caller=metrics.go:133 component=ruler org_id=..data latency=fast query="count_over_time({filename=\"/var/log/nginx/json_access.log\"}[1m])" query_type=metric range_type=instant length=0s step=0s duration=27.604839ms status=200 limit=0 returned_lines=0 throughput=845MB total_bytes=23MB total_entries=384 queue_time=0s subqueries=1
level=info ts=2022-09-05T04:18:49.369179583Z caller=metrics.go:133 component=ruler org_id=..data latency=fast query="sum by(template,tenant_id)(count_over_time({job=\"wptiminglog\"} |= \"Open Analysis\" | regexp \"^([^;]*);(?P<hostname>[^;]*);(?P<timeStamp>[^;]*);(?P<timestamputc>[^;]*);(?P<endtime>[^;]*);(?P<duration>[^;]*);(?P<sessionid>[^;]*);([^;]*);([^;]*);(?P<operation>[^;]*);([^;]*);(?P<template>[^;]*)\"[10d]))" query_type=metric range_type=instant length=0s step=0s duration=1.882223ms status=200 limit=0 returned_lines=0 throughput=5.2MB total_bytes=9.8kB total_entries=3 queue_time=0s subqueries=1
level=info ts=2022-09-05T04:19:49.338319059Z caller=metrics.go:133 component=ruler org_id=..data latency=fast query="count_over_time({job=\"wptiminglog\"} |= \"Open Analysis\"[1d])" query_type=metric range_type=instant length=0s step=0s duration=2.207826ms status=200 limit=0 returned_lines=0 throughput=4.9MB total_bytes=11kB total_entries=4 queue_time=0s subqueries=1
level=info ts=2022-09-05T04:19:49.338804264Z caller=manager.go:202 storage=registry manager=tenant-wal msg="dynamically updated instance" instance=..data
level=info ts=2022-09-05T04:19:49.346220953Z caller=metrics.go:133 component=ruler org_id=..data latency=fast query="count_over_time({filename=\"/var/log/nginx/json_access.log\"}[1m])" query_type=metric range_type=instant length=0s step=0s duration=7.195385ms status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B total_entries=0 queue_time=0s subqueries=1
level=warn ts=2022-09-05T04:19:49.346279953Z caller=manager.go:610 user=..data group=spotfire-test msg="Evaluating rule failed" rule="record: spotfire_demo\nexpr: count_over_time({filename=\"/var/log/nginx/json_access.log\"}[1m])\n" err="wrong chunk metadata"
level=info ts=2022-09-05T04:19:49.348430279Z caller=metrics.go:133 component=ruler org_id=..data latency=fast query="sum by(template,tenant_id)(count_over_time({job=\"wptiminglog\"} |= \"Open Analysis\" | regexp \"^([^;]*);(?P<hostname>[^;]*);(?P<timeStamp>[^;]*);(?P<timestamputc>[^;]*);(?P<endtime>[^;]*);(?P<duration>[^;]*);(?P<sessionid>[^;]*);([^;]*);([^;]*);(?P<operation>[^;]*);([^;]*);(?P<template>[^;]*)\"[10d]))" query_type=metric range_type=instant length=0s step=0s duration=1.986223ms status=200 limit=0 returned_lines=0 throughput=5.4MB total_bytes=11kB total_entries=3 queue_time=0s subqueries=1
level=info ts=2022-09-05T04:20:49.337084538Z caller=metrics.go:133 component=ruler org_id=..data latency=fast query="count_over_time({job=\"wptiminglog\"} |= \"Open Analysis\"[1d])" query_type=metric range_type=instant length=0s step=0s duration=1.326616ms status=200 limit=0 returned_lines=0 throughput=8.1MB total_bytes=11kB total_entries=4 queue_time=0s subqueries=1
level=info ts=2022-09-05T04:20:49.337672645Z caller=manager.go:202 storage=registry manager=tenant-wal msg="dynamically updated instance" instance=..data
level=info ts=2022-09-05T04:20:49.357146784Z caller=metrics.go:133 component=ruler org_id=..data latency=fast query="count_over_time({filename=\"/var/log/nginx/json_access.log\"}[1m])" query_type=metric range_type=instant length=0s step=0s duration=19.237536ms status=200 limit=0 returned_lines=0 throughput=626MB total_bytes=12MB total_entries=259 queue_time=0s subqueries=1

Sometimes there are error logs with "wrong chunk metadata" (I'm sure I put the config into the right folder, and this error only occurs occasionally):

level=warn ts=2022-09-05T04:04:49.663172799Z caller=background.go:127 msg="backgroundCache writeBackLoop Cache.Store fail" err="server=10.244.9.201:11211: circuit breaker is open"
level=info ts=2022-09-05T04:05:49.344820077Z caller=metrics.go:133 component=ruler org_id=..data latency=fast query="count_over_time({job=\"wptiminglog\"} |= \"Open Analysis\"[1d])" query_type=metric range_type=instant length=0s step=0s duration=8.773808ms status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B total_entries=0 queue_time=0s subqueries=1
level=warn ts=2022-09-05T04:05:49.344887778Z caller=manager.go:610 user=..data group=spotfire-test msg="Evaluating rule failed" rule="record: spotfire_log_count_1d\nexpr: count_over_time({job=\"wptiminglog\"} |= \"Open Analysis\"[1d])\n" err="wrong chunk metadata"
level=info ts=2022-09-05T04:05:49.367880661Z caller=metrics.go:133 component=ruler org_id=..data latency=fast query="count_over_time({filename=\"/var/log/nginx/json_access.log\"}[1m])" query_type=metric range_type=instant length=0s step=0s duration=22.826281ms status=200 limit=0 returned_lines=0 throughput=637MB total_bytes=14MB total_entries=363 queue_time=0s subqueries=1
level=info ts=2022-09-05T04:05:49.368779873Z caller=manager.go:202 storage=registry manager=tenant-wal msg="dynamically updated instance" instance=..data
dannykopping commented 2 years ago

Thank you for the very detailed description of the problem @qrr1995!

Was your Prometheus server also impacted by the same outage?

The ruler acts like a querier, so whatever query results you see by querying Loki directly should be identical to what the ruler sees.

Please consult this documentation to see if any of those metrics show any problems: https://grafana.com/docs/loki/latest/operations/recording-rules/#observability
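
As one concrete illustration (a sketch, not taken verbatim from that page): the ruler exposes a counter for the samples it remote-writes, so you can alert if that counter stops increasing. The metric name below is one of the ruler's remote-write counters; the rate window, duration, and label are arbitrary choices:

    groups:
      - name: loki-ruler-remote-write
        rules:
          - alert: LokiRulerRemoteWriteStalled
            # Fires when the ruler stops shipping recorded samples to the remote store.
            # The 5m window and 10m duration are illustrative, not recommendations.
            expr: rate(loki_ruler_wal_prometheus_remote_storage_samples_total[5m]) == 0
            for: 10m
            labels:
              severity: warning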

zyxbest commented 2 years ago

@dannykopping Thanks for your reply.

My Prometheus server (which is actually VictoriaMetrics) also suffered the outage and recovered later.

That's my question: why does the query behave differently between the ruler and Grafana Explore? Does this look like a bug?

The ruler worked perfectly for several months before the outage, and the query results were identical then.

dannykopping commented 2 years ago

That's my question: why does the query behave differently between the ruler and Grafana Explore? Does this look like a bug?

Well, you're not comparing the ruler and your Loki queries - you're comparing the metrics in VM and Loki.

zyxbest commented 2 years ago

@dannykopping Sorry, I don't really get it. In my earlier experience, the Loki ruler always got the expected result and inserted it into Prometheus, but now it doesn't (even though I haven't changed the configuration).

After the data was inserted into VM, the only query I ran was for the new metric itself, to check the recording rule result.

For instance, count_over_time should never return 1, 2, 3, 4, ... since I have 100~1000 logs every day. Do you mean this is normal behavior?

What I want to say is that the query in Loki:

          count_over_time({job="wptiminglog"} |= "Open Analysis" [1d])

should be consistent with the metric in Prometheus:

   spotfire_log_count_1d

since it's generated by a recording rule (and it actually did match over the past months):

    rules:
      - record: spotfire_log_count_1d
        expr: |
          count_over_time({job="wptiminglog"} |= "Open Analysis" [1d])
dannykopping commented 2 years ago

What I'm saying is: you're not looking directly at the result of the Loki ruler. You're looking at VM, which is supposed to display the metrics correctly, but if there's a problem with VM then it has nothing to do with the ruler.

So, like I said: first you'll need to follow the documentation to see if there were any issues with the ruler producing the samples or sending them to VM. After that, I'd suggest checking the VM logs to make sure there isn't an issue there.

The ruler and the queriers do the same work, and since the queriers are producing the correct result, I don't think there is enough evidence yet to suggest that the issue you're encountering originates with Loki.

zyxbest commented 2 years ago

Thanks, now I get you. You mean that:

  1. First, the Loki ruler got the correct data, which is then waiting to be sent to VM.
  2. There might be something wrong with VM, so the metrics could have been inserted or stored incorrectly.

I'll take a look, although VM has looked healthy overall these days.

dannykopping commented 2 years ago

I think it's possible, and we should rule it out first.

Also, the metrics listed in the documentation linked above will tell you whether there were any issues sending metrics to VM.

zyxbest commented 2 years ago

@dannykopping Thanks for the suggestion. I've checked the metrics, and nothing unusual was found: no failed samples, no lagging remote write, and no corrupt WAL.

But the tenant changed significantly after the outage.

[screenshot]

The screenshot shows the rate of loki_ruler_wal_prometheus_remote_storage_samples_total.

Since I've set auth_enabled to false, the tenant is supposed to be fake, but now it's set dynamically to the start time of the ruler pod, e.g. ..2022_09_05_02_40_29.890235257

Could this be the cause? My ruler config is:

    ruler:
      enable_sharding: true
      wal:
        dir: /var/loki
      storage:
        type: local
        local:
          directory: /etc/loki/rules/fake
      ring:
        kvstore:
          store: memberlist
      rule_path: /tmp/scratch
dannykopping commented 2 years ago

That's strange.

Please port-forward to a ruler pod, curl http://<host:port>/metrics, and send me the output.

zyxbest commented 2 years ago

@dannykopping Here is the output from port 3100 of the ruler: ~removed~

dannykopping commented 2 years ago

OK, thanks. I suspect this is because the tenant name is extracted from the file path, and the regex is somehow breaking.

dannykopping commented 2 years ago

I do see quite a lot of error logs, though:

loki_log_messages_total{level="error"} 9672

Other than that, everything looks healthy: no dropped samples, no remote-write errors.

zyxbest commented 2 years ago

@dannykopping Does that counter represent error logs produced by the ruler itself? The only kind of error log I see looks like this:

level=warn ts=2022-09-05T09:07:49.337570275Z caller=manager.go:610 user=..data group=spotfire-test msg="Evaluating rule failed" rule="record: spotfire_log_count_1d\nexpr: count_over_time({job=\"wptiminglog\"} |= \"Open Analysis\"[1d])\n" err="wrong chunk metadata"
level=warn ts=2022-09-05T09:08:17.511942184Z caller=manager.go:610 user=..2022_09_05_08_44_52.194414241 group=spotfire-test msg="Evaluating rule failed" rule="record: spotfire_log_count_1d\nexpr: count_over_time({job=\"wptiminglog\"} |= \"Open Analysis\"[1d])\n" err="wrong chunk metadata"

There are quite a lot of them, and they keep increasing.

(I have searched for this "wrong chunk metadata" error many times but haven't found the right solution.)

dannykopping commented 2 years ago

Hhmm, that's odd. That means the ruler is not able to read the chunks correctly, I think. Can you please share your config?

zyxbest commented 2 years ago

@dannykopping Here is the whole Loki config:

auth_enabled: false

server:
  log_level: info
  http_listen_port: 3100
  http_server_read_timeout: 300s
  http_server_write_timeout: 300s
  grpc_server_max_recv_msg_size: 47185920
  grpc_server_max_send_msg_size: 47185920

distributor:
  ring:
    kvstore:
      store: memberlist

memberlist:
  join_members:
    - loki-loki-distributed-memberlist

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
  chunk_idle_period: 1h
  chunk_block_size: 262144
  chunk_encoding: snappy
  chunk_retain_period: 1m
  max_transfer_retries: 0
  chunk_target_size: 2621440
  max_chunk_age: 1h
  wal:
    dir: /var/loki/wal

querier:
  max_concurrent: 1000
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
  ingestion_rate_mb: 30
  ingestion_burst_size_mb: 20
  max_concurrent_tail_requests: 200
  max_query_parallelism: 32
  max_query_series: 50000
  max_entries_limit_per_query: 2000000
  max_query_length: 0h
  max_streams_per_user: 0
  max_global_streams_per_user: 0
  split_queries_by_interval: 1440m
schema_config:
  configs:
    - from: 2020-09-07
      store: boltdb-shipper
      object_store: azure
      schema: v11
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  boltdb_shipper:
    shared_store: azure
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 168h
  azure:
    container_name: <replaced>
    account_name: <replaced>
    account_key: <replaced>
  index_queries_cache_config:
    memcached:
      batch_size: 10000
      parallelism: 10000
    memcached_client:
      consistent_hash: true
      host: loki-loki-distributed-memcached-index-queries
      service: http

chunk_store_config:
  max_look_back_period: 0s
  chunk_cache_config:
    memcached:
      batch_size: 10000
      parallelism: 10000
    memcached_client:
      consistent_hash: true
      host: loki-loki-distributed-memcached-chunks
      service: http

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

query_range:
  align_queries_with_step: true
  max_retries: 5
  cache_results: true
  results_cache:
    cache:
      #enable_fifocache: true
      #fifocache:
      # max_size_items: 1024
      # validity: 24h
      memcached:
        batch_size: 10000
        parallelism: 10000
        expiration: 24h
      memcached_client:
        consistent_hash: true
        host: loki-loki-distributed-memcached-frontend
        max_idle_conns: 16
        service: http
        timeout: 500ms
        update_interval: 1m

frontend_worker:
  frontend_address: loki-loki-distributed-query-frontend:9095
  grpc_client_config:
    max_send_msg_size: 47185920
    max_recv_msg_size: 47185920

frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  max_outstanding_per_tenant: 10000

compactor:
  shared_store: filesystem

ruler:
  enable_sharding: true
  wal:
    dir: /var/loki
  storage:
    type: local
    local:
      directory: /etc/loki/rules/fake
  ring:
    kvstore:
      store: memberlist
  rule_path: /tmp/scratch
  alertmanager_url: https://alertmanager.xx
  external_url: https://alertmanager.xx
  remote_write:
    enabled: true
    client:
      url: <replaced>
dannykopping commented 2 years ago

Thank you.

Hhmm, does the ruler have permission to access the Azure storage? I can't think of a reason why the Loki queriers would be returning legit results while the ruler doesn't - it's the same code.

zyxbest commented 2 years ago

does the ruler have permission to access the Azure storage?

Yes, for sure, since some other rules execute correctly. (I'll check the storage settings later.)

It should be the same behavior. :(

I'm trying to find a way to get rid of this. Maybe in a few days I'll update Promtail to put the logs under a new label, or create a new cluster.