Open zyxbest opened 2 years ago
Thank you for the very detailed description of the problem @qrr1995!
Was your prometheus server also impacted by the same outage?
The ruler acts like a querier, so whatever query results you're seeing by querying Loki directly should be identical to the ruler.
Please consult this documentation to see if any of those metrics show any problems: https://grafana.com/docs/loki/latest/operations/recording-rules/#observability
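As a minimal sketch of what that check can look like (the host/port is a placeholder for your Prometheus/VictoriaMetrics endpoint; the full list of relevant metric names is in the linked docs):

# Rate of samples the ruler WAL is shipping via remote write
curl -G 'http://<prometheus-or-vm-host>:<port>/api/v1/query' \
  --data-urlencode 'query=rate(loki_ruler_wal_prometheus_remote_storage_samples_total[5m])'

# Error-level log messages reported by Loki components (including the ruler)
curl -G 'http://<prometheus-or-vm-host>:<port>/api/v1/query' \
  --data-urlencode 'query=rate(loki_log_messages_total{level="error"}[5m])'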
@dannykopping Thanks for your reply.
My Prometheus server (which is actually VictoriaMetrics) also suffered the outage and recovered later.
That's my question: why does the query behave differently between the ruler and Grafana Explore? Does it look like a bug?
Since the ruler worked perfectly for the several months before the outage, the query results were also identical.
That's my question: why does the query behave differently between the ruler and Grafana Explore? Does it look like a bug?
Well, you're not comparing the ruler and your Loki queries - you're comparing the metrics in VM and Loki.
@dannykopping Sorry, I don't really get it. In my earlier experience the Loki ruler always produced the expected result and inserted it into Prometheus, but now it doesn't (even though I haven't changed the configuration).
After the data was inserted into VM, the only query I ran was on the new metric itself, to check the recording rule result.
For instance, count_over_time should never return 1, 2, 3, 4, ... since I have 100~1000 logs every day. Do you mean this is normal behavior?
What I want to say is that the query in Loki:
count_over_time({job="wptiminglog"} |= "Open Analysis" [1d])
should be consistent with the metric in Prometheus:
spotfire_log_count_1d
since it's generated by a recording rule (and it actually was consistent over the past months):
rules:
  - record: spotfire_log_count_1d
    expr: |
      count_over_time({job="wptiminglog"} |= "Open Analysis" [1d])
What I'm saying is: you're not looking directly at the result of the Loki ruler. You're looking at VM, which is supposed to display the metrics correctly, but if there's a problem with VM then it has nothing to do with the ruler.
So like I said: first you'll need to follow the documentation to see if there were any issues with the ruler producing the samples or sending to VM. After that, I'd suggest checking the VM logs to make sure there isn't an issue there.
The ruler and the queriers do the same work, and since the queriers are producing the correct result I think there isn't enough evidence yet to suggest that the issue you're encountering is an issue originating with Loki.
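One way to compare the two paths directly, outside of Grafana (a minimal sketch; host names and ports are placeholders, and the expression is the one from the recording rule discussed above):

# Evaluate the raw LogQL expression against Loki itself
curl -G 'http://<loki-query-frontend>:3100/loki/api/v1/query' \
  --data-urlencode 'query=count_over_time({job="wptiminglog"} |= "Open Analysis" [1d])'

# Compare with the recorded series in Prometheus/VictoriaMetrics
curl -G 'http://<prometheus-or-vm-host>:<port>/api/v1/query' \
  --data-urlencode 'query=spotfire_log_count_1d'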
Thanks, now I get you. You mean the problem could be on the VM side rather than in the ruler. I'll take a look, although these days the VM looks healthy overall.
I think it's possible, and we should rule it out first.
Also the metrics provided in the documentation link above will tell you if there were any issues sending metrics to VM.
@dannykopping Thanks for the suggestion. I've checked the metrics and found nothing unusual: no failed samples, no lagged remote write, and no corrupt WAL.
But the tenant changed significantly after the outage. The screenshot shows the rate of loki_ruler_wal_prometheus_remote_storage_samples_total. Since I've set auth_enabled to false, the tenant is supposed to be fake, but now it's set dynamically to the start datetime of the ruler pod, e.g. ..2022_09_05_02_40_29.890235257
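A sketch of the kind of query behind that screenshot (assuming the per-tenant label on this metric is called tenant; adjust it to whatever label name the panel actually uses):

# Remote-write sample rate from the ruler WAL, broken down per tenant
curl -G 'http://<prometheus-or-vm-host>:<port>/api/v1/query' \
  --data-urlencode 'query=sum by (tenant) (rate(loki_ruler_wal_prometheus_remote_storage_samples_total[5m]))'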
Could this be the cause? My ruler config is:
ruler:
  enable_sharding: true
  wal:
    dir: /var/loki
  storage:
    type: local
    local:
      directory: /etc/loki/rules/fake
  ring:
    kvstore:
      store: memberlist
  rule_path: /tmp/scratch
That's strange.
Please port-forward to a ruler pod, curl http://<host:port>/metrics, and send me the output.
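A minimal sketch of that (the pod name is a placeholder):

# Forward the ruler's HTTP port to your machine...
kubectl port-forward pod/<ruler-pod-name> 3100:3100
# ...then, from a second terminal, scrape its metrics and keep the ruler-specific ones
curl -s http://localhost:3100/metrics | grep loki_ruler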
@dannykopping Here is the output from port 3100 of the ruler: ~removed~
OK, thanks. I suspect this is because the tenant name is extracted from the file path, and the regex is somehow breaking.
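For context, this is roughly what a ConfigMap volume mount looks like on disk inside the pod (a sketch; the timestamp directory and the rules.yaml file name are illustrative, assuming the ConfigMap is mounted at the configured local rules directory):

ls -la /etc/loki/rules/fake
# ..2022_09_05_08_44_52.194414241/              <- timestamped data directory written by the kubelet
# ..data -> ..2022_09_05_08_44_52.194414241/    <- symlink to the current data directory
# rules.yaml -> ..data/rules.yaml               <- the actual rule file(s)
#
# If tenant names are derived from sub-paths of this directory, the ..data and
# ..<timestamp> entries can surface as bogus tenants like the ones above.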
I do see quite a lot of error logs:
loki_log_messages_total{level="error"} 9672
Other than that, everything looks healthy: no dropped samples, no remote-write errors.
@dannykopping Does that represent error logs produced by the ruler itself? The only kind of error log I see looks like this:
level=warn ts=2022-09-05T09:07:49.337570275Z caller=manager.go:610 user=..data group=spotfire-test msg="Evaluating rule failed" rule="record: spotfire_log_count_1d\nexpr: count_over_time({job=\"wptiminglog\"} |= \"Open Analysis\"[1d])\n" err="wrong chunk metadata"
level=warn ts=2022-09-05T09:08:17.511942184Z caller=manager.go:610 user=..2022_09_05_08_44_52.194414241 group=spotfire-test msg="Evaluating rule failed" rule="record: spotfire_log_count_1d\nexpr: count_over_time({job=\"wptiminglog\"} |= \"Open Analysis\"[1d])\n" err="wrong chunk metadata"
There are quite a lot of these, and they keep increasing.
(I have searched for this "wrong chunk metadata" error many times but haven't found the right solution.)
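A quick way to watch these from outside the pod (a sketch; the deployment name is a guess based on the Helm release naming visible elsewhere in the config):

# Tail the ruler's logs and keep only the failed rule evaluations
kubectl logs deploy/loki-loki-distributed-ruler -f | grep 'Evaluating rule failed'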
Hhmm, that's odd. That means the ruler is not able to read the chunks correctly, I think. Can you please share your config?
@dannykopping Here is the whole Loki config:
auth_enabled: false
server:
  log_level: info
  http_listen_port: 3100
  http_server_read_timeout: 300s
  http_server_write_timeout: 300s
  grpc_server_max_recv_msg_size: 47185920
  grpc_server_max_send_msg_size: 47185920
distributor:
  ring:
    kvstore:
      store: memberlist
memberlist:
  join_members:
    - loki-loki-distributed-memberlist
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
  chunk_idle_period: 1h
  chunk_block_size: 262144
  chunk_encoding: snappy
  chunk_retain_period: 1m
  max_transfer_retries: 0
  chunk_target_size: 2621440
  max_chunk_age: 1h
  wal:
    dir: /var/loki/wal
querier:
  max_concurrent: 1000
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
  ingestion_rate_mb: 30
  ingestion_burst_size_mb: 20
  max_concurrent_tail_requests: 200
  max_query_parallelism: 32
  max_query_series: 50000
  max_entries_limit_per_query: 2000000
  max_query_length: 0h
  max_streams_per_user: 0
  max_global_streams_per_user: 0
  split_queries_by_interval: 1440m
schema_config:
  configs:
    - from: 2020-09-07
      store: boltdb-shipper
      object_store: azure
      schema: v11
      index:
        prefix: loki_index_
        period: 24h
storage_config:
  boltdb_shipper:
    shared_store: azure
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 168h
  azure:
    container_name: <replaced>
    account_name: <replaced>
    account_key: <replaced>
  index_queries_cache_config:
    memcached:
      batch_size: 10000
      parallelism: 10000
    memcached_client:
      consistent_hash: true
      host: loki-loki-distributed-memcached-index-queries
      service: http
chunk_store_config:
  max_look_back_period: 0s
  chunk_cache_config:
    memcached:
      batch_size: 10000
      parallelism: 10000
    memcached_client:
      consistent_hash: true
      host: loki-loki-distributed-memcached-chunks
      service: http
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
query_range:
  align_queries_with_step: true
  max_retries: 5
  cache_results: true
  results_cache:
    cache:
      #enable_fifocache: true
      #fifocache:
      #  max_size_items: 1024
      #  validity: 24h
      memcached:
        batch_size: 10000
        parallelism: 10000
        expiration: 24h
      memcached_client:
        consistent_hash: true
        host: loki-loki-distributed-memcached-frontend
        max_idle_conns: 16
        service: http
        timeout: 500ms
        update_interval: 1m
frontend_worker:
  frontend_address: loki-loki-distributed-query-frontend:9095
  grpc_client_config:
    max_send_msg_size: 47185920
    max_recv_msg_size: 47185920
frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  max_outstanding_per_tenant: 10000
compactor:
  shared_store: filesystem
ruler:
  enable_sharding: true
  wal:
    dir: /var/loki
  storage:
    type: local
    local:
      directory: /etc/loki/rules/fake
  ring:
    kvstore:
      store: memberlist
  rule_path: /tmp/scratch
  alertmanager_url: https://alertmanager.xx
  external_url: https://alertmanager.xx
  remote_write:
    enabled: true
    client:
      url: <replaced>
Thank you.
Hhmm, does the ruler have permission to access the Azure storage? I can't think of a reason why the Loki queriers would be returning legit results while the ruler doesn't - it's the same code.
does the ruler have permission to access the Azure storage?
Yes sure, since some other rules can be executed correctly. (I'll check the storage settings later.)
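One way to sanity-check that access from outside Loki (a sketch using the Azure CLI; the container/account values are the same ones redacted from the config above, and flags may differ slightly between az versions):

# List a few blobs to confirm the credentials from the Loki config can read the container
az storage blob list \
  --container-name <replaced> \
  --account-name <replaced> \
  --account-key <replaced> \
  --num-results 5 \
  --output table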
It should be the same behavior. :(
I'm trying to find a way to get rid of this. Maybe in a few days I'll update Promtail to put the logs under a new label, or create a new cluster.
Describe the bug: After a Kubernetes outage, the recording rule of the Loki Ruler returns weird, incorrect query results which differ from the ones on the Grafana "Explore" page.
To Reproduce: Steps to reproduce the behavior:
A working distributed Loki cluster v2.5.0
Configure a simple recording rule as an example.
The ruler would run the query in Loki and send the result to Prometheus.
Run the raw query in Loki.
Run the new record query in Prometheus
Compare the results of step 3 and step 4: the raw query returns correct results, while the new record from the ruler returns values as a series of integers 1, 2, 3, 4, 5, ...
Other actions I have tried that did not work:
Expected behavior: The new recorded metrics in Prometheus should be the same as the query results in Loki; the value should be a larger count, not 1, 2, 3, 4, 5, ...
Environment:
Screenshots, Promtail config, or terminal output
Log
Sometimes there are error logs with "wrong chunk metadata". (I'm sure that I put the config into the right folder, and this error only occurs occasionally.)