thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0
12.64k stars 2.02k forks source link

0.35: Panic with query mode distributed #7328

Open jkroepke opened 2 weeks ago

jkroepke commented 2 weeks ago

Thanos, Prometheus and Golang version used: 0.35.0

Object Storage Provider: Azure Storage Account

What happened:

After enable query.mode=distributed my querier get a lot of panics. Removing --query.mode=distributed stops all pancis

What you expected to happen:

No panics

How to reproduce it (as minimally and precisely as possible):

At the moment, I'm unable to provide a minimal reproducible environment. However, according ruler logs, all queries like absent(up{job=\"kube-proxy\"} == 1) (job can have any label) seems affected.

We are using stateless rulers and thanos receive, no sidecars.

Thanos querier arguments:


query --log.level=info --log.format=json --grpc-address=0.0.0.0:10901 --http-address=0.0.0.0:10902 --query.replica-label=replica --query.replica-label=prometheus_replica --query.replica-label=thanos_receive_replica --query.replica-label=thanos_ruler_replica --endpoint=opsstack-thanos-storegateway.opsstack.svc.cluster.local.:10901 --endpoint=opsstack-thanos-receive.opsstack.svc.cluster.local.:10901 --alert.query-url=http://opsstack-thanos-query.opsstack.svc.cluster.local:10902 --enable-auto-gomemlimit --query.promql-engine=thanos --query.mode=distributed --query.auto-downsampling --query.default-tenant-id=opsstack --web.disable-cors --web.prefix-header=X-Forwarded-Prefix

Full logs to relevant components:

Uncomment if you would like to post collapsible logs:

Ruler Logs

Just a few, they are repeating ``` {"caller":"rule.go:968","component":"rules","err":"rpc error: code = Internal desc = runtime error: index out of range [0] with length 0","level":"error","query":"absent(up{job=\"kube-proxy\"} == 1)","ts":"2024-05-02T13:03:09.954559928Z"} {"caller":"rule.go:938","component":"rules","err":"read query instant response: perform POST request against http://opsstack-thanos-query.opsstack.svc.cluster.local:10902/api/v1/query: Post \"http://opsstack-thanos-query.opsstack.svc.cluster.local:10902/api/v1/query\": EOF","level":"error","query":"absent(up{job=\"apiserver\"} == 1)","ts":"2024-05-02T12:59:26.372712478Z"} ```

Querier Logs

https://gist.github.com/jkroepke/9fc58319bf819866138a8dae4f1c8d92

Anything else we need to know:

Environment:

MichaHoffmann commented 2 weeks ago

The issue here was that the querier is not configured to point at query APIs but at store APIs ( we should guard against that better probably ); this leads to the promql-engine distributing to 0 other engines, which exposes a bug where we dont guard against that!

Edit: see https://cloud-native.slack.com/archives/CK5RSSC10/p1714654267669009