Ruler: v0.25.2 no query API server unreachable

bwplotka commented 2 years ago

One user shared that our Rulers were having hiccups finding the right Querier endpoints, resulting in gaps:

Apparently reverting to v0.24.0 resolved the issue. This seems to be a stateful Ruler.

We will need more information, e.g.:

What was reverted: only the Ruler version, or anything else?
What is the configuration of the mentioned Ruler?

sharathfeb12 commented 2 years ago

The configuration of Thanos Ruler:

` - args:
   - rule
   - --log.level=debug
   - --log.format=logfmt
   - --grpc-address=0.0.0.0:10901
   - --http-address=0.0.0.0:10902
   - --objstore.config=$(OBJSTORE_CONFIG)
   - --data-dir=/thanos/data
   - --eval-interval=2m
   - --label=rule_replica="$(NAME)"
   - --alert.label-drop=rule_replica
   - --remote-write.config-file=/etc/thanos/conf/rw-config.yaml
   - --query=dnssrv+_http._tcp.observatorium-thanos-query-frontend.monitoring.svc.cluster.local
   - --rule-file=/etc/thanos/rules//.yaml`

There was no change in the config. Only the version change from v0.25.2 to v0.24.0 fixed the problem.

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

RohitKochhar commented 1 year ago

I have been encountering a similar issue since upgrading to v0.28.1. Many rules fail to evaluate in the Ruler with the error "no query API server reachable". Was this issue ever resolved? @bwplotka @yeya24

daganibhanu commented 1 year ago

I'm seeing the same issue after upgrading to v0.29.0. A couple of findings: with around 4k targets it works fine, whereas when the targets increase to 24k we run into the error "No query API server reachable".

Additional info from the logs:

LabelSets: Mint: -62167219200000 Maxt: 9223372036854775807: rpc error: code = Unknown desc = query Prometheus: request failed with code 503 Service Unavailable; msg Service Unavailable\"}

Also, can someone help me understand whether all rules are executed simultaneously?

daganibhanu commented 1 year ago

Hi team, as per the suggestion in #5903, we upgraded to 0.29.0, and since then we have been seeing this issue. Is there any workaround, or could you please help with how to deal with it? Thanks in advance!

daganibhanu commented 1 year ago

@bwplotka Do you know whether this issue is addressed in version 0.30.0? Any pointers on this issue would be helpful. Thanks in advance!

Cellebyte commented 1 year ago

@bwplotka we have the same problem with the 0.30.0 Ruler. We deploy it with the ThanosRuler CRD and use dnssrv record discovery in Kubernetes.

Cellebyte commented 1 year ago

@bwplotka it looks like partial_response_strategy now needs to be set for Ruler rules. Without that setting, it is not possible to query when stores are missing, because the query returns errors.

Migueljfs commented 1 year ago

Hey @Cellebyte, I'm having the same issue. Could you clarify how you fixed it?

As per Thanos documentation: "It is recommended to keep partial response as abort for alerts and that is the default as well."

What exactly did you enable, and how? I'm using the ThanosRuler CRD, if that helps.

Cellebyte commented 1 year ago

@Migueljfs you need to set partial_response_strategy: "warn", because the Ruler will fail if one of your Querier's StoreAPIs is not reachable or does not answer the Ruler's rule request.
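
For reference, a minimal sketch of where this setting goes in a Thanos rule file. The field is set per rule group; the group name, alert name, and expression below are placeholders, not taken from this thread:

```yaml
groups:
  - name: example-group                 # hypothetical group name
    # Evaluate rules even if some StoreAPIs behind the Querier are missing;
    # partial results produce a warning instead of a failed evaluation.
    partial_response_strategy: "warn"
    rules:
      - alert: ExampleAlert             # hypothetical alert
        expr: up == 0
        for: 5m
```

Note that with "warn" an alert may be evaluated on incomplete data, which is why the documentation quoted above recommends "abort" for alerts.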

Cellebyte commented 1 year ago

We cover the problem mentioned above with an additional alert that checks whether our remote query is reachable, using vector(0) or the up metric for the remote cluster.
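
One way such a check could look, as a rough sketch only. It assumes the remote cluster's series carry a `cluster` label, and it uses `absent()` on the up metric; the actual rule used here is not shown in the thread, and all names and values below are hypothetical:

```yaml
groups:
  - name: remote-query-reachability     # hypothetical group name
    partial_response_strategy: "warn"
    rules:
      - alert: RemoteClusterNotReachable  # hypothetical alert name
        # Fires when no `up` series from the remote cluster is visible at all,
        # which suggests its StoreAPI is missing from the query results.
        # The `cluster` label and its value are assumptions.
        expr: absent(up{cluster="remote-cluster"})
        for: 10m
        labels:
          severity: warning
```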

daganibhanu commented 1 year ago

We have identified the issue. In our case, the problem was with one of the Prometheus shards, which had used up all its memory and was not responding. After cleaning up its data (removing the WAL, head_chunks, and TSDB, which may cause data loss) and bringing the shard up clean, it started working.

sunilnerella commented 1 year ago

Did anyone get a fix for the above issue? I have set partial_response_strategy: "warn" in my rules file, but I still get the same error, "no query API server reachable". Below is the command I used to bring up my ruler:

/bin/thanos rule --data-dir /var/lib/prometheus-ruler/ --eval-interval 30s --rule-file /etc/prometheus/alert/*.yml --alert.query-url http:/<prom-server-1>:9090 --alertmanagers.url http://localhost:9093 --objstore.config-file /etc/prometheus/bucket.yml --query http://<prom-server-1>:129090 --query http://<prom-server-2>:29090 --label 'monitor_cluster="eu1"' --label 'replica="prom-server101"'

Can someone help with this issue? Or does any other version of Thanos handle this error?

zbialik commented 10 months ago

Having a similar issue running Thanos v0.31.0 via the ThanosRuler CRD (Prometheus Operator).

LukaszWasko commented 8 months ago

I changed the --query value from a load balancer (with Thanos Query instances as its endpoints) to the Thanos Query endpoint names directly. The problem disappeared immediately :)
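
A sketch of what that change might look like in container args, assuming two Thanos Query instances. The hostnames and port are hypothetical; the --query flag can be repeated, once per querier:

```yaml
- args:
    - rule
    # Before (hypothetical): a single load-balanced address
    # - --query=http://thanos-query-lb.example.internal:10902
    # After: each Thanos Query instance listed directly
    - --query=http://thanos-query-0.example.internal:10902
    - --query=http://thanos-query-1.example.internal:10902
```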

lilic commented 1 month ago

@bwplotka hey 👋 I ran into this issue today as well. I can resolve the Thanos Query address normally from within the Thanos Ruler container. I am also using Thanos v0.29 via Prometheus Operator, and the configuration seems to be passed correctly to Thanos. Any clues or hints on what might be the issue? Thanks!

thanos-io / thanos

Ruler: v0.25.2 no query API server unreachable #5321