Open bwplotka opened 2 years ago
The configuration of Thanos Ruler:
` - args:
There was no change in the config. Just downgrading from v0.25.2 to v0.24.0 fixed the problem.
I am encountering a similar issue since upgrading to v0.28.1. Many rules are failing to be evaluated by the ruler with the error `no query API server reachable`. Was this issue ever resolved? @bwplotka @yeya24
I'm seeing the same issue after upgrading to v0.29.0. A couple of findings: with around 4k+ targets it works fine, whereas once the targets increased to 24k we started running into the error "No query API server reachable".
Additional info from the logs:
LabelSets: Mint: -62167219200000 Maxt: 9223372036854775807: rpc error: code = Unknown desc = query Prometheus: request failed with code 503 Service Unavailable; msg Service Unavailable\"}
Also, can someone help me understand whether all rules are executed simultaneously?
Hi Team, 5903 as per the suggestion, we upgraded to 0.29.0, and we have been seeing this issue ever since. Is there any workaround, or could you please advise on how to deal with it? Thanks in advance!
@bwplotka Is this issue addressed in version 0.30.0? Any pointers on this issue would be helpful. Thanks in advance!!
@bwplotka we have the same problem with the 0.30.0 ruler. We deploy it with the ThanosRuler CRD and use dnssrv record discovery in Kubernetes.
@bwplotka it looks like `partial_response_strategy` needs to be enabled for ruler rules now.
Without that specific flag it is not possible to query with missing stores, as it returns errors.
Hey @Cellebyte I'm having the same issue, could you clarify better how you fixed it?
As per Thanos documentation: "It is recommended to keep partial response as abort for alerts and that is the default as well."
What exactly did you enable and how? I'm using ThanosRuler CRD if that helps
@Migueljfs you need to set `partial_response_strategy: "warn"`, because the ruler will fail if one of the StoreAPIs of your querier is not reachable or does not answer the ruler's rule request.
We cover the problem mentioned above with an additional alert that checks whether our remote query is reachable, using `vector(0)` or the `up` metric for the remote cluster.
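For anyone looking for a concrete starting point, here is a minimal sketch of a rule file that combines both ideas from the comment above. It assumes a plain Thanos Ruler rule file, where `partial_response_strategy` is set per rule group. The group name, alert name, `cluster` label, and durations are hypothetical and need to be adapted to your setup. The `absent(up{...})` expression is only one possible way to build the reachability check from the `up` metric, not necessarily the exact alert described above.

```yaml
# Sketch only: group, alert, and label names are hypothetical.
groups:
  - name: remote-cluster-health
    # Thanos-specific group field. The default is "abort"; "warn" lets the
    # ruler evaluate on a partial result when a StoreAPI is unreachable.
    partial_response_strategy: "warn"
    rules:
      - alert: RemoteClusterQueryUnreachable
        # Fires when no `up` series for the remote cluster comes back at all,
        # which is what a missing store looks like under "warn".
        expr: absent(up{cluster="remote-cluster"})
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "No up series returned for remote-cluster"
```

Note that with `"warn"` other alerts in the group may evaluate on incomplete data, which is why the Thanos documentation recommends `abort` for alerts. The extra reachability alert is meant to make that situation visible.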
We have identified the issue. In our case the problem was with one of the Prometheus shards, which had used up all its memory and was not responding. After cleaning up its data (removing the WAL, head_chunks and TSDB, which may cause data loss) and bringing the shards up clean, it started working.
Did anyone get a fix for the above issue? I have set `partial_response_strategy: "warn"` in my rules file, but I still get the same error: "no query API server reachable".
Below is the command I have used to bring up my ruler.
/bin/thanos rule --data-dir /var/lib/prometheus-ruler/ --eval-interval 30s --rule-file /etc/prometheus/alert/*.yml --alert.query-url http:/<prom-server-1>:9090 --alertmanagers.url http://localhost:9093 --objstore.config-file /etc/prometheus/bucket.yml --query http://<prom-server-1>:129090 --query http://<prom-server-2>:29090 --label 'monitor_cluster="eu1"' --label 'replica="prom-server101"'
Can someone help with this issue? Or does any other version of Thanos handle this error?
Having a similar issue running Thanos v0.31.0 via the `ThanosRuler` CRD (Prometheus Operator).
I changed the `--query` value from a load balancer (with Thanos Query instances as its endpoints) to the direct Thanos Query endpoint names. The problem disappeared immediately :)
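For `ThanosRuler` CRD users, the change described above might look roughly like the following sketch. It assumes the Prometheus Operator `queryEndpoints` field (which maps to the ruler's `--query` flag) and a headless Service for Thanos Query, so that DNS SRV discovery returns the individual Query pods instead of a single load-balanced address. The resource name, namespace, Service name, and the `http` port name are all hypothetical.

```yaml
# Sketch only: names, namespace, and the SRV record are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ThanosRuler
metadata:
  name: example
  namespace: monitoring
spec:
  queryEndpoints:
    # Discover each Thanos Query pod via SRV lookup on a headless Service
    # rather than pointing at one load-balanced endpoint.
    - dnssrv+_http._tcp.thanos-query.monitoring.svc.cluster.local
```

Whether this helps likely depends on why the load-balanced endpoint was failing, so treat it as something to try, not as a confirmed fix.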
@bwplotka hey 👋 I ran into this issue today as well. I can resolve the Thanos Query address normally from within the Thanos Ruler container. I am also using Thanos v0.29 via Prometheus Operator, and the configuration seems to be passed correctly to Thanos. Any clues or hints on what might be the issue? Thanks!
One user shared that their Rulers were having hiccups finding the right Querier endpoints, resulting in gaps:
Apparently reverting to v0.24.0 resolved the issue. This seems to be a stateful Ruler.
We will need to have more information e.g: