sourcehawk opened 5 months ago
How many blocks do you have? Can you bump to 0.35.0?
What do you mean by how many blocks?
Bumped thanos versions to 0.35.1, redeployed with all the stores set. The red line that is spiking in the graphs is caused by the EKS node which has storegateway on it:
I guess it's fetching block metadata on startup (the gateway). Does it stabilize eventually, and is the querier less noisy?
It stays that high indefinitely. The fact that it "stabilizes" is not really a good thing when it stays at 500+MB/s bandwidth 😅
Storage gw is also 0.35.0 right?
Yeah all thanos components are now 0.35.1
How many blocks do you have in object storage roughly? Is your compactor working well?
The compactor seems to be doing its job quite well, and there are fewer than 800 objects reported, totaling a little under 30GB of data.
@sourcehawk traffic between querier and store gateway is triggered by incoming queries. We can't say 500 MB/s is unnatural unless we know a bunch of things, like:
After much debugging we've come to realize the traffic is probably being generated by an infinite recursion loop between the thanos ruler and the querier when the ruler is added as a store on the querier. My best guess is that the ruler queries the querier while the querier also queries the ruler, causing an infinite call loop that hits both the sidecar and the store gateway.
This is the network traffic after removing thanos ruler, as can be seen, the traffic of all types drops almost instantaneously to zero.
Interesting. You can deploy a separate querier that will query almost everything, excluding the Ruler, and point the Ruler to this one.
Just out of curiosity: if you enabled remote_write on the Ruler, that would stop the Store API on it; possibly that could help?
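A rough sketch of the dedicated-querier variant, assuming plain container args and reusing the service names that appear elsewhere in this thread (the receive service and the extra querier name are assumptions):

# Second querier for rule evaluation only: it does not list the Ruler as an endpoint.
rulerQuerier:
  args:
    - query
    - --endpoint=monitoring-thanos-discovery:10901     # sidecar
    - --endpoint=monitoring-thanos-storegateway:10901  # store gateway
    - --endpoint=monitoring-thanos-receive:10901       # receive (name assumed)
# Point the Ruler at that querier's HTTP address instead of the main one.
thanosRuler:
  args:
    - rule
    - --query=monitoring-thanos-query-ruler:10902      # port depends on your deployment

This way the Ruler's rule evaluations never fan back into a querier that also treats the Ruler as a Store API endpoint.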
It would be great if someone could clarify whether the ruler was ever intended to be added as a store on the Querier. If not, I'll close this issue.
Can someone confirm there is indeed a loop? I don't believe it can run the same query several times. But if there really is an infinite loop, this is a very expensive issue.
The confusing thing about this to me is that the ruler should not ask the querier to answer (Store API) queries; it should answer those from its local TSDB. The querier is only needed when the ruler evaluates rules, which is a completely different concern. In theory it should not be possible to build a loop in query evaluation between ruler and querier.
@sourcehawk can you share your ruler configuration please?
I don't think it is really a loop. Queries originating from the Ruler to the Querier and requests from the Querier to the Ruler are two different code paths. The request from the Querier to the Ruler should only query the Ruler's local TSDB. Do you have really heavy rules that query long-term data?
We are using kube-prometheus-stack and bitnami/thanos helm charts together. While writing it up I stumbled upon this particular section which looks quite suspicious:
thanosRuler:
  thanosRulerSpec:
    queryEndpoints:
      - http://monitoring-thanos-query:9090
Although the description of that particular setting is the following:
QueryEndpoints defines Thanos querier endpoints from which to query metrics. Maps to the --query flag of thanos ruler. queryEndpoints: []
which makes it sound like what I configured (the URL to the thanos querier) is the expected value. :shrug:
@sourcehawk did you get anywhere with this issue?
I am seeing very similar behaviour, excessive and growing network traffic on the querier, but I am not using Ruler and have a much different use case.
I have 3 prometheus instances setup to scrape their own AWS availability zone only, to reduce cross-AZ data transfer charges. These prometheus instances remote write to a central Mimir cluster which does the bulk of our metric querying. However I wanted to keep autoscaling metric queries from keda within the cluster to avoid going cross-region etc.
So I have thanos query deployed with the prometheus instances (and their thanos sidecars) as endpoints. The querier only receives autoscaling queries, which are instant queries of the form:
sum(
  sum by (namespace, pod, container) (irate(container_cpu_usage_seconds_total{job="kubelet", image!="", namespace="$NAMESPACE",container="$CONTAINER_NAME"}[1m]))
  / on (namespace,pod,container) group_left(resource)
  min(kube_pod_container_resource_limits{resource="memory"}) by (namespace,pod,container,resource)
) * 100
The kube_pod_container_resource_limits series exists on 1 of the prometheus instances and the relevant container_cpu_usage_seconds_total series are (potentially) spread across all 3.
The response to these queries is ~130 bytes.
In the particular environment I am debugging I'm seeing inbound network bandwidth to the thanos querier pods of over 6MB/s which is 2x the total remote_write outbound bandwidth. This seems very wrong.
I duplicated my thanos querier deployment to isolate it. Manually running one of the autoscaling queries above once against a single querier pod resulted in a network graph like this
However... if I modify the query to also include the {namespace="$NAMESPACE",container="$CONTAINER_NAME"} selectors on the kube_pod_container_resource_limits part of the query, I get this:
Rolling out this query change in a staging environment dropped thanos-querier ingress from ~6MB/s to 35KB/s.
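For reference, the query change described above, the same label matchers applied to both sides of the join, would look roughly like this inside a keda Prometheus trigger. The ScaledObject wrapping, serverAddress and threshold below are illustrative placeholders, not taken from this thread.

# Illustrative keda trigger; only the extra matchers on kube_pod_container_resource_limits differ from the original query.
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://thanos-query:9090   # placeholder
      threshold: "80"                           # placeholder
      query: |
        sum(
          sum by (namespace, pod, container) (irate(container_cpu_usage_seconds_total{job="kubelet", image!="", namespace="$NAMESPACE",container="$CONTAINER_NAME"}[1m]))
          / on (namespace,pod,container) group_left(resource)
          min(kube_pod_container_resource_limits{namespace="$NAMESPACE",container="$CONTAINER_NAME",resource="memory"}) by (namespace,pod,container,resource)
        ) * 100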
Is the thanos-querier trying to stream all the series in kube_pod_container_resource_limits every time a query is run?
Is it possible you have a recording rule or something that's also causing this kind of pathological behaviour in the querier?
Is it viable for the querier to optimise join queries like this?
@hamishforbes Interesting... I didn't find a solution, but the problem disappeared after I disabled the thanos ruler deployment provided by the kube-prometheus-stack helm chart, as I mentioned in an earlier comment. I am using kube-prometheus-stack helm chart's default ruleset in my cluster. Your problem seems to be quite small compared to what I was seeing, but I also likely have a much larger set of rules than you do here.
I'd say it's quite clear that this is being caused by recording rules since the Thanos Ruler is the one that manages them, and I saw a drop from maxed out bandwidth on AWS instances to under 1MB/s instantly when removing Thanos Ruler. I am quite sure that if I was running larger instances on AWS, it still would have maxed out my bandwidth at some point.
I think the Thanos engine has an optimizer that sets the same matchers on both sides of the binary expression. Maybe that is broken for that query somehow. Are you using the Thanos engine?
No, I was using the default engine, haven't tested with the Thanos engine. I'll give that a go and see what happens.
I'm hitting the same problem with Ruler.
I'm on Thanos 0.36.1 with EKS v1.30.
The cross-AZ network traffic with Ruler enabled was something like 200MB/s on the Store Gateway!
AFAIK the documentation states that Ruler should be able to contact Alertmanagers, Query and the S3 bucket, so this should be the expected setup:
Nevertheless, as soon as I disabled the Ruler the traffic instantly dropped, so it's clear that Ruler is bombing the Store somehow with network traffic:
I've also tried to configure the Ruler with stateless remote write as shown here, but the network traffic to the Store Gateway does not improve. Here is the config I use for the Ruler in the kube-prometheus-stack chart:
thanosRuler:
  enabled: true
  thanosRulerSpec:
    additionalArgs:
      - name: remote-write.config
        value: |
          remote_write:
            - url: http://thanos-receive:19291/api/v1/receive
              remote_timeout: 30s
              follow_redirects: true
@irizzant Does it happen with an empty rules file for ruler too?
@MichaHoffmann I use the kube-prometheus-stack chart, which creates a bunch of Prometheus rules, but they're not empty
Does it also happen with an empty rules file? If not, can you maybe bisect the rules and see if some rules cause this?
FWIW I did eventually get around to enabling the Thanos query engine on our querier, which did seem to resolve the issue for my specific query. Definitely worth trying.
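For anyone following along, the engine is selected with a single flag on the querier; a minimal sketch, assuming plain container args:

# Sketch: switch the Querier to the Thanos PromQL engine (the default is "prometheus").
querier:
  args:
    - query
    - --query.promql-engine=thanos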
I think I found the problem at least in my case. I tried to disable all the kube-prometheus-stack rules and activate them one by one while monitoring network traffic.
What caused the traffic spike was enabling the kube-prometheus-stack rules for the API server, which created this huge spike in the Store Gateway:
Specifically the rule that increased the traffic was kube-prometheus-stack-kube-apiserver-burnrate.rules.
I was able to enable all the other rules without traffic spikes.
At a guess, they probably fetch a lot of data from the store gateway; burnrate sounds like something that fetches a month or so of data.
@irizzant The most expensive rule from kube-prometheus-stack-kube-apiserver-burnrate.rules seems to be one that looks back 3 days. This could indeed be expensive for the store gateway, as it tries to fetch the same index and chunks over and over again. Even if you have a remote chunks cache, you still need to consume bandwidth to download the cached data. What might help you is using the Redis chunks cache, as it supports client-side caching.
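For illustration, a caching-bucket config along those lines for the Store Gateway might look like the sketch below; the Redis address is a placeholder and the field names should be double-checked against the caching-bucket docs for your Thanos version.

# Sketch: passed to the Store Gateway via --store.caching-bucket.config-file.
type: REDIS
config:
  addr: redis.monitoring.svc:6379   # placeholder
  cache_size: 256MiB                # in-memory client-side cache size (assumed field, verify for your version)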
Do you configure your store gateways to only serve data older than a certain time? Say 24h, so that most of the queries from kube-prometheus-stack-kube-apiserver-burnrate.rules only hit your hot store (either sidecar or receiver).
@yeya24 can you please detail the following? I'm not sure I understand where to check or how to set it up:
Do you configure your store gateways to only serve data older than a certain time? Say 24h, so that most of the queries from kube-prometheus-stack-kube-apiserver-burnrate.rules only hit your hot store (either sidecar or receiver).
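For context, this usually means time-partitioning the Store Gateway itself so it only serves blocks older than some cutoff, leaving recent ranges to the sidecar or receiver. A minimal sketch, assuming plain container args:

# Sketch: Store Gateway limited to data older than 24h (relative durations are supported).
storegateway:
  args:
    - store
    - --max-time=-24h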
Thanos, Prometheus and Golang version used: Thanos: 0.34.1 Prometheus: v2.51.0 Golang: No Idea, running in containers on EKS
Object Storage Provider: S3
What happened: Thanos Querier causes insane network traffic. By insane I mean up to half a gigabyte of network bandwidth PER SECOND. I do not think anything but an infinite recursion loop could explain such amounts of network usage. Thanos Querier is currently the largest cost factor of our EKS environment, costing more than the entire compute infrastructure due to this network bandwidth. This is currently affecting all of the clusters we have the monitoring stack deployed on.
Here's an image depicting the network usage over a span of a few days. The leftmost graph shows the bandwidth of two EKS nodes totaling over 600MB/s inbound network bandwidth with a very weird pattern, alongside over 150MB/s outbound traffic and 200K packets being sent every second. At the end of the timeline I scaled the querier deployment to 0, indicating that the thanos querier is the sole culprit of this insane network bandwidth usage.
In the following configuration extracted from my querier pod definition, you can see that I have stores for thanos sidecar (thanos-discovery), storegateway, receive and ruler.
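Based on the endpoint names mentioned further down, the relevant part of that pod definition would have looked roughly like this (the receive and ruler service names are assumptions):

querier:
  args:
    - query
    - --endpoint=monitoring-thanos-discovery:10901      # thanos sidecar
    - --endpoint=monitoring-thanos-storegateway:10901   # store gateway
    - --endpoint=monitoring-thanos-receive:10901        # receive (name assumed)
    - --endpoint=monitoring-thanos-ruler:10901          # ruler (name assumed)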
The rise in network traffic on the graph below happens when I scaled up thanos querier from 0 to 1. One thing to note on the leftmost graph below is that there are two instances with high traffic (network in - bytes), one of those is running thanos storegateway and the other is running thanos querier.
Now if I remove the storegateway from the list of endpoints on the querier, my network traffic (network in) is reduced by a lot, and only one EKS node reports high inbound bandwidth, namely the one running the querier. The outbound traffic is coming from the same instance where the thanos sidecar pod resides (prometheus deployment).
Note that I also tried removing all the endpoints except for --endpoint=monitoring-thanos-discovery:10901, which led to no change from the above graph. This means the traffic is solely being generated by --endpoint=monitoring-thanos-discovery:10901 and --endpoint=monitoring-thanos-storegateway:10901 in the previous images.
Is thanos sidecar producing 80MB/s of data? That is simply not possible... that would mean I am generating 288GB of data on the sidecar every hour, which is 6.9 terabytes per day. The instances don't even have the disk capacity to store one hour of data of that amount. And I am not even talking about the amount of network bandwidth that is being used when storegateway is specified as a store on the querier, at which point the inbound network usage could be 43 terabytes per day.
Where is this traffic coming from? Is this amount of network traffic expected?
What you expected to happen: The network traffic to be a reasonable few megabytes per second.
How to reproduce it (as minimally and precisely as possible): Not sure
Full logs to relevant components:
Sidecar pod logs:
Thanos querier logs:
Anything else we need to know: