sourcehawk opened 3 weeks ago
How many blocks do you have? Can you bump to 0.35.0?
What do you mean by how many blocks?
Bumped thanos versions to 0.35.1, redeployed with all the stores set. The red line spiking in the graphs corresponds to the EKS node which has storegateway on it:
I guess it's fetching block metadata on startup (the gateway). Does it stabilize eventually, and is the querier less noisy?
It stays that high indefinitely. The fact that it "stabilizes" is not really a good thing when it's sitting at 500+ MB/s of bandwidth 😅
Storage gw is also 0.35.0 right?
Yeah all thanos components are now 0.35.1
How many blocks do you have in object storage roughly? Is your compactor working well?
Compactor seems to be doing its job quite well, and there are fewer than 800 objects reported, totaling a little under 30 GB of data.
@sourcehawk traffic between querier and store gateway is triggered by incoming queries. We can't say 500 MB/s is unnatural unless we know a bunch of things, like:
After much debugging, we've come to realize the traffic is probably being generated by an infinite recursion loop between the thanos ruler and the querier when the ruler is added as a store on the querier. My best guess is that the ruler queries the querier while the querier also queries the ruler, causing an endless call loop that hits both the sidecar and the store gateway.
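To make the suspected loop concrete, the wiring looked roughly like this (the ruler and querier service names here are illustrative, following our naming pattern; ports are the Thanos defaults):

# The querier fans out every query to the ruler's Store API...
thanos query \
  --endpoint=monitoring-thanos-discovery:10901 \
  --endpoint=monitoring-thanos-storegateway:10901 \
  --endpoint=monitoring-thanos-ruler:10901

# ...while the ruler evaluates its rules against that same querier
thanos rule \
  --query=monitoring-thanos-querier:10902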
This is the network traffic after removing thanos ruler. As can be seen, traffic of all types drops almost instantaneously to zero.
Interesting. You can deploy a separate querier that will query almost everything, excluding the Ruler, and point the Ruler to this one.
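Roughly like this (the dedicated querier's service name is just a placeholder):

# Dedicated querier for rule evaluation: sees everything except the Ruler
thanos query \
  --endpoint=monitoring-thanos-discovery:10901 \
  --endpoint=monitoring-thanos-storegateway:10901

# The Ruler evaluates against the dedicated querier, so nothing queries
# the Ruler's Store API back, breaking the cycle
thanos rule \
  --query=monitoring-thanos-rule-querier:10902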
Just out of curiosity: if you enabled remote_write on the Ruler, that would stop the Store API on it. Possibly that could help?
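For reference, that would be the stateless Ruler mode, enabled with the remote-write flags. A minimal sketch, assuming Receive's default remote-write port (the file path and service name are placeholders):

thanos rule \
  --remote-write.config-file=/etc/thanos/remote-write.yaml

# /etc/thanos/remote-write.yaml, a standard Prometheus-style remote_write config:
# remote_write:
#   - url: http://monitoring-thanos-receive:19291/api/v1/receive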
It would be great if someone could elaborate on whether the ruler was ever intended to be added as a store on the Querier. If not, I'll close this issue.
Thanos, Prometheus and Golang version used:
Thanos: 0.34.1
Prometheus: v2.51.0
Golang: No idea, running in containers on EKS
Object Storage Provider: S3
What happened: Thanos Querier causes insane network traffic. By insane I mean up to half a gigabyte of network bandwidth PER SECOND. I do not think anything but an infinite recursion loop could explain such amounts of network usage. Thanos Querier is currently the largest cost factor of our EKS environment, costing more than the entire compute infrastructure due to this network bandwidth. This is currently affecting all of the clusters we have the monitoring stack deployed on.
Here's an image depicting the network usage over a span of a few days. The leftmost graph shows the bandwidth of two EKS nodes totaling over 600 MB/s of inbound network bandwidth with a very strange pattern, alongside over 150 MB/s of outbound traffic and 200K packets being sent every second. At the end of the timeline I scaled the querier deployment to 0, indicating that thanos querier is the sole culprit of this insane network bandwidth usage.
In the following configuration extracted from my querier pod definition, you can see that I have stores for thanos sidecar (thanos-discovery), storegateway, receive and ruler.
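The endpoint flags follow this pattern (the receive and ruler service names are approximated here; the other two appear verbatim further down):

--endpoint=monitoring-thanos-discovery:10901
--endpoint=monitoring-thanos-storegateway:10901
--endpoint=monitoring-thanos-receive:10901
--endpoint=monitoring-thanos-ruler:10901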
The rise in network traffic on the graph below happens when I scaled thanos querier up from 0 to 1. One thing to note on the leftmost graph is that there are two instances with high traffic (network in - bytes): one of them is running thanos storegateway and the other is running thanos querier.
Now, if I remove the storegateway from the list of endpoints on the querier, my inbound network traffic is reduced by a lot, and only one EKS node reports high inbound bandwidth, namely the one running the querier. The outbound traffic comes from the same instance where the thanos sidecar pod resides (the prometheus deployment).
Note that I also tried removing all the endpoints except for
--endpoint=monitoring-thanos-discovery:10901
which led to no change from the above graph. This means the traffic in the previous images is being generated solely by
--endpoint=monitoring-thanos-discovery:10901
and
--endpoint=monitoring-thanos-storegateway:10901

Is thanos sidecar producing 80 MB/s of data? That is simply not possible... that would mean I am generating 288 GB of data on the sidecar every hour, which is 6.9 terabytes per day. The instances don't even have the disk capacity to store one hour of data at that rate. And that's before storegateway is even specified as a store on the querier - at which point the inbound network usage could be 43 terabytes per day.
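For reference, the back-of-envelope math behind those numbers (plain shell arithmetic):

# 80 MB/s sustained for a day:
echo "$((80 * 86400 / 1000)) GB per day"      # 6912 GB, about 6.9 TB
# 500 MB/s sustained for a day:
echo "$((500 * 86400 / 1000000)) TB per day"  # about 43 TB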
Where is this traffic coming from? Is this amount of network traffic expected?
What you expected to happen: The network traffic to be a reasonable few megabytes per second.
How to reproduce it (as minimally and precisely as possible): Not sure
Full logs to relevant components:
Sidecar pod logs:
Thanos querier logs:
Anything else we need to know: