thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io

Query: Network bandwidth usage upward of 500MB/s between Querier and configured stores #7417

Open sourcehawk opened 3 weeks ago

sourcehawk commented 3 weeks ago

Thanos, Prometheus and Golang version used: Thanos 0.34.1, Prometheus v2.51.0, Golang: no idea, running in containers on EKS

Object Storage Provider: S3

What happened: Thanos Querier causes insane network traffic. By insane I mean up to half a gigabyte of network bandwidth PER SECOND. I cannot think of anything but an infinite recursion loop that could explain this much network usage. Thanos Querier is currently the largest cost factor of our EKS environment, costing more than the entire compute infrastructure due to this network bandwidth. This is currently affecting every cluster we have the monitoring stack deployed on.

Here's an image depicting the network usage over a span of a few days. The leftmost graph shows the bandwidth of two EKS nodes totaling over 600MB/s of inbound network traffic with a very strange pattern, while at the same time over 150MB/s of outbound traffic and 200K packets are being sent every second. At the end of the timeline I scaled the querier deployment to 0, which shows that Thanos Querier is the sole culprit of this network bandwidth usage.

[image: EKS node network bandwidth over a few days; traffic drops when the querier is scaled to 0]

In the following configuration extracted from my querier pod definition, you can see that I have stores for thanos sidecar (thanos-discovery), storegateway, receive and ruler.

```
Args:
      query
      --log.level=info
      --log.format=logfmt
      --grpc-address=0.0.0.0:10901
      --http-address=0.0.0.0:10902
      --query.replica-label=replica
      --endpoint=monitoring-thanos-discovery:10901
      --endpoint=monitoring-thanos-storegateway:10901
      --endpoint=monitoring-thanos-receive:10901
      --endpoint=thanos-ruler-operated:10901
```

The rise in network traffic on the graph below happens when I scale the thanos querier deployment from 0 back to 1. One thing to note on the leftmost graph below is that there are two instances with high inbound traffic (network in, bytes): one of them is running thanos storegateway and the other is running thanos querier.

[image: network traffic rising after scaling the querier from 0 to 1; two nodes show high inbound traffic]

Now if I remove the storegateway from the list of endpoints on the querier, my inbound network traffic drops by a lot, and only one EKS node reports high inbound bandwidth, namely the one running the querier. The outbound traffic comes from the instance where the thanos sidecar pod resides (the prometheus deployment).

[image: network traffic with the storegateway endpoint removed; only the querier node shows high inbound bandwidth]

Note that I also tried removing all the endpoints except --endpoint=monitoring-thanos-discovery:10901, which led to no change from the graph above. This means the traffic in the previous images is generated solely by --endpoint=monitoring-thanos-discovery:10901 and --endpoint=monitoring-thanos-storegateway:10901.

Is the thanos sidecar really producing 80MB/s of data? That is simply not possible... it would mean the sidecar generates 288GB of data every hour, which is 6.9 terabytes per day. The instances don't even have the disk capacity to store one hour of data at that rate. And that is without even counting the bandwidth used when storegateway is specified as a store on the querier, at which point the inbound network usage could reach 43 terabytes per day.
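For reference, the back-of-the-envelope arithmetic behind those numbers (using the roughly 80MB/s and 500MB/s rates from the graphs):

```
 80 MB/s ×  3,600 s/hour ≈ 288 GB/hour
 80 MB/s × 86,400 s/day  ≈ 6.9 TB/day
500 MB/s × 86,400 s/day  ≈ 43.2 TB/day
```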

Where is this traffic coming from? Is this amount of network traffic expected?

What you expected to happen: The network traffic to be a reasonable few megabytes per second.

How to reproduce it (as minimally and precisely as possible): Not sure

Full logs to relevant components:

Sidecar pod logs:

```
[me@work:~/Documents/work/devops/ci/setup]$ kubectl logs prometheus-monitoring-prometheus-0 -c thanos-sidecar -n monitoring
ts=2024-06-05T14:51:47.754630201Z caller=options.go:26 level=info protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
ts=2024-06-05T14:51:47.757884479Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-06-05T14:51:47.761733659Z caller=sidecar.go:383 level=info msg="starting sidecar"
ts=2024-06-05T14:51:47.762002683Z caller=intrumentation.go:75 level=info msg="changing probe status" status=healthy
ts=2024-06-05T14:51:47.762034728Z caller=http.go:73 level=info service=http/server component=sidecar msg="listening for requests and metrics" address=:10902
ts=2024-06-05T14:51:47.762976572Z caller=tls_config.go:274 level=info service=http/server component=sidecar msg="Listening on" address=[::]:10902
ts=2024-06-05T14:51:47.765306177Z caller=tls_config.go:277 level=info service=http/server component=sidecar msg="TLS is disabled." http2=false address=[::]:10902
ts=2024-06-05T14:51:47.765901598Z caller=reloader.go:238 level=info component=reloader msg="nothing to be watched"
ts=2024-06-05T14:51:47.766609856Z caller=intrumentation.go:56 level=info msg="changing probe status" status=ready
ts=2024-06-05T14:51:47.766776773Z caller=grpc.go:131 level=info service=gRPC/server component=sidecar msg="listening for serving gRPC" address=:10901
ts=2024-06-05T14:51:47.771849892Z caller=sidecar.go:195 level=info msg="successfully loaded prometheus version"
ts=2024-06-05T14:51:47.831208649Z caller=sidecar.go:217 level=info msg="successfully loaded prometheus external labels" external_labels="{cluster=\"staging\", prometheus=\"monitoring/monitoring-prometheus\", prometheus_replica=\"prometheus-monitoring-prometheus-0\", region=\"eu-west-1\", stage=\"staging\"}"
ts=2024-06-05T14:51:49.767200216Z caller=shipper.go:263 level=warn msg="reading meta file failed, will override it" err="failed to read /prometheus/thanos.shipper.json: open /prometheus/thanos.shipper.json: no such file or directory"
ts=2024-06-05T17:51:49.951882756Z caller=shipper.go:361 level=info msg="upload new block" id=01HZMRE5DMF0X8Z2RP537KP904
```

Thanos querier logs:

```
kubectl logs monitoring-thanos-query-5c9b7d4d56-qslhc -n monitoring
ts=2024-06-05T17:49:22.797512193Z caller=options.go:26 level=info protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
ts=2024-06-05T17:49:22.798966265Z caller=query.go:813 level=info msg="starting query node"
ts=2024-06-05T17:49:22.799653428Z caller=intrumentation.go:75 level=info msg="changing probe status" status=healthy
ts=2024-06-05T17:49:22.799682866Z caller=http.go:73 level=info service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:10902
ts=2024-06-05T17:49:22.800013879Z caller=tls_config.go:274 level=info service=http/server component=query msg="Listening on" address=[::]:10902
ts=2024-06-05T17:49:22.8000369Z caller=tls_config.go:277 level=info service=http/server component=query msg="TLS is disabled." http2=false address=[::]:10902
ts=2024-06-05T17:49:22.800112333Z caller=intrumentation.go:56 level=info msg="changing probe status" status=ready
ts=2024-06-05T17:49:22.800210663Z caller=grpc.go:131 level=info service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
ts=2024-06-05T17:49:27.808142026Z caller=endpointset.go:425 level=info component=endpointset msg="adding new sidecar with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI]" address=monitoring-thanos-discovery:10901 extLset="{cluster=\"staging\", prometheus=\"monitoring/monitoring-prometheus\", prometheus_replica=\"prometheus-monitoring-prometheus-0\", region=\"eu-west-1\", stage=\"staging\"}"
```

Anything else we need to know:

MichaHoffmann commented 3 weeks ago

How many blocks do you have? Can you bump to 0.35.0?

sourcehawk commented 3 weeks ago

> How many blocks do you have? Can you bump to 0.35.0?

What do you mean by how many blocks?

Bumped all Thanos components to 0.35.1 and redeployed with all the stores set. The red line spiking in the graphs comes from the EKS node that runs storegateway:

[image: network traffic after upgrading to 0.35.1; the spiking red line is the node running storegateway]

MichaHoffmann commented 3 weeks ago

I guess it's fetching block metadata on startup (the gateway). Does it stabilize eventually, and is the querier less noisy?

sourcehawk commented 3 weeks ago

It stays that high indefinitely. The fact that it "stabilizes" is not really a good thing when it stays at 500+MB/s bandwidth 😅

MichaHoffmann commented 3 weeks ago

Storage gw is also 0.35.0 right?

sourcehawk commented 3 weeks ago

> Storage gw is also 0.35.0 right?

Yeah all thanos components are now 0.35.1

MichaHoffmann commented 3 weeks ago

How many blocks do you have in object storage roughly? Is your compactor working well?

sourcehawk commented 3 weeks ago

The compactor seems to be doing its job quite well, and there are fewer than 800 objects reported, totaling a little under 30GB of data.

[image: object storage overview showing fewer than 800 objects, just under 30GB in total]

douglascamata commented 3 weeks ago

@sourcehawk traffic between querier and store gateway is triggered by incoming queries. We can't say 500 MB/s is unnatural unless we know a bunch of things, like:

sourcehawk commented 3 weeks ago

After much debugging, we've come to realize the traffic is probably being generated by an infinite call loop between the thanos ruler and the querier when the ruler is added as a store on the querier. My best guess is that the ruler queries the querier while the querier also queries the ruler, causing an endless stream of calls to both the sidecar and the store gateway.

[image: network traffic while the ruler is configured as a store on the querier]
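To illustrate the topology as I understand it (arrows are query paths; this is my hypothesis of the loop, not something confirmed):

```
Ruler ──(rule evaluation queries)──▶ Querier
  ▲                                     │
  └───────(Store API fan-out)───────────┘
     (the querier also fans out to the sidecar, storegateway and receive)
```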

This is the network traffic after removing the thanos ruler endpoint; as can be seen, traffic of all types drops almost instantaneously to near zero.

[image: network traffic dropping to near zero after removing the ruler endpoint from the querier]

douglascamata commented 3 weeks ago

Interesting. You can deploy a separate querier that queries almost everything except the Ruler, and point the Ruler at that one.
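Roughly like the sketch below, reusing the endpoint names from the configuration posted above (the monitoring-thanos-query-rule service name is made up for illustration; adapt it to your setup):

```
# Main querier: keeps fanning out to everything, including the Ruler
query
  --endpoint=monitoring-thanos-discovery:10901
  --endpoint=monitoring-thanos-storegateway:10901
  --endpoint=monitoring-thanos-receive:10901
  --endpoint=thanos-ruler-operated:10901

# Dedicated "rule querier" (hypothetical monitoring-thanos-query-rule service):
# same stores, but without the Ruler endpoint, so rule evaluation cannot loop
# back through the Ruler's own Store API
query
  --endpoint=monitoring-thanos-discovery:10901
  --endpoint=monitoring-thanos-storegateway:10901
  --endpoint=monitoring-thanos-receive:10901

# Ruler: evaluate rules against the dedicated querier's HTTP API instead
rule
  --query=monitoring-thanos-query-rule:10902
```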

evilr00t commented 3 weeks ago

Just out of curiosity - if you enabled remote_write on the Ruler, that would stop the Store API on it - possibly that could help?
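A rough sketch of what that could look like (stateless ruler mode; the monitoring-thanos-query service name is inferred from the pod name above, and the Receive remote-write port 19291 and file path are assumptions to check against your deployment):

```
# Ruler in stateless mode: evaluated series are shipped via remote write
# instead of being kept in a local TSDB served over the Store API
rule
  --query=monitoring-thanos-query:10902
  --remote-write.config-file=/etc/thanos/rule-remote-write.yaml

# /etc/thanos/rule-remote-write.yaml (assumed Receive remote-write endpoint)
remote_write:
  - url: http://monitoring-thanos-receive:19291/api/v1/receive
```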

sourcehawk commented 2 days ago

It would be great if someone could elaborate on whether the ruler was ever intended to be added as a store on the Querier. If not, I'll close this issue.