thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io

Query: Network bandwidth usage upward of 500MB/s between Querier and configured stores #7417

Open sourcehawk opened 3 weeks ago

sourcehawk commented 3 weeks ago

Thanos, Prometheus and Golang version used: Thanos 0.34.1, Prometheus v2.51.0, Golang: no idea, running in containers on EKS

Object Storage Provider: S3

What happened: Thanos Querier causes insane network traffic. By insane I mean up to half a gigabyte of network bandwidth PER SECOND. I cannot think of anything but an infinite recursion loop that could explain this much network usage. Thanos Querier is currently the largest cost factor of our EKS environment, costing more than the entire compute infrastructure due to this network bandwidth. This is currently affecting every cluster we have the monitoring stack deployed on.

Here's an image depicting the network usage over a span of a few days. The leftmost graph shows the bandwidth of two EKS nodes totaling over 600MB/s of inbound network traffic with a very strange pattern, while at the same time over 150MB/s of outbound traffic and 200K packets are being sent every second. At the end of the timeline I scaled the querier deployment to 0, which shows that Thanos Querier is the sole culprit of this network bandwidth usage.

[image: EKS node network bandwidth over a few days; traffic drops when the querier is scaled to 0]

In the following configuration extracted from my querier pod definition, you can see that I have stores for thanos sidecar (thanos-discovery), storegateway, receive and ruler.

```
Args:
      query
      --log.level=info
      --log.format=logfmt
      --grpc-address=0.0.0.0:10901
      --http-address=0.0.0.0:10902
      --query.replica-label=replica
      --endpoint=monitoring-thanos-discovery:10901
      --endpoint=monitoring-thanos-storegateway:10901
      --endpoint=monitoring-thanos-receive:10901
      --endpoint=thanos-ruler-operated:10901
```

The rise in network traffic on the graph below happens when I scale the thanos querier deployment from 0 back to 1. One thing to note on the leftmost graph below is that there are two instances with high inbound traffic (network in, bytes): one of them is running thanos storegateway and the other is running thanos querier.

[image: network traffic rising after scaling the querier from 0 to 1; two nodes show high inbound traffic]

Now if I remove the storegateway from the list of endpoints on the querier, my inbound network traffic drops by a lot, and only one EKS node reports high inbound bandwidth, namely the one running the querier. The outbound traffic comes from the instance where the thanos sidecar pod resides (the prometheus deployment).

[image: network traffic with the storegateway endpoint removed; only the querier node shows high inbound bandwidth]

Note that I also tried removing all the endpoints except --endpoint=monitoring-thanos-discovery:10901, which led to no change from the graph above. This means the traffic in the previous images is generated solely by --endpoint=monitoring-thanos-discovery:10901 and --endpoint=monitoring-thanos-storegateway:10901.

Is the thanos sidecar really producing 80MB/s of data? That is simply not possible... it would mean the sidecar generates 288GB of data every hour, which is 6.9 terabytes per day. The instances don't even have the disk capacity to store one hour of data at that rate. And that is without even counting the bandwidth used when storegateway is specified as a store on the querier, at which point the inbound network usage could reach 43 terabytes per day.
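For reference, the back-of-the-envelope arithmetic behind those numbers (using the roughly 80MB/s and 500MB/s rates from the graphs):

```
 80 MB/s ×  3,600 s/hour ≈ 288 GB/hour
 80 MB/s × 86,400 s/day  ≈ 6.9 TB/day
500 MB/s × 86,400 s/day  ≈ 43.2 TB/day
```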

Where is this traffic coming from? Is this amount of network traffic expected?

What you expected to happen: The network traffic to be a reasonable few megabytes per second.

How to reproduce it (as minimally and precisely as possible): Not sure

Full logs to relevant components:

Sidecar pod logs:

```
[me@work:~/Documents/work/devops/ci/setup]$ kubectl logs prometheus-monitoring-prometheus-0 -c thanos-sidecar -n monitoring
ts=2024-06-05T14:51:47.754630201Z caller=options.go:26 level=info protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
ts=2024-06-05T14:51:47.757884479Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-06-05T14:51:47.761733659Z caller=sidecar.go:383 level=info msg="starting sidecar"
ts=2024-06-05T14:51:47.762002683Z caller=intrumentation.go:75 level=info msg="changing probe status" status=healthy
ts=2024-06-05T14:51:47.762034728Z caller=http.go:73 level=info service=http/server component=sidecar msg="listening for requests and metrics" address=:10902
ts=2024-06-05T14:51:47.762976572Z caller=tls_config.go:274 level=info service=http/server component=sidecar msg="Listening on" address=[::]:10902
ts=2024-06-05T14:51:47.765306177Z caller=tls_config.go:277 level=info service=http/server component=sidecar msg="TLS is disabled." http2=false address=[::]:10902
ts=2024-06-05T14:51:47.765901598Z caller=reloader.go:238 level=info component=reloader msg="nothing to be watched"
ts=2024-06-05T14:51:47.766609856Z caller=intrumentation.go:56 level=info msg="changing probe status" status=ready
ts=2024-06-05T14:51:47.766776773Z caller=grpc.go:131 level=info service=gRPC/server component=sidecar msg="listening for serving gRPC" address=:10901
ts=2024-06-05T14:51:47.771849892Z caller=sidecar.go:195 level=info msg="successfully loaded prometheus version"
ts=2024-06-05T14:51:47.831208649Z caller=sidecar.go:217 level=info msg="successfully loaded prometheus external labels" external_labels="{cluster=\"staging\", prometheus=\"monitoring/monitoring-prometheus\", prometheus_replica=\"prometheus-monitoring-prometheus-0\", region=\"eu-west-1\", stage=\"staging\"}"
ts=2024-06-05T14:51:49.767200216Z caller=shipper.go:263 level=warn msg="reading meta file failed, will override it" err="failed to read /prometheus/thanos.shipper.json: open /prometheus/thanos.shipper.json: no such file or directory"
ts=2024-06-05T17:51:49.951882756Z caller=shipper.go:361 level=info msg="upload new block" id=01HZMRE5DMF0X8Z2RP537KP904
```

Thanos querier logs:

```
kubectl logs monitoring-thanos-query-5c9b7d4d56-qslhc -n monitoring
ts=2024-06-05T17:49:22.797512193Z caller=options.go:26 level=info protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
ts=2024-06-05T17:49:22.798966265Z caller=query.go:813 level=info msg="starting query node"
ts=2024-06-05T17:49:22.799653428Z caller=intrumentation.go:75 level=info msg="changing probe status" status=healthy
ts=2024-06-05T17:49:22.799682866Z caller=http.go:73 level=info service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:10902
ts=2024-06-05T17:49:22.800013879Z caller=tls_config.go:274 level=info service=http/server component=query msg="Listening on" address=[::]:10902
ts=2024-06-05T17:49:22.8000369Z caller=tls_config.go:277 level=info service=http/server component=query msg="TLS is disabled." http2=false address=[::]:10902
ts=2024-06-05T17:49:22.800112333Z caller=intrumentation.go:56 level=info msg="changing probe status" status=ready
ts=2024-06-05T17:49:22.800210663Z caller=grpc.go:131 level=info service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
ts=2024-06-05T17:49:27.808142026Z caller=endpointset.go:425 level=info component=endpointset msg="adding new sidecar with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI]" address=monitoring-thanos-discovery:10901 extLset="{cluster=\"staging\", prometheus=\"monitoring/monitoring-prometheus\", prometheus_replica=\"prometheus-monitoring-prometheus-0\", region=\"eu-west-1\", stage=\"staging\"}"
```

Anything else we need to know:

MichaHoffmann commented 3 weeks ago

How many blocks do you have? Can you bump to 0.35.0?

sourcehawk commented 3 weeks ago

> How many blocks do you have? Can you bump to 0.35.0?

What do you mean by how many blocks?

Bumped all Thanos components to 0.35.1 and redeployed with all the stores set. The red line spiking in the graphs comes from the EKS node that runs storegateway:

[image: network traffic after upgrading to 0.35.1; the spiking red line is the node running storegateway]

MichaHoffmann commented 3 weeks ago

I guess it's fetching block metadata on startup (the gateway). Does it stabilize eventually, and is the querier less noisy?

sourcehawk commented 3 weeks ago

It stays that high indefinitely. The fact that it "stabilizes" is not really a good thing when it stays at 500+MB/s bandwidth 😅

MichaHoffmann commented 3 weeks ago

Storage gw is also 0.35.0 right?

sourcehawk commented 3 weeks ago

> Storage gw is also 0.35.0 right?

Yeah all thanos components are now 0.35.1

MichaHoffmann commented 3 weeks ago

How many blocks do you have in object storage roughly? Is your compactor working well?

sourcehawk commented 3 weeks ago

The compactor seems to be doing its job quite well, and there are fewer than 800 objects reported, totaling a little under 30GB of data.

[image: object storage overview showing fewer than 800 objects, just under 30GB in total]

douglascamata commented 3 weeks ago

@sourcehawk traffic between querier and store gateway is triggered by incoming queries. We can't say 500 MB/s is unnatural unless we know a bunch of things, like:

sourcehawk commented 3 weeks ago

After much debugging, we've come to realize the traffic is probably being generated by an infinite call loop between the thanos ruler and the querier when the ruler is added as a store on the querier. My best guess is that the ruler queries the querier while the querier also queries the ruler, causing an endless stream of calls to both the sidecar and the store gateway.

[image: network traffic while the ruler is configured as a store on the querier]
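To illustrate the topology as I understand it (arrows are query paths; this is my hypothesis of the loop, not something confirmed):

```
Ruler ──(rule evaluation queries)──▶ Querier
  ▲                                     │
  └───────(Store API fan-out)───────────┘
     (the querier also fans out to the sidecar, storegateway and receive)
```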

This is the network traffic after removing the thanos ruler endpoint; as can be seen, traffic of all types drops almost instantaneously to near zero.

[image: network traffic dropping to near zero after removing the ruler endpoint from the querier]

douglascamata commented 3 weeks ago

Interesting. You can deploy a separate querier that queries almost everything except the Ruler, and point the Ruler at that one.
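Roughly like the sketch below, reusing the endpoint names from the configuration posted above (the monitoring-thanos-query-rule service name is made up for illustration; adapt it to your setup):

```
# Main querier: keeps fanning out to everything, including the Ruler
query
  --endpoint=monitoring-thanos-discovery:10901
  --endpoint=monitoring-thanos-storegateway:10901
  --endpoint=monitoring-thanos-receive:10901
  --endpoint=thanos-ruler-operated:10901

# Dedicated "rule querier" (hypothetical monitoring-thanos-query-rule service):
# same stores, but without the Ruler endpoint, so rule evaluation cannot loop
# back through the Ruler's own Store API
query
  --endpoint=monitoring-thanos-discovery:10901
  --endpoint=monitoring-thanos-storegateway:10901
  --endpoint=monitoring-thanos-receive:10901

# Ruler: evaluate rules against the dedicated querier's HTTP API instead
rule
  --query=monitoring-thanos-query-rule:10902
```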

evilr00t commented 3 weeks ago

Just out of curiosity - if you enabled remote_write on the Ruler, that would stop the Store API on it - possibly that could help?
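A rough sketch of what that could look like (stateless ruler mode; the monitoring-thanos-query service name is inferred from the pod name above, and the Receive remote-write port 19291 and file path are assumptions to check against your deployment):

```
# Ruler in stateless mode: evaluated series are shipped via remote write
# instead of being kept in a local TSDB served over the Store API
rule
  --query=monitoring-thanos-query:10902
  --remote-write.config-file=/etc/thanos/rule-remote-write.yaml

# /etc/thanos/rule-remote-write.yaml (assumed Receive remote-write endpoint)
remote_write:
  - url: http://monitoring-thanos-receive:19291/api/v1/receive
```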

sourcehawk commented 2 days ago

It would be great if someone could elaborate on whether the ruler was ever intended to be added as a store on the Querier. If not, I'll close this issue.