Thanos, Prometheus and Golang version used:
Thanos: v0.35.1
Prometheus: v2.50.1

Object Storage Provider: AWS S3 Object Storage

What happened:
We are running Thanos Receive split into distributor (routing) and ingestor (ingesting) pods, with remote write from Prometheus. We have enabled HPA for both, and we use the thanos-receive-controller to update the hashring dynamically as the ingestors scale.
HPA works fine for the ingestor StatefulSet pods, but the distributor pods are not scaling correctly: monitoring shows resource usage on only one pod, and http_requests_total is likewise reported only for that active pod. When the load increases, that one distributor pod starts crashing with OOM, and Prometheus logs "context deadline exceeded" errors for remote write.
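The HPA definitions themselves are not reproduced here; the distributor one is shaped roughly like the sketch below (the Deployment target kind, names, namespace, and threshold are placeholders, not our exact values):

```yaml
# Hypothetical sketch of the distributor HPA; the scale target kind,
# names, namespace, and utilization threshold are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: thanos-receive-distributor
  namespace: monitoring
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: thanos-receive-distributor
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```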
Our receiver (ingestor) config:
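The exact flags are elided here; a representative sketch of how the ingestors are started (service and namespace names, retention, and paths are placeholders):

```yaml
# Hypothetical sketch of the ingestor container args; the service/namespace
# names, retention, and paths are assumptions. POD_NAME is assumed to be
# injected via the downward API.
args:
  - receive
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902
  - --remote-write.address=0.0.0.0:19291
  - --tsdb.path=/var/thanos/receive
  - --tsdb.retention=1d
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --label=receive_replica="$(POD_NAME)"
  - --receive.local-endpoint=$(POD_NAME).thanos-receive-ingestor.monitoring.svc.cluster.local:10901
```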
Receive distributor (router) config:
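Again a representative sketch rather than the exact manifest; the distributor runs as a router only because no --receive.local-endpoint is set (the hashring path and replication factor are placeholders):

```yaml
# Hypothetical sketch of the distributor (router) container args; the pod
# routes rather than ingests because --receive.local-endpoint is omitted.
# The hashring file path and replication factor are assumptions.
args:
  - receive
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902
  - --remote-write.address=0.0.0.0:19291
  - --receive.hashrings-file=/etc/thanos/hashring.json
  - --receive.hashrings-algorithm=ketama
  - --receive.replication-factor=2
```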
Hashring with 10 receivers, generated by the thanos-receive-controller under auto-scaling:
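The generated file follows the controller's standard format; the sketch below shows its shape with placeholder endpoint names (entries 2 through 8 elided):

```json
[
  {
    "hashring": "default",
    "endpoints": [
      "thanos-receive-ingestor-0.thanos-receive-ingestor.monitoring.svc.cluster.local:10901",
      "thanos-receive-ingestor-1.thanos-receive-ingestor.monitoring.svc.cluster.local:10901",
      "thanos-receive-ingestor-9.thanos-receive-ingestor.monitoring.svc.cluster.local:10901"
    ]
  }
]
```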
Prometheus Operator RemoteWrite config:
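The remoteWrite section of our Prometheus CR has roughly this shape (the Service URL and the queueConfig numbers below are placeholders, not our tuned values):

```yaml
# Hypothetical sketch of the Prometheus Operator remote write config; the
# Service URL and queue parameters are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  remoteWrite:
    - url: http://thanos-receive-distributor.monitoring.svc.cluster.local:19291/api/v1/receive
      queueConfig:
        capacity: 10000
        maxShards: 50
        maxSamplesPerSend: 2000
```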
Thanos receive distributor service definition (sessionAffinity: None):
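Only sessionAffinity: None is quoted from our actual definition; the rest of this sketch (names, selector, ports) is a placeholder reconstruction:

```yaml
# Hypothetical sketch of the distributor Service; sessionAffinity: None is
# the one setting taken from the report, everything else is assumed.
apiVersion: v1
kind: Service
metadata:
  name: thanos-receive-distributor
  namespace: monitoring
spec:
  type: ClusterIP
  sessionAffinity: None
  selector:
    app.kubernetes.io/name: thanos-receive-distributor
  ports:
    - name: remote-write
      port: 19291
      targetPort: 19291
    - name: grpc
      port: 10901
      targetPort: 10901
```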
Only one distributor pod shows usage in `kubectl top pods`.
Is anything wrong in the configuration on the Thanos or Prometheus side?
What you expected to happen:
Traffic should be distributed across all Thanos distributor pods, and the pods should auto-scale when resource usage crosses the HPA threshold.
How to reproduce it (as minimally and precisely as possible):
Deploy the receiver in split router/ingestor mode with HPA enabled, and monitor resource usage and request distribution across the receive distributor pods.