thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Sidecar is significantly slower than the underlying Prometheus queries #6930

Open jon-rei opened 10 months ago

jon-rei commented 10 months ago

Thanos, Prometheus and Golang version used: Thanos: v0.32.5, Prometheus: v2.45.0

What happened: The Thanos sidecar is significantly slower than the underlying Prometheus query when it is queried by the Thanos querier. We see query times of up to 2 minutes on the sidecar, while the actual Prometheus query takes only a few seconds. In the end, this makes the whole Thanos setup very slow.

What you expected to happen: That the Thanos sidecar's response time wouldn't be so different from Prometheus's.

How to reproduce it (as minimally and precisely as possible): This could be very environment-dependent. We are trying to query a metric (container_network_receive_bytes_total) with ~26k series and ~6 million samples.

Anything else we need to know: The Thanos sidecar pushes metrics to our S3 bucket every 2 hours and we use the Querier to query the sidecar. We also use the Thanos query engine.

We set the following resources for the sidecar, but in reality the sidecar only uses a fraction of them and is never throttled.

resources:
  limits:
    cpu: 3
    memory: 4Gi
  requests:
    cpu: 1
    memory: 512Mi

Traces:

[Trace screenshots attached: thanos-sidecar-slow, thanos-sidecar-slow-2]

I've found several other issues (#4304, #631), which were unfortunately closed without a helpful resolution.

MichaHoffmann commented 10 months ago

This might very well be because we now force the store response to be sorted, and the only way we can guarantee that is by re-sorting, which forces us to buffer the whole result set.

Is the Prometheus that's associated with the sidecar using external labels? Because if not, we could probably skip the buffering, if I'm not mistaken.
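
Roughly, the problem looks like this (a toy Go sketch, not the actual sidecar code; all label names and values here are made up): merging the same external label into every series can still flip the relative order of label sets that Prometheus returned in sorted order.

// Toy illustration of why injecting external labels can break the ordering
// of an already-sorted Prometheus response, so the sidecar has to buffer and
// re-sort the whole result set before it can return a sorted Store API stream.
package main

import (
	"fmt"
	"sort"
)

type label struct{ name, value string }

// labelSet is kept sorted by label name, mirroring how Prometheus
// represents a series' labels.
type labelSet []label

// compare orders two label sets pair by pair; a shorter prefix sorts first.
func compare(a, b labelSet) int {
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i].name != b[i].name {
			if a[i].name < b[i].name {
				return -1
			}
			return 1
		}
		if a[i].value != b[i].value {
			if a[i].value < b[i].value {
				return -1
			}
			return 1
		}
	}
	return len(a) - len(b)
}

// withExternal merges an external label into a label set, keeping the result
// sorted by name; an existing label with the same name wins.
func withExternal(ls labelSet, ext label) labelSet {
	for _, l := range ls {
		if l.name == ext.name {
			return ls
		}
	}
	out := append(labelSet{}, ls...)
	out = append(out, ext)
	sort.Slice(out, func(i, j int) bool { return out[i].name < out[j].name })
	return out
}

func main() {
	// Two series exactly as Prometheus returns them: already sorted.
	a := labelSet{{"__name__", "up"}, {"instance", "a"}}
	b := labelSet{{"__name__", "up"}, {"instance", "a"}, {"pod", "p-0"}}
	fmt.Println("sorted before external labels:", compare(a, b) < 0) // true

	// The sidecar merges in an external label, e.g. replica="0".
	ext := label{"replica", "0"}
	a, b = withExternal(a, ext), withExternal(b, ext)

	// "pod" sorts before "replica", so b now sorts before a: the response is
	// no longer ordered and has to be buffered and re-sorted as a whole.
	fmt.Println("sorted after external labels:", compare(a, b) < 0) // false
}

So as long as external labels are in play, the sidecar cannot assume the Prometheus ordering survives and has to hold the full result in memory before replying.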

Just to make sure: if you use the Prometheus engine in the Thanos querier, does it also take that long?

jon-rei commented 10 months ago

Our Prometheus instances all use 4 external labels.

I've just switched back to the Prometheus engine and I no longer see these drastic speed differences. For example, the underlying Prometheus query takes 1 second and the sidecar takes ~2 seconds to respond. I will try again later in the day when we have different loads on the Prometheus and the sidecars.

MichaHoffmann commented 10 months ago

Oh! Can you share the query or an anonymized version of it please? It sure sounds like the issue is in the engine then.

jon-rei commented 10 months ago

Sure, the query I'm using is:

sum(irate(container_network_receive_bytes_total{datacenter=~"xxx"}[1m])) by (namespace) > 1000000

Is there a difference between selecting the Thanos engine via a querier argument and selecting it in the query frontend? I just tested my queries again with the Thanos engine selected through the frontend and I couldn't reproduce the very long query times anymore. But the sidecar took a bit longer with the Thanos engine than with the Prometheus engine.

MichaHoffmann commented 10 months ago

It shouldn't, but the query (if it's a range query) might be affected by query-frontend caching, so for the sake of reproducibility it would be good to use the Querier directly instead of the query frontend here, I think! Can you repeat the test using just the query service?
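
For example, something along these lines queries the Querier's Prometheus-compatible HTTP API directly, so the query frontend and its response cache are not in the path (the querier address, port and time range below are placeholders to adjust):

// Minimal sketch: send a range query straight to the Thanos Querier,
// bypassing the query frontend and its cache.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	base := "http://thanos-query:9090" // hypothetical querier address
	now := time.Now()

	params := url.Values{}
	params.Set("query", `sum(irate(container_network_receive_bytes_total{datacenter=~"xxx"}[1m])) by (namespace) > 1000000`)
	params.Set("start", fmt.Sprintf("%d", now.Add(-1*time.Hour).Unix()))
	params.Set("end", fmt.Sprintf("%d", now.Unix()))
	params.Set("step", "30")

	resp, err := http.Get(base + "/api/v1/query_range?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}

Running the same range query twice this way should give comparable timings, since nothing is served from the frontend cache.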

jon-rei commented 10 months ago

I've now tested the query directly against the query service, and in my tests the Thanos engine was always faster than the standard Prometheus engine. The traces also show that the sidecar was not the problem here: with both engines the sidecars were equally fast, as mentioned above, ~1 second for the Prometheus query and 2-3 seconds in total for the whole sidecar.

What I'm a little bit worried about is that I couldn't reproduce any very long loading times today.

mohanisch-sixt commented 2 weeks ago

Is there any update here? We can also observe this behaviour.

schnitzel4 commented 3 days ago

We can also observe this behaviour. It slows down our environment.