
help needed: how to debug super slow Thanos (and Prometheus) #6968

Open calestyo opened 10 months ago

calestyo commented 10 months ago

Hey there.

A while ago I set up Prometheus at the university to monitor a local compute cluster. We want both very detailed short-term data (10s resolution for at least about the last week) and long-term data (e.g. for years, though in principle a low resolution would be enough for that).
For starters, I wanted to visualise node exporter’s data, which I did via the Full Node Exporter Grafana dashboard.

That was terribly slow. I mean, as soon as I plotted any range larger than a few days ... it took ages to plot, especially e.g. the "CPU Basic" panel.
And it was completely unusable when I plotted many "CPU Basic" panels for multiple instances.

I was told that this is because the data gets processed at 10s resolution, which is of course a lot.
Further, Thanos (specifically the compactor) was suggested as a solution.

I have now roughly the following setup:

Thanos configurations are:

sidecar --log.level=warn --tsdb.path=/var/lib/prometheus/metrics2 --hash-func=SHA256 --objstore.config-file=/etc/thanos/data-thanos.yml --grpc-address=:10901 --http-address=:10902
receive --log.level=warn --grpc-address=:10907 --hash-func=SHA256 --objstore.config-file=/etc/thanos/data-thanos.yml --receive.tenant-label-name=_prometheus_group --receive.default-tenant-id=etp.physik.lmu.de --label=_prometheus_replica=0 --remote-write.address=:10908 --http-address=:10909 --tsdb.path=/data/main/thanos/receive/tsdb --tsdb.retention=2d
query --log.level=warn --grpc-address=localhost:10903 --http-address=:10904 --web.external-prefix=/thanos/query --web.route-prefix=/ --query.replica-label=_prometheus_replica --endpoint=lcg-lrz-prometheus.grid.lrz.de:10901 --endpoint=lcg-lrz-monitor-ext.grid.lrz.de:10907 --endpoint=lcg-lrz-monitor-ext.grid.lrz.de:10905
store --log.level=warn --grpc-address=:10905 --data-dir=/data/main/thanos/store/cache --objstore.config-file=/etc/thanos/data-thanos.yml --consistency-delay=1h --max-time=-2d --http-address=:10906 --web.external-prefix=/thanos/store
compact --log.level=warn --wait --data-dir=/data/main/thanos/compact/cache --objstore.config-file=/etc/thanos/data-thanos.yml --consistency-delay=1h --hash-func=SHA256 --http-address=:10912 --web.external-prefix=/thanos/compact --web.route-prefix=/
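
(Side note: if I understand the docs correctly, downsampling is what should eventually give the low-resolution long-term data; it is on by default for compact unless --downsampling.disable is set, and the retention per resolution can be set explicitly. A rough sketch with purely illustrative values, not a recommendation:)

# example values only: keep raw 10s data for 30 days, 5m-downsampled data for
# one year, and 1h-downsampled data forever (0d means no limit)
compact --wait --data-dir=/data/main/thanos/compact/cache \
  --objstore.config-file=/etc/thanos/data-thanos.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=1y \
  --retention.resolution-1h=0d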

In principle it all looks good, compaction happens. I see both label sets.

However, with Thanos it seems even slower than with Prometheus alone. Grafana nearly always runs into a timeout when I do e.g. "the last 30 days" (even though, as of now, I only have data for a bit over a week).

From the query UI it's not much better. I just queried node_cpu_seconds_total{instance="someSingleNode"}.... over 2 weeks, with deduplication and no resolution... which took 44446ms...
That yields 512 series. That's a showstopper when one wants to visualise e.g. the CPU utilisation of a whole cluster with n nodes in a dashboard.

Now I'm having a hard time debugging why it is so slow. The main VM, where query and store run, should be quite decent (32GB RAM, 16 (logical) CPU cores).
The network, especially to the sshfs mountpoint, should be fine as well.

I made a test where I first dropped all fs caches on the node where the data actually lies, and then, on the query node, read all data on the sshfs mountpoint via:

/usr/bin/time --verbose find /data/thanos/ -type f -exec cat {} \; >/dev/null

That gave ~165 MiB/s of IO (via the sshfs)... a bit over 7 minutes for 73 GiB.
What might however be bad is IO to the local (virtual) disk on the VMs... but I thought that wouldn't really be used that much for query (but only for receive)?
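
(If the local virtual disk is the suspect, a crude way to get a rough throughput number for it would be something like the following; the path is just an example and it writes a 2 GiB scratch file:)

# rough sequential write and read test on the local (virtual) disk, bypassing the page cache
dd if=/dev/zero of=/data/main/thanos/ddtest bs=1M count=2048 oflag=direct status=progress
dd if=/data/main/thanos/ddtest of=/dev/null bs=1M iflag=direct status=progress
rm /data/main/thanos/ddtest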

When repeating the query from above (the one with 512 series) in the query UI, it starts at 55s, then takes 38s, then 30s... but that's as good as it gets.
The data points that I can hover over in the UI are already pretty low-res; between two points there's about 1h20min - despite "Only raw data" being selected... what?

The --log.level=debug entry for such a query looks like:

Dec 08 03:16:07 lcg-lrz-monitor-ext thanos@query[966008]: ts=2023-12-08T02:16:06.999240504Z caller=proxy.go:364 level=debug component=proxy request="min_time:1700791866834 max_time:1702001766834 matchers:<name:\"instance\" value:\"lcg-lrz-dcache0.grid.lrz.de\" > matchers:<name:\"__name__\" value:\"node_cpu_seconds_total\" > aggregates:COUNT aggregates:SUM partial_response_strategy:ABORT without_replica_labels:\"_prometheus_replica\" " msg="Series: started fanout streams" status="store Addr: lcg-lrz-monitor-ext.grid.lrz.de:10905 LabelSets: {_prometheus_group=\"etp.physik.lmu.de\", _prometheus_replica=\"0\"},{_prometheus_group=\"etp.physik.lmu.de\", _prometheus_replica=\"1\"} MinTime: 1701206583857 MaxTime: 1701828962054 queried;store Addr: lcg-lrz-prometheus.grid.lrz.de:10901 LabelSets: {_prometheus_group=\"etp.physik.lmu.de\", _prometheus_replica=\"1\"} MinTime: 1701206583857 MaxTime: 9223372036854775807 queried;store Addr: lcg-lrz-monitor-ext.grid.lrz.de:10907 LabelSets: {_prometheus_group=\"etp.physik.lmu.de\", _prometheus_replica=\"0\"} MinTime: 1701813600103 MaxTime: 9223372036854775807 queried"

No real change when I select "Auto downsampling", though with that at least I see something in the request about the resolution:

Dec 08 03:20:05 lcg-lrz-monitor-ext thanos@query[966008]: ts=2023-12-08T02:20:05.191311512Z caller=proxy.go:364 level=debug component=proxy request="min_time:1700788805006 max_time:1702002005006 matchers:<name:\"instance\" value:\"lcg-lrz-dcache0.grid.lrz.de\" > matchers:<name:\"__name__\" value:\"node_cpu_seconds_total\" > max_resolution_window:967600 aggregates:COUNT aggregates:SUM partial_response_strategy:ABORT without_replica_labels:\"_prometheus_replica\" " msg="Series: started fanout streams" status="store Addr: lcg-lrz-monitor-ext.grid.lrz.de:10905 LabelSets: {_prometheus_group=\"etp.physik.lmu.de\", _prometheus_replica=\"0\"},{_prometheus_group=\"etp.physik.lmu.de\", _prometheus_replica=\"1\"} MinTime: 1701206583857 MaxTime: 1701829202053 queried;store Addr: lcg-lrz-prometheus.grid.lrz.de:10901 LabelSets: {_prometheus_group=\"etp.physik.lmu.de\", _prometheus_replica=\"1\"} MinTime: 1701206583857 MaxTime: 9223372036854775807 queried;store Addr: lcg-lrz-monitor-ext.grid.lrz.de:10907 LabelSets: {_prometheus_group=\"etp.physik.lmu.de\", _prometheus_replica=\"0\"} MinTime: 1701813600103 MaxTime: 9223372036854775807 queried"

Even with Max 5m downsampling it doesn't get better (though I'd probably need a "Min" option to make sure that I only get downsampled data?).
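
(To rule out that there simply are no downsampled blocks yet, something like this should show how many blocks exist per resolution; I'm assuming thanos tools bucket ls can emit the block metas as JSON, and the jq part is just a sketch:)

# count blocks per downsampling resolution in ms (0 = raw, 300000 = 5m, 3600000 = 1h)
thanos tools bucket ls -o json --objstore.config-file=/etc/thanos/data-thanos.yml \
  | jq -r '.thanos.downsample.resolution' | sort | uniq -c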

When I go via the Grafana Full Node Exporter dashboard and take the last 30 days, I get queries like:

Dec 08 03:24:13 lcg-lrz-monitor-ext thanos@query[966008]: ts=2023-12-08T02:24:13.08320073Z caller=proxy.go:364 level=debug component=proxy request="min_time:1699405190000 max_time:1702000800000 matchers:<name:\"instance\" value:\"gar-ws-bl28.garching.physik.uni-muenchen.de\" > matchers:<name:\"job\" value:\"node\" > matchers:<name:\"__name__\" value:\"node_cpu_seconds_total\" > aggregates:COUNTER without_replica_labels:\"_prometheus_replica\" " msg="Series: started fanout streams" status="store Addr: lcg-lrz-monitor-ext.grid.lrz.de:10905 LabelSets: {_prometheus_group=\"etp.physik.lmu.de\", _prometheus_replica=\"0\"},{_prometheus_group=\"etp.physik.lmu.de\", _prometheus_replica=\"1\"} MinTime: 1701206583857 MaxTime: 1701829452056 queried;store Addr: lcg-lrz-prometheus.grid.lrz.de:10901 LabelSets: {_prometheus_group=\"etp.physik.lmu.de\", _prometheus_replica=\"1\"} MinTime: 1701206583857 MaxTime: 9223372036854775807 queried;store Addr: lcg-lrz-monitor-ext.grid.lrz.de:10907 LabelSets: {_prometheus_group=\"etp.physik.lmu.de\", _prometheus_replica=\"0\"} MinTime: 1701813600103 MaxTime: 9223372036854775807 queried"

So no mention of max_resolution_window or the like. It seems to rather query the full 10s resolution, which is of course bad.

But even in the Thanos query UI it's super slow.

So I tried Store Filtering there and selected only the Thanos store as source. Of course I don't get the data from the last 48h or so with that, but plotting time goes down to 8-9s... IMO still far too slow, but it might indicate that the big part of the above 30-50s could be the VM's virtual disk.
However, even then it's still 8s for not that much data.

My query has these 3 endpoints:

  1. receive (has only about the last 2d of data), same host as query
  2. store, same host as query, all data, but only data older than ~2d
  3. sidecar (its Prometheus has a retention of 36GB, which as of now still covers everything), other host

When store-filtering on single ones of them, the above query gives: ~1s for (1), 8-9s for (2) and 35-50s for (3).
Of course they all show different data, and right now only (3) has everything (but will of course lose older data over time).

Ok, VM virtual disk IO may be bad... but why is it mostly fine for (1) (on the local host), but so much worse for (3) (on another host)? I rather don't think the network is an issue.
And why is it still quite bad for (2) (8-9s...)?
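
(For what it's worth, the per-endpoint timing can presumably also be reproduced outside the UI, directly against the querier's HTTP API; a sketch, where the storeMatch[] parameter with an __address__ matcher is my assumption from the query docs and the timestamps are just examples:)

# time a 2-week range query against the querier (:10904), restricted to the store gateway endpoint
time curl -s 'http://localhost:10904/api/v1/query_range' \
  --data-urlencode 'query=node_cpu_seconds_total{instance="someSingleNode"}' \
  --data-urlencode 'start=2023-11-24T00:00:00Z' \
  --data-urlencode 'end=2023-12-08T00:00:00Z' \
  --data-urlencode 'step=60s' \
  --data-urlencode 'storeMatch[]={__address__="lcg-lrz-monitor-ext.grid.lrz.de:10905"}' \
  >/dev/null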

Any ideas what I'm doing wrong?

Thanks, Chris.

MichaHoffmann commented 10 months ago

Do you have tracing enabled? If not, could you please rerun your experiment with tracing?

calestyo commented 10 months ago

Not yet, I'll have to look into it first and see how it works.

But one thing that came to my mind:

Since I have these 3 endpoints configured for query... one sidecar and one receive (both have a min time but no max time), each reflecting one of the two Prometheuses... and additionally the store, which reflects both (though only for data older than 2d)...

... doesn't that also mean that I get a large part of the metrics at least twice (and counting in that there are two replicas: actually 4 times)... and all of it has to be deduplicated?

So for my CPU query, which already results in 512 series (the node has 64 logical cores), I would actually transmit a bit less than 4*512 series (less, because for the last 2d there's nothing in the store).

Could that kill my performance?

MichaHoffmann commented 10 months ago

Yes, data has to be deduplicated, but that should really not be an issue.

calestyo commented 10 months ago

Is there a way to see which data it actually loads for a given query (I mean from which endpoints, for which time range, and how many samples it got, per endpoint)? Or is that exactly what tracing would show?

calestyo commented 9 months ago

Tried the tracing now, but didn't get it working... there doesn't really seem to be much (any?) documentation available on how to even set things up.

Out of the backends listed at https://thanos.io/tip/thanos/tracing.md/, Jaeger seems to be the only non-proprietary one, and setting that up seems even more complex than setting up Thanos itself ^^

MichaHoffmann commented 9 months ago

> Tried the tracing now, but didn't get it working... there doesn't really seem to be much (any?) documentation available on how to even set things up.
>
> Out of the backends listed at https://thanos.io/tip/thanos/tracing.md/, Jaeger seems to be the only non-proprietary one, and setting that up seems even more complex than setting up Thanos itself ^^

So IIRC Jaeger has an all-in-one deployment that might suffice as a quick way to get traces, or maybe a free plan from some proprietary offering would work too, if that is possible.
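
Something along these lines might already be enough for a quick test (a sketch based on the tracing docs; ports, paths and config keys should be double-checked):

# Jaeger all-in-one: UI on :16686, agent on :6831/udp
docker run -d --name jaeger -p 16686:16686 -p 6831:6831/udp jaegertracing/all-in-one

# minimal tracing config pointing Thanos at that agent
cat >/etc/thanos/tracing.yml <<'EOF'
type: JAEGER
config:
  service_name: thanos-query   # use a distinct name per component
  agent_host: localhost
  agent_port: 6831
  sampler_type: const
  sampler_param: 1             # sample every request while debugging
EOF

# then add --tracing.config-file=/etc/thanos/tracing.yml to the query/store/receive/sidecar
# flags and look at the traces in the Jaeger UI at http://localhost:16686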

> The main VM, where query and store run, should be quite decent (32GB RAM, 16 (logical) CPU cores).

It feels more like an IO issue, but it's really hard to tell without a trace to look at. As an alternative, can you maybe try to use store filtering in the Thanos UI to see which store responds slowly?