calestyo opened this issue 10 months ago
Do you have tracing enabled? If not could you rerun your experiment with tracing please?
Not yet, I'll have to look into it first and see how it works.
But one thing that came to my mind:
Since I have these 3 endpoints configured for `query`: one `sidecar` and one `receive` (both have a min time but no max time), each reflecting one of the two Prometheuses, and additionally the `store`, which reflects both (though only for data older than 2d)...
... doesn't that also mean that I get the metrics, for a large part, at least twice (and if one counts in that there are two replicas: actually 4 times)... and all of it has to be deduplicated?
So in my CPU query, which already yields 512 series (the node has 64 logical cores), I would actually transmit a bit less than 4×512 series (less, because for the last 2d there's nothing in `store`).
Could that kill my performance?
Yes, data has to be deduplicated, but that should really not be an issue.
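To illustrate why this is cheap: deduplication is essentially a merge over series that are identical except for the replica label. A naive Python sketch (Thanos's real algorithm is penalty-based and smarter about gaps; this just shows why 4 copies of 512 series come out as 512 again):

```python
def dedup(series_list, replica_label="replica"):
    """Group series by their label set minus the replica label and
    merge samples (timestamp -> value); first writer wins on overlap."""
    merged = {}
    for labels, samples in series_list:
        key = tuple(sorted((k, v) for k, v in labels.items()
                           if k != replica_label))
        bucket = merged.setdefault(key, {})
        for ts, val in samples.items():
            bucket.setdefault(ts, val)
    return merged

# Two replicas of the same CPU series, with partly overlapping samples:
a = ({"__name__": "node_cpu_seconds_total", "cpu": "0", "replica": "r1"},
     {10: 1.0, 20: 2.0})
b = ({"__name__": "node_cpu_seconds_total", "cpu": "0", "replica": "r2"},
     {10: 1.0, 30: 3.0})

out = dedup([a, b])
print(len(out))  # 1 series left after dedup
```

The merge itself is linear in the number of samples, which is why dedup alone rarely explains multi-second queries.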
Is there a way how I can see, which data it actually loads for a given query (I mean from which endpoints, and for which range the data is and how many samples it got, per endpoint)? Or is that just what the tracing would do?
Tried the tracing now, but didn't get it working... there doesn't really seem to be much (any?) documentation available on how to even set things up.
Out of the backends listed at https://thanos.io/tip/thanos/tracing.md/, Jaeger seems to be the only non-proprietary one, and setting that up seems even more complex than setting up Thanos itself ^^
So, IIRC, Jaeger has an all-in-one deployment that might suffice as a quick way to get traces, or maybe a free plan from some proprietary offering could work too, if that is possible.
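As a rough starting point (not taken from this thread, so treat the host/port values as assumptions): the all-in-one container can be started with `docker run -d -p 16686:16686 -p 6831:6831/udp jaegertracing/all-in-one`, and a minimal file passed to Thanos via `--tracing.config-file` might look like:

```yaml
type: JAEGER
config:
  service_name: thanos-query
  sampler_type: const
  sampler_param: 1
  agent_host: localhost
  agent_port: 6831
```

With `sampler_param: 1` every request is traced, which is fine for debugging a single slow query but too much for production.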
> The main VM, where query and store runs should be quite decent (32GB RAM, 16 (logical) CPU cores).
It feels more like an IO issue, but it's really hard to tell without a trace to look at. As an alternative, can you maybe try to use store filtering in the Thanos UI to see which store responds slowly?
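If clicking through the UI gets tedious, the same filtering can be scripted against the Query API, which accepts a `storeMatch[]` selector (matched against each store's external labels) alongside `dedup`. A small sketch that only builds the request URL (the base URL and label selector below are placeholders):

```python
from urllib.parse import urlencode

def range_query_url(base, query, start, end, step,
                    store_match=None, dedup=True):
    """Build a Thanos /api/v1/query_range URL; storeMatch[] limits the
    fan-out to stores whose external labels match the given selector."""
    params = [("query", query), ("start", start), ("end", end),
              ("step", step), ("dedup", str(dedup).lower())]
    if store_match:
        params.append(("storeMatch[]", store_match))
    return f"{base}/api/v1/query_range?{urlencode(params)}"

url = range_query_url("http://localhost:10904",
                      'node_cpu_seconds_total{instance="someSingleNode"}',
                      1700000000, 1701209600, 60,
                      store_match='{prometheus="prom-a"}')
print(url)
```

Timing that URL with `time curl -s ... > /dev/null` per store gives roughly the same per-endpoint breakdown as the UI's store filtering.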
Hey there.
A while ago I set up Prometheus at the university to monitor a local compute cluster. We want both: very detailed short-term data (10s resolution for at least about the last week) and long-term data (e.g. for years, though in principle a low resolution would be enough for that).
For starters, I wanted to visualise `node_exporter`'s data, which I did via the Full Node Exporter Grafana dashboard. That was terribly slow. I mean, as soon as I plotted any range larger than a few days... it took ages to plot, especially e.g. the "CPU Basic" panel.
And it was completely unusable when I plotted many "CPU Basic" panels for multiple instances.
I was told that this is because I get data processed at 10s resolution, which is of course a lot.
Further, Thanos (specifically the compactor) was suggested as a solution.
I have now roughly the following setup:
- `sidecar` running, which writes data to some sshfs mountpoint (no funding at university for AWS or so :-P)
- `receive` running (which also writes to an sshfs mountpoint, with the same remote storage as the other)
- `query`, `compact` and `store` running, as well as Grafana.

Thanos configurations are:
In principle it looks all good, compaction happens. I see both label set.
However, with Thanos it seems even slower than with Prometheus. Grafana nearly always runs into a timeout when I do e.g. "the last 30 days" (even though as of now I only have data for a bit over a week).
From the `query` UI it's not much better. I just queried `node_cpu_seconds_total{instance="someSingleNode"}` over 2 weeks, with deduplication and no resolution... which took 44446ms. That yields 512 series. That's a showstopper when one wants to visualise e.g. the CPU utilisation of a whole cluster with n nodes in a dashboard.
Now I'm having a hard time debugging why it is slow. The main VM, where `query` and `store` run, should be quite decent (32GB RAM, 16 (logical) CPU cores).
The network, especially also to the sshfs mountpoint, should be fine.
I made a test where I first dropped all fs caches on the node where the data actually lies, and then read all data on the sshfs mountpoint on the `query` node via:

That gave ~165 MiB/s of IO (via the sshfs)... a bit above 7 mins for 73 GiB.
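Quick sanity check on those numbers (just arithmetic, to confirm the throughput figure and the wall-clock time are consistent):

```python
# 73 GiB read at ~165 MiB/s should take a bit over 7.5 minutes,
# which matches the "bit above 7 mins" observation.
total_mib = 73 * 1024        # 73 GiB in MiB
seconds = total_mib / 165    # at ~165 MiB/s
print(seconds / 60)          # ~7.55 minutes
```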
What might however be bad is IO to the local (virtual) disk on the VMs... but I thought that wouldn't really be used that much for `query` (but only for `receive`)?

When doing the query from above in the `query` UI with the 512 series repeatedly, it starts at 55s, then takes 38s, then 30s... but that's it. The data points that I can hover over in the UI are already pretty low-res; between two points there's about 1h20min, despite "Only raw data" being selected... what?
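For what it's worth, an ~1h20min gap between plotted points over a 2-week range is consistent with the UI's query step rather than the storage resolution: Prometheus-style UIs pick a step so the whole range fits into roughly a fixed number of points (the ~250-point target below is an assumption about the UI, not something stated in this thread):

```python
# If the UI aims for ~250 points across the selected range, a 2-week
# window alone forces an ~80-minute step between plotted samples,
# regardless of "Only raw data" being selected.
range_seconds = 14 * 24 * 3600
step = range_seconds / 250
print(step / 60)  # ~80.6 minutes, i.e. about 1h20min
```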
The `--log.level=debug` entry for such a query looks like:

No real change when I select "Auto downsampling", though with that at least I see something in the request about the resolution:
Even with Max. 5m downsampling it doesn't get better (though I'd probably need Min to make sure that I only take downsampled data?).
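On the Min-vs-Max point: as far as I understand, `max_source_resolution` is an upper bound on the sample interval a store may serve, and the store then uses the coarsest available downsampling level within that bound, so there is no separate "min". A simplified sketch of that selection, assuming the three standard Thanos levels (raw, 5m, 1h):

```python
# Thanos keeps up to three resolution levels per block: raw (0s),
# 5m (300s) and 1h (3600s). max_source_resolution caps which levels
# a query may read; the store serves the coarsest level within the cap.
LEVELS_SECONDS = [0, 300, 3600]

def pick_resolution(available, max_source_resolution):
    """Return the coarsest available level <= max_source_resolution
    (falling back to the finest level if none qualifies)."""
    usable = [r for r in available if r <= max_source_resolution]
    return max(usable) if usable else min(available)

print(pick_resolution(LEVELS_SECONDS, 0))     # raw only
print(pick_resolution(LEVELS_SECONDS, 300))   # 5m blocks allowed
print(pick_resolution(LEVELS_SECONDS, 3600))  # 1h blocks allowed
```

So with "Max 5m downsampling" selected, raw data is simply excluded in favour of the 5m level where 5m blocks exist; it cannot force downsampled data for ranges the compactor has not downsampled yet.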
When I go via the Grafana Full Node Exporter, and take the last 30 days, I get queries like:
So no mention of `max_resolution_window` or so. It seems it rather queries the full 10s resolution, which is of course bad. But even in the Thanos `query` UI it's super slow.

So I tried store filtering there and selected only the Thanos `store` as source. Of course I don't get the data from the last 48h or so with that, but plotting time goes down to 8-9s... IMO still far too slow, but it might indicate that a big part of the above 30-50s could be the VM's virtual disk. However, even without that it's still 8s for not that much.
My `query` has these 3 endpoints:

1. `receive` (has only about the last 2d of data), same host as `query`
2. `store`, same host as `query`; all data, but only that older than ~2d
3. `sidecar` (its Prometheus has a retention of 36GB, which as of now is still everything), other host

When store-filtering only single ones of them, the above query gives: ~1s for (1), 8-9s for (2) and 35-50s for (3).
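These per-endpoint timings line up with how the querier fans out: each store advertises a [minTime, maxTime] window, and only stores whose window overlaps the queried range are contacted at all. A simplified sketch, with made-up timestamps mirroring the setup above:

```python
def overlapping_stores(stores, start, end):
    """Return the stores whose advertised [min_time, max_time] window
    overlaps the queried range -- the only ones the querier contacts."""
    return [name for name, (mn, mx) in stores.items()
            if mn <= end and mx >= start]

NOW = 1_700_000_000
DAY = 86_400
stores = {
    "receive": (NOW - 2 * DAY, NOW),    # only the last ~2d
    "store":   (0, NOW - 2 * DAY),      # everything older than ~2d
    "sidecar": (NOW - 14 * DAY, NOW),   # retention still covers everything
}

# A 2-week query overlaps all three windows, so all three endpoints
# are hit and their results deduplicated afterwards:
print(overlapping_stores(stores, NOW - 14 * DAY, NOW))
```

So for long ranges the slowest overlapping endpoint (here the `sidecar`) dominates the total query time, even when the others respond quickly.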
Of course, all show different data, and right now only (3) has everything (but will of course lose older data over time).
Ok, VM virtual disk IO may be bad... but why is it mostly good for (1) (on the local host), but so much worse for (3) (on another host)? I rather don't think network is an issue.
And why is it still quite bad for (2) (8-9s)?
Any ideas what I'm doing wrong?
Thanks, Chris.