thanos-community / obslytics

Tools and Services allowing seamless usage of Observability data from Prometheus, Thanos, Cortex, M3DB, Loki and more!
Apache License 2.0

Huge amounts of memory usage when collecting data from Thanos/Prometheus #22

Open 4n4nd opened 4 years ago

4n4nd commented 4 years ago

When collecting data from Thanos, the tool uses a lot of memory. I think this is because all the metric data is downloaded and held in memory while it is being processed and written to the backend storage.

I know the Thanos Store API does not support streaming of data, but maybe we could chunk our queries somehow. Similarly for Prometheus: the Remote Read API does support streaming, but that is not available in the upstream client yet.
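
The chunking could look roughly like this (a hypothetical sketch, not actual obslytics code): split the requested time range into fixed-size windows and export them one at a time, so only one window's worth of samples is in memory at once.

```go
package main

import (
	"fmt"
	"time"
)

// timeRange is a simple [start, end) window.
type timeRange struct {
	Start, End time.Time
}

// splitRange cuts [start, end) into consecutive windows of at most step,
// so each window can be fetched, processed, and flushed independently.
func splitRange(start, end time.Time, step time.Duration) []timeRange {
	var out []timeRange
	for cur := start; cur.Before(end); cur = cur.Add(step) {
		chunkEnd := cur.Add(step)
		if chunkEnd.After(end) {
			chunkEnd = end
		}
		out = append(out, timeRange{Start: cur, End: chunkEnd})
	}
	return out
}

func main() {
	start := time.Date(2021, 1, 1, 0, 0, 0, 0, time.UTC)
	end := start.Add(time.Hour)
	for _, r := range splitRange(start, end, 15*time.Minute) {
		// Each window would be queried and written to the backend before
		// moving on, bounding memory to one window's worth of data.
		fmt.Println(r.Start.Format(time.RFC3339), "->", r.End.Format(time.RFC3339))
	}
}
```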

bwplotka commented 4 years ago

What does "huge" mean? (:

Performance optimization is a never-ending story. We need to find out if:

This will help us figure out how much we can improve :hugs:

4n4nd commented 4 years ago

@gmfrasca do you have more specific numbers for the workflows?

gmfrasca commented 4 years ago

@4n4nd when running with 1-hour chunks, a few of the larger metrics (subscription_labels, for example) were getting OOMKilled even with 24GiB of RAM allocated.

We initially allocated 12GiB, at which the majority of the metrics were getting OOMKilled. We worked around that by halving our chunk size (2 runs per hour, 30 minutes per chunk), but the larger metrics still failed at 12GiB. Due to hardware limitations we unfortunately cannot sustain running pods with 24GiB, but we also can't reduce our chunk size much further, as we would risk a job 'collision' (for example, with 4 jobs/hour on a 15-minute cadence, the XX:30 job could end up running in parallel with the XX:15 job and compete for resources we may not have).

I hope this is helpful, but please let me know if there are any other details I can provide. Thanks!

gmfrasca commented 3 years ago

Hey @bwplotka! With the holiday season concluding, I just wanted to bump/signal-boost this issue. Do you have any insights on how we could approach alleviating the memory utilization here?

In the short term I don't think we have much flexibility to allocate more hardware resources, which leaves us in a tough spot for expanding the number of metrics we can retrieve or adding features to the current pipelines. It would be great if we could track down and optimize the memory-heavy areas in the code to get around that, if at all possible.

bwplotka commented 3 years ago

Yea, the way forward is to obtain profiles and figure out the problematic spot (:

Feel free to read more about it here: https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/

The pprof endpoint should already be available.
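
For reference, exposing such an endpoint in Go usually just means importing net/http/pprof (a generic sketch, not the obslytics wiring):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints on a side port.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... rest of the application ...
	select {}
}
```

A heap profile can then be grabbed with `go tool pprof http://localhost:6060/debug/pprof/heap`.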

If this is a tool, not a long-running service, we could use simple code like this (you can just copy this function) or import https://github.com/efficientgo/tools/blob/main/performance/pkg/profiles/profile.go#L17
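
Something along these lines (my own minimal sketch of that approach; the linked helper is more complete):

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile dumps the current heap profile to path. Call it right
// before the tool exits, or wherever memory usage is expected to peak.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Run a GC first so the profile reflects live memory, not garbage.
	runtime.GC()
	return pprof.WriteHeapProfile(f)
}

func main() {
	defer func() {
		if err := writeHeapProfile("heap.pprof"); err != nil {
			panic(err)
		}
	}()

	// ... run the export ...
}
```

The resulting file can be inspected with `go tool pprof heap.pprof`.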