pkolaczk / latte

Latency Tester for Apache Cassandra
Apache License 2.0

Histograms stored in samples take too much memory during long runs #67

Closed vponomaryov closed 2 months ago

vponomaryov commented 6 months ago

[Screenshot from 2024-03-29 14-19-56]

The screenshot above shows the memory utilization of the 2 nodes used for running latte. On each node, memory utilization grew to 10 GB over 3 hours of uptime.

I debugged a bit locally and observed that memory is leaked at each sampling event. The growth appears to be directly related to the number of operations performed during a sampling period.

pkolaczk commented 3 months ago

Latte collects some data (summaries) in memory and processes them afterwards; so some memory growth is expected. However if it is 10 GB, that's a lot. My first guess would be histograms...

vponomaryov commented 3 months ago

> Latte collects some data (summaries) in memory and processes them afterwards; so some memory growth is expected. However if it is 10 GB, that's a lot. My first guess would be histograms...

The root cause is the constantly growing number of stored samples, which are then used for report generation. And yes, those include histograms.

So the proper solution, I think, would be to process samples on the fly and store only a single processed summary that gets updated at each sampling step.
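
To illustrate the idea of a single summary that is updated at each sampling step, here is a minimal sketch. The `Sample` and `RunningSummary` types and their fields are hypothetical and are not taken from latte's code base; they only show that the retained state can stay constant in size no matter how long the run is.

```rust
/// Measurements for one sampling period (hypothetical shape, not latte's type).
pub struct Sample {
    pub requests: u64,
    pub errors: u64,
    pub latency_sum_ns: u64,
    pub latency_max_ns: u64,
}

/// Running summary that folds every sample in and keeps O(1) state.
#[derive(Debug, Default)]
pub struct RunningSummary {
    pub samples: u64,
    pub requests: u64,
    pub errors: u64,
    pub latency_sum_ns: u128,
    pub latency_max_ns: u64,
}

impl RunningSummary {
    /// Merge one sample into the summary; the sample can be dropped afterwards.
    pub fn update(&mut self, s: &Sample) {
        self.samples += 1;
        self.requests += s.requests;
        self.errors += s.errors;
        self.latency_sum_ns += s.latency_sum_ns as u128;
        self.latency_max_ns = self.latency_max_ns.max(s.latency_max_ns);
    }

    /// Mean latency per request, or `None` if nothing was recorded yet.
    pub fn mean_latency_ns(&self) -> Option<f64> {
        if self.requests == 0 {
            None
        } else {
            Some(self.latency_sum_ns as f64 / self.requests as f64)
        }
    }
}
```

Counters, sums and maxima merge trivially like this. Percentiles would need one merged histogram kept alongside the counters instead of one histogram per sample, which is still bounded memory.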

pkolaczk commented 3 months ago

The histograms in the samples are compressed and then stored in the report for future use, e.g. for producing HdrHistogram logs (latte hdr command). So I think this needs a user-facing change. Instead of saving the histograms to the report, the option for producing HDR logs should be tied directly to run, and the histograms should optionally be streamed to a separate file while running. This would have the additional benefit of making the reports smaller and faster to load, which is now even more important after I added the latte list command.
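
A rough sketch of what streaming histograms to a separate file could look like, assuming each sampling period yields an already compressed histogram as a byte slice. `HistogramSink` and its record layout (start, duration, hex payload) are made up for illustration; this is neither latte's API nor the HdrHistogram log format.

```rust
use std::io::{self, Write};

/// Hypothetical sink: appends one record per sampling period to `out`
/// and keeps no histogram data in memory afterwards.
pub struct HistogramSink<W: Write> {
    out: W,
    records: u64,
}

impl<W: Write> HistogramSink<W> {
    pub fn new(out: W) -> Self {
        Self { out, records: 0 }
    }

    /// Write one line: "<start_s>,<duration_s>,<hex-encoded payload>".
    /// `encoded` stands in for a compressed histogram of one sample.
    pub fn append(&mut self, start_s: f64, duration_s: f64, encoded: &[u8]) -> io::Result<()> {
        write!(self.out, "{:.3},{:.3},", start_s, duration_s)?;
        for b in encoded {
            write!(self.out, "{:02x}", b)?;
        }
        writeln!(self.out)?;
        self.records += 1;
        Ok(())
    }

    /// Number of records written so far.
    pub fn records(&self) -> u64 {
        self.records
    }

    /// Flush buffered data and hand the underlying writer back.
    pub fn finish(mut self) -> io::Result<W> {
        self.out.flush()?;
        Ok(self.out)
    }
}
```

In a real run the writer would typically be a `std::io::BufWriter` around a `std::fs::File`; the sketch is generic over `Write` so it can be exercised with an in-memory `Vec<u8>`.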

As a temporary workaround, you can control the interval at which latte takes samples. For very long runs there is probably no point in capturing them every second. Fewer samples = less memory overhead.
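
As a back-of-the-envelope illustration of "fewer samples = less memory overhead", the sketch below estimates how many samples are retained for a given run length and sampling interval. The function names and the fixed per-sample size are assumptions for illustration only; in practice the size of a compressed histogram may vary with the data it holds.

```rust
/// Number of samples taken during a run (the last partial interval counts too).
pub fn sample_count(run_secs: u64, interval_secs: u64) -> u64 {
    assert!(interval_secs > 0, "sampling interval must be positive");
    (run_secs + interval_secs - 1) / interval_secs
}

/// Estimated bytes retained if every sample costs `bytes_per_sample`
/// (an assumed constant, not a measured latte value).
pub fn retained_bytes(run_secs: u64, interval_secs: u64, bytes_per_sample: u64) -> u64 {
    sample_count(run_secs, interval_secs) * bytes_per_sample
}
```

For the 3-hour run reported above (10,800 seconds), sampling every second yields 10,800 samples, while sampling every 10 seconds yields 1,080, so under the constant-size assumption the retained memory drops by a factor of ten.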