grafana / pyroscope

Continuous Profiling Platform. Debug performance issues down to a single line of code
https://grafana.com/oss/pyroscope/
GNU Affero General Public License v3.0
9.97k stars 597 forks source link

perf: tune async batch iterator #3358

Closed kolesnikovae closed 3 months ago

kolesnikovae commented 3 months ago

It's been observed that the async batch iterator we're using for fetching Parquet rows might be using too much memory. Note the query.CloneParquetValues call under iter.(*AsyncBatchIterator[...]).fillBuffer in the flame graph:

image

The problem manifests when the query hits downsampled (aggregated) profiles: a row may contain thousands of values. Another factor is the misalignment of the query split interval and the block duration: each sub-range is processed independently, with its own iterator, thus multiplying the memory requirement.

In practice, a large buffer is not required here, as it is only needed to avoid waiting for fetches from individual columns by reading the data from them ahead of time. In turn, each of the columns has it's own "read ahead" buffer, which should minimaize blocking of the top-level iterator.

One way to solve the problem is to make the iterator work with size in bytes and have a predictable memory footprint. In the PR, I reduce the default buffer size and change the allocation strategy to use the new slices.Grow function.

kolesnikovae commented 3 months ago

The change has helped to reduce the pressure, however, I still think that the memory usage is too wasteful. I'm thinking about removing the buffer altogether – my experiments have shown no significant impact on the performance

image

There are more spots that need to be optmimized:

  1. The map that accumulates samples. We could replace it with a slice and use direct indexing.
  2. The intermediate tree (before truncation). This one is tricky: we need to find a way to trim stack traces before building the tree.
  3. In-memory symdb partitions. We should disable chunking for stack traces.