grafana / pyroscope

Continuous Profiling Platform. Debug performance issues down to a single line of code
https://grafana.com/oss/pyroscope/
GNU Affero General Public License v3.0
9.64k stars 574 forks source link

perf: optimize tree processing #3386

Closed kolesnikovae closed 1 week ago

kolesnikovae commented 1 week ago

The PR optimizes the way samples are aggregated and symbolized:

Benchmark of the SelectMergeByStacktraces query on the real data set that causes performance issues in the current implementation (~1K profiles, ~10K samples each):

    before    │                after                │
    sec/op    │   sec/op     vs base                │
 3599.9m ± 2%   661.8m ± 1%  -81.62% (p=0.000 n=10)

    before     │                after                │
     B/op      │     B/op      vs base               │
 1054.5Mi ± 1%   999.9Mi ± 3%  -5.18% (p=0.000 n=10)

   before    │                after                │
  allocs/op  │  allocs/op   vs base                │
 494.4k ± 0%   421.2k ± 0%  -14.80% (p=0.000 n=10)

Most of the allocations are made in parquet decoding and reconstruction of the symbolic information (locations, functions, mappings, and strings). This is addressed in https://github.com/grafana/pyroscope/pull/3138.

This is quite close to the synthetic benchmarks we have:

                                  │    before    │                after                │
                                  │    sec/op    │   sec/op     vs base                │
_Resolver_ResolveTree_Small/0-10     261.4µ ± 0%   259.2µ ± 1%   -0.85% (p=0.000 n=10)
_Resolver_ResolveTree_Small/1K-10    303.9µ ± 1%   258.7µ ± 1%  -14.86% (p=0.000 n=10)
_Resolver_ResolveTree_Small/8K-10    275.4µ ± 1%   272.2µ ± 1%   -1.17% (p=0.000 n=10)
_Resolver_ResolveTree_Big/0-10       479.4m ± 4%   457.3m ± 2%   -4.61% (p=0.000 n=10)
_Resolver_ResolveTree_Big/8K-10     362.16m ± 2%   47.96m ± 1%  -86.76% (p=0.000 n=10)
_Resolver_ResolveTree_Big/16K-10    368.45m ± 1%   68.41m ± 6%  -81.43% (p=0.000 n=10)
_Resolver_ResolveTree_Big/32K-10     372.8m ± 3%   101.6m ± 3%  -72.76% (p=0.000 n=10)
_Resolver_ResolveTree_Big/64K-10     378.9m ± 7%   162.0m ± 4%  -57.26% (p=0.000 n=10)

                                  │    before     │                after                 │
                                  │     B/op      │     B/op      vs base                │
_Resolver_ResolveTree_Small/0-10    148.4Ki ±  0%   148.1Ki ± 0%   -0.19% (p=0.000 n=10)
_Resolver_ResolveTree_Small/1K-10   152.1Ki ±  0%   142.6Ki ± 0%   -6.25% (p=0.000 n=10)
_Resolver_ResolveTree_Small/8K-10   231.9Ki ±  0%   231.6Ki ± 0%   -0.15% (p=0.001 n=10)
_Resolver_ResolveTree_Big/0-10      198.0Mi ± 40%   202.7Mi ± 0%   +2.39% (p=0.000 n=10)
_Resolver_ResolveTree_Big/8K-10     90.04Mi ± 88%   46.86Mi ± 0%        ~ (p=0.134 n=10)
_Resolver_ResolveTree_Big/16K-10    90.77Mi ± 87%   55.95Mi ± 0%        ~ (p=0.136 n=10)
_Resolver_ResolveTree_Big/32K-10    92.41Mi ±  0%   70.96Mi ± 0%  -23.21% (p=0.000 n=10)
_Resolver_ResolveTree_Big/64K-10    95.62Mi ±  0%   88.80Mi ± 0%   -7.13% (p=0.001 n=10)

                                  │   before    │                after                │
                                  │  allocs/op  │  allocs/op   vs base                │
_Resolver_ResolveTree_Small/0-10    2.400k ± 0%   2.403k ± 0%   +0.12% (p=0.000 n=10)
_Resolver_ResolveTree_Small/1K-10   3.224k ± 0%   2.036k ± 0%  -36.85% (p=0.000 n=10)
_Resolver_ResolveTree_Small/8K-10   2.400k ± 0%   2.403k ± 0%   +0.12% (p=0.000 n=10)
_Resolver_ResolveTree_Big/0-10      3.047M ± 0%   3.042M ± 0%   -0.15% (p=0.000 n=10)
_Resolver_ResolveTree_Big/8K-10     28.85k ± 0%   19.70k ± 0%  -31.72% (p=0.000 n=10)
_Resolver_ResolveTree_Big/16K-10    47.91k ± 0%   39.01k ± 0%  -18.58% (p=0.000 n=10)
_Resolver_ResolveTree_Big/32K-10    87.01k ± 0%   76.53k ± 0%  -12.05% (p=0.000 n=10)
_Resolver_ResolveTree_Big/64K-10    164.9k ± 0%   152.7k ± 0%   -7.42% (p=0.000 n=10)

Note that the optimisations mostly concern large data sets (significantly bigger than our ResolveTree_Big) and some of them are not used in the case when the max nodes limit is not specified (defaults to 16K), or is too big, as this would be inefficient because of the increased memory consumption. Therefore, e.g., Resolver_ResolveTree_Big/0 (no truncation) does not perform significantly better.