perf: optimize tree processing

The PR optimizes the way samples are aggregated and symbolized:

For large datasets exceeding 16K samples, implement a sparse set to accumulate samples. Due to the "dense" nature of stack trace identifiers, memory usage remains efficient. Additionally, this approach naturally orders the samples by their stack trace identifiers.
Instead of building the resulting tree from scratch for large datasets, utilize the stored parent pointer tree as a prototype. This significantly reduces the number of memory accesses, improving overall efficiency. The downside of this approach is that truncation may become more "invasive" at the leaves: the source tree is built from locations, not functions. Therefore, the "other" stub may include siblings' samples. In practice, the impact is moderate and can be mitigated by increasing the max nodes limit (while still being more efficient: compare 8K and 64K benchmarks). A proper solution to this problem is to store function stack traces in addition to the tree of locations.
Specialize min heap implementation (int64) to avoid allocations caused by the interface conversion.

Benchmark of the SelectMergeByStacktraces query on the real data set that causes performance issues in the current implementation (~1K profiles, ~10K samples each):

    before    │                after                │
    sec/op    │   sec/op     vs base                │
 3599.9m ± 2%   661.8m ± 1%  -81.62% (p=0.000 n=10)

    before     │                after                │
     B/op      │     B/op      vs base               │
 1054.5Mi ± 1%   999.9Mi ± 3%  -5.18% (p=0.000 n=10)

   before    │                after                │
  allocs/op  │  allocs/op   vs base                │
 494.4k ± 0%   421.2k ± 0%  -14.80% (p=0.000 n=10)

Most of the allocations are made in parquet decoding and reconstruction of the symbolic information (locations, functions, mappings, and strings). This is addressed in https://github.com/grafana/pyroscope/pull/3138.

This is quite close to the synthetic benchmarks we have:

                                  │    before    │                after                │
                                  │    sec/op    │   sec/op     vs base                │
_Resolver_ResolveTree_Small/0-10     261.4µ ± 0%   259.2µ ± 1%   -0.85% (p=0.000 n=10)
_Resolver_ResolveTree_Small/1K-10    303.9µ ± 1%   258.7µ ± 1%  -14.86% (p=0.000 n=10)
_Resolver_ResolveTree_Small/8K-10    275.4µ ± 1%   272.2µ ± 1%   -1.17% (p=0.000 n=10)
_Resolver_ResolveTree_Big/0-10       479.4m ± 4%   457.3m ± 2%   -4.61% (p=0.000 n=10)
_Resolver_ResolveTree_Big/8K-10     362.16m ± 2%   47.96m ± 1%  -86.76% (p=0.000 n=10)
_Resolver_ResolveTree_Big/16K-10    368.45m ± 1%   68.41m ± 6%  -81.43% (p=0.000 n=10)
_Resolver_ResolveTree_Big/32K-10     372.8m ± 3%   101.6m ± 3%  -72.76% (p=0.000 n=10)
_Resolver_ResolveTree_Big/64K-10     378.9m ± 7%   162.0m ± 4%  -57.26% (p=0.000 n=10)

                                  │    before     │                after                 │
                                  │     B/op      │     B/op      vs base                │
_Resolver_ResolveTree_Small/0-10    148.4Ki ±  0%   148.1Ki ± 0%   -0.19% (p=0.000 n=10)
_Resolver_ResolveTree_Small/1K-10   152.1Ki ±  0%   142.6Ki ± 0%   -6.25% (p=0.000 n=10)
_Resolver_ResolveTree_Small/8K-10   231.9Ki ±  0%   231.6Ki ± 0%   -0.15% (p=0.001 n=10)
_Resolver_ResolveTree_Big/0-10      198.0Mi ± 40%   202.7Mi ± 0%   +2.39% (p=0.000 n=10)
_Resolver_ResolveTree_Big/8K-10     90.04Mi ± 88%   46.86Mi ± 0%        ~ (p=0.134 n=10)
_Resolver_ResolveTree_Big/16K-10    90.77Mi ± 87%   55.95Mi ± 0%        ~ (p=0.136 n=10)
_Resolver_ResolveTree_Big/32K-10    92.41Mi ±  0%   70.96Mi ± 0%  -23.21% (p=0.000 n=10)
_Resolver_ResolveTree_Big/64K-10    95.62Mi ±  0%   88.80Mi ± 0%   -7.13% (p=0.001 n=10)

                                  │   before    │                after                │
                                  │  allocs/op  │  allocs/op   vs base                │
_Resolver_ResolveTree_Small/0-10    2.400k ± 0%   2.403k ± 0%   +0.12% (p=0.000 n=10)
_Resolver_ResolveTree_Small/1K-10   3.224k ± 0%   2.036k ± 0%  -36.85% (p=0.000 n=10)
_Resolver_ResolveTree_Small/8K-10   2.400k ± 0%   2.403k ± 0%   +0.12% (p=0.000 n=10)
_Resolver_ResolveTree_Big/0-10      3.047M ± 0%   3.042M ± 0%   -0.15% (p=0.000 n=10)
_Resolver_ResolveTree_Big/8K-10     28.85k ± 0%   19.70k ± 0%  -31.72% (p=0.000 n=10)
_Resolver_ResolveTree_Big/16K-10    47.91k ± 0%   39.01k ± 0%  -18.58% (p=0.000 n=10)
_Resolver_ResolveTree_Big/32K-10    87.01k ± 0%   76.53k ± 0%  -12.05% (p=0.000 n=10)
_Resolver_ResolveTree_Big/64K-10    164.9k ± 0%   152.7k ± 0%   -7.42% (p=0.000 n=10)

Note that the optimisations mostly concern large data sets (significantly bigger than our ResolveTree_Big) and some of them are not used in the case when the max nodes limit is not specified (defaults to 16K), or is too big, as this would be inefficient because of the increased memory consumption. Therefore, e.g., Resolver_ResolveTree_Big/0 (no truncation) does not perform significantly better.

grafana / pyroscope

perf: optimize tree processing #3386