perf: optimize tree resolve

The PR is aimed at optimizing the way we build a tree from stack traces at read.

Specifically, I replaced the model.Tree with model.StacktaceTree that we've optimized for pprof symbolication.

Here is a comparison of queries that build a flame graph that includes 3M unique stack traces, and 10M nodes (real-case dataset) in a singe thread:

Before             9615096375 ns/op 4244433576 B/op 26203002 allocs/op
After              3626939417 ns/op 1399753624 B/op   497736 allocs/op

The improvement is appx. 60% (9.6s -> 3.8s). In the test, I used the default 16k max nodes limit. However, even without the limit (tested on ~1-10M node trees), the performance is sustainable:

goos: darwin
goarch: arm64
pkg: github.com/grafana/pyroscope/pkg/phlaredb/symdb
Benchmark_Resolver_ResolveTree_Big/0-10     3    364309236 ns/op    93448242 B/op       6059 allocs/op
Benchmark_Resolver_ResolveTree_Big/8K-10    3    383892556 ns/op    94366349 B/op     106298 allocs/op
Benchmark_Resolver_ResolveTree_Big/16K-10   3    382851180 ns/op    94969890 B/op     184351 allocs/op
Benchmark_Resolver_ResolveTree_Big/32K-10   3    389259681 ns/op    96135072 B/op     313091 allocs/op
Benchmark_Resolver_ResolveTree_Big/64K-10   3    413904764 ns/op    98080141 B/op     523357 allocs/op

Further optimization of tree symbolication seems to be complicated (although, there is an opportunity to decrease cache misses and improve branch prediction). I'd rather consider optimizing stack trace selection: instead of resolving each stack trace, it makes sense to trim insignificant ones based on their weight (various heuristics could be employed), if the stack trace set exceeds a threshold (e.g., cardinality > 1-2M). This will introduce some inaccuracy in the flame graphs, but I believe this is acceptable:

grafana / pyroscope

perf: optimize tree resolve #3348