grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Speedup DistinctValue collector and exit early for ingesters #4104

Closed electron0zero closed 1 month ago

electron0zero commented 1 month ago

What this PR does:

  1. Makes the DistinctValue collector faster by not tracking the diff in cases where the diff isn't used, and by reducing locking and a few other operations.
  2. Fixes a bug in collection where we were not stopping early. We now check and stop as soon as we hit the collector's limit, which greatly improves performance for high-cardinality queries.

With early stop, we now bail out and return results as soon as the limit is hit, instead of continuing to collect values.
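
The early-stop behaviour described above could look roughly like the sketch below. It is a simplified illustration, not Tempo's actual collector: the `DistinctValues` type, its `Collect`, `Len`, and `Exceeded` methods, the `collectAll` helper, and the byte-size limit semantics are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// DistinctValues is a hypothetical, simplified collector (not Tempo's type).
// It keeps unique strings until their combined length would exceed limit.
type DistinctValues struct {
	mtx      sync.Mutex
	values   map[string]struct{}
	limit    int // assumed semantics: max total bytes, 0 means unlimited
	size     int
	exceeded bool
}

func NewDistinctValues(limit int) *DistinctValues {
	return &DistinctValues{values: make(map[string]struct{}), limit: limit}
}

// Collect stores v if it is new and fits within the limit.
// It returns true when the caller should stop sending more values.
func (d *DistinctValues) Collect(v string) (stop bool) {
	d.mtx.Lock()
	defer d.mtx.Unlock()

	if d.exceeded {
		return true // limit already hit: nothing more to do
	}
	if _, ok := d.values[v]; ok {
		return false // duplicate: no new work
	}
	if d.limit > 0 && d.size+len(v) > d.limit {
		d.exceeded = true
		return true // signal early stop to the caller
	}
	d.values[v] = struct{}{}
	d.size += len(v)
	return false
}

// Len returns the number of distinct values stored so far.
func (d *DistinctValues) Len() int {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	return len(d.values)
}

// Exceeded reports whether the limit was hit.
func (d *DistinctValues) Exceeded() bool {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	return d.exceeded
}

// collectAll feeds values into d and returns how many inputs were inspected
// before the collector asked to stop.
func collectAll(d *DistinctValues, in []string) int {
	for i, v := range in {
		if d.Collect(v) {
			return i + 1 // bail out instead of scanning the rest
		}
	}
	return len(in)
}

func main() {
	d := NewDistinctValues(6) // room for two 3-byte values
	n := collectAll(d, []string{"aaa", "bbb", "aaa", "ccc", "ddd", "eee"})
	fmt.Println(n, d.Len(), d.Exceeded()) // prints "4 2 true"
}
```

With a limit of 6 bytes, the loop stops at the fourth input and never looks at the remaining values, which is where the savings on high-cardinality data would come from.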

How fast? In the benchmarks below, the collector takes about 47% less time per op when used without a limit, roughly 60% to 86% less when a limit is set and hit, and roughly 89% to 99% less at the 100k and 1M limits when it also stops early once the limit is hit.
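
When reading these figures against the benchmark tables, it can help to separate "percent less time per op" from "times faster". A minimal sketch, assuming the `vs base` column is the relative change in sec/op, `(new - base) / base`, and using rounded values copied from the tables (the helper names `percentChange` and `speedup` are hypothetical):

```go
package main

import "fmt"

// percentChange returns the relative change from base to cur, in percent.
// Negative values mean cur takes less time than base.
func percentChange(base, cur float64) float64 {
	return (cur - base) / base * 100
}

// speedup returns how many times faster cur is than base.
func speedup(base, cur float64) float64 {
	return base / cur
}

func main() {
	// sec/op in milliseconds, copied (rounded) from the benchmark tables:
	// main without early stop vs fast collector with early stop.
	rows := []struct {
		name      string
		base, cur float64
	}{
		{"limit:0", 450.4, 238.0},
		{"limit:100000", 43.9401, 0.5737},
	}
	for _, r := range rows {
		fmt.Printf("%-14s %7.2f%% %6.2fx\n",
			r.name, percentChange(r.base, r.cur), speedup(r.base, r.cur))
	}
}
```

Read this way, a 50% reduction in sec/op is a 2x speedup, and a reduction can approach but never reach 100%.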

Benchmarks

main (no early stop) vs fast collector (with early stop)

```text
goos: darwin
goarch: arm64
pkg: github.com/grafana/tempo/pkg/collector
                        │ BenchmarkCollect_main_no_early_stop.txt │ BenchmarkCollect_fast_collect_early_stop.txt │
                        │                 sec/op                  │        sec/op     vs base                    │
Collect/limit:0              450.4m ± 1%     238.0m ± 1%  -47.16% (p=0.000 n=10+20)
Collect/limit:100000       43940.1µ ± 0%     573.7µ ± 1%  -98.69% (p=0.000 n=10+20)
Collect/limit:1000000       67.444m ± 0%     7.364m ± 1%  -89.08% (p=0.000 n=10+20)
Collect/limit:10000000       269.8m ± 2%     106.4m ± 1%  -60.54% (p=0.000 n=10+20)
geomean                      137.8m          18.09m       -86.87%

                        │ BenchmarkCollect_main_no_early_stop.txt │ BenchmarkCollect_fast_collect_early_stop.txt │
                        │                  B/op                   │        B/op     vs base                      │
Collect/limit:0             310.1Mi ± 0%    155.0Mi ± 0%  -50.00% (p=0.000 n=10+20)
Collect/limit:100000       1262.5Ki ± 0%    631.3Ki ± 0%  -49.99% (p=0.000 n=10+20)
Collect/limit:1000000      19.377Mi ± 0%    9.689Mi ± 0%  -50.00% (p=0.000 n=10+20)
Collect/limit:10000000     155.04Mi ± 0%    77.52Mi ± 0%  -50.00% (p=0.000 n=10+20)
geomean                     32.74Mi         16.37Mi       -50.00%

                        │ BenchmarkCollect_main_no_early_stop.txt │ BenchmarkCollect_fast_collect_early_stop.txt │
                        │                allocs/op                │        allocs/op     vs base                 │
Collect/limit:0              76.50k ± 0%     38.25k ± 0%  -50.00% (p=0.000 n=10+20)
Collect/limit:100000          214.0 ± 1%      108.0 ± 0%  -49.53% (n=10+20)
Collect/limit:1000000        4.556k ± 0%     2.280k ± 0%  -49.96% (n=10+20)
Collect/limit:10000000       38.15k ± 0%     19.07k ± 0%  -50.01% (p=0.000 n=10+20)
geomean                      7.304k          3.661k       -49.88%
```

main (with early stop) vs fast collector (with early stop)

```text
goos: darwin
goarch: arm64
pkg: github.com/grafana/tempo/pkg/collector
                        │ BenchmarkCollect_main_early_stop.txt │ BenchmarkCollect_fast_collect_early_stop.txt │
                        │                sec/op                │        sec/op     vs base                    │
Collect/limit:0              452.0m ± 1%     238.0m ± 1%  -47.35% (p=0.000 n=10+20)
Collect/limit:100000        1160.0µ ± 0%     573.7µ ± 1%  -50.54% (p=0.000 n=10+20)
Collect/limit:1000000       15.063m ± 0%     7.364m ± 1%  -51.11% (p=0.000 n=10+20)
Collect/limit:10000000      201.0m ± 13%     106.4m ± 1%  -47.05% (p=0.000 n=10+20)
geomean                      35.50m          18.09m       -49.05%

                        │ BenchmarkCollect_main_early_stop.txt │ BenchmarkCollect_fast_collect_early_stop.txt │
                        │                 B/op                 │        B/op     vs base                      │
Collect/limit:0             310.1Mi ± 0%    155.0Mi ± 0%  -50.00% (p=0.000 n=10+20)
Collect/limit:100000       1262.6Ki ± 0%    631.3Ki ± 0%  -49.99% (p=0.000 n=10+20)
Collect/limit:1000000      19.377Mi ± 0%    9.689Mi ± 0%  -50.00% (p=0.000 n=10+20)
Collect/limit:10000000     155.04Mi ± 0%    77.52Mi ± 0%  -50.00% (p=0.000 n=10+20)
geomean                     32.74Mi         16.37Mi       -50.00%

                        │ BenchmarkCollect_main_early_stop.txt │ BenchmarkCollect_fast_collect_early_stop.txt │
                        │              allocs/op               │        allocs/op     vs base                 │
Collect/limit:0              76.50k ± 0%     38.25k ± 0%  -50.01% (p=0.000 n=10+20)
Collect/limit:100000          214.0 ± 0%      108.0 ± 0%  -49.53% (n=10+20)
Collect/limit:1000000        4.556k ± 0%     2.280k ± 0%  -49.96% (n=10+20)
Collect/limit:10000000       38.15k ± 0%     19.07k ± 0%  -50.02% (p=0.000 n=10+20)
geomean                      7.304k          3.661k       -49.88%
```

main (no early stop) vs fast collector (no early stop)

```text
goos: darwin
goarch: arm64
pkg: github.com/grafana/tempo/pkg/collector
                        │ BenchmarkCollect_main_no_early_stop.txt │ BenchmarkCollect_fast_collect_no_early_stop.txt │
                        │                 sec/op                  │        sec/op     vs base                       │
Collect/limit:0              450.4m ± 1%     236.1m ± 2%  -47.57% (p=0.000 n=10)
Collect/limit:100000        43.940m ± 0%     6.143m ± 0%  -86.02% (p=0.000 n=10)
Collect/limit:1000000        67.44m ± 0%     12.70m ± 1%  -81.17% (p=0.000 n=10)
Collect/limit:10000000       269.8m ± 2%     107.6m ± 1%  -60.11% (p=0.000 n=10)
geomean                      137.8m          37.52m       -72.76%

                        │ BenchmarkCollect_main_no_early_stop.txt │ BenchmarkCollect_fast_collect_no_early_stop.txt │
                        │                  B/op                   │        B/op     vs base                         │
Collect/limit:0             310.1Mi ± 0%    155.0Mi ± 0%  -50.00% (p=0.000 n=10)
Collect/limit:100000       1262.5Ki ± 0%    631.3Ki ± 0%  -50.00% (p=0.000 n=10)
Collect/limit:1000000      19.377Mi ± 0%    9.689Mi ± 0%  -50.00% (p=0.000 n=10)
Collect/limit:10000000     155.04Mi ± 0%    77.52Mi ± 0%  -50.00% (p=0.000 n=10)
geomean                     32.74Mi         16.37Mi       -50.00%

                        │ BenchmarkCollect_main_no_early_stop.txt │ BenchmarkCollect_fast_collect_no_early_stop.txt │
                        │                allocs/op                │        allocs/op     vs base                    │
Collect/limit:0              76.50k ± 0%     38.25k ± 0%  -49.99% (p=0.000 n=10)
Collect/limit:100000          214.0 ± 1%      108.0 ± 0%  -49.53% (p=0.000 n=10)
Collect/limit:1000000        4.556k ± 0%     2.280k ± 0%  -49.96% (p=0.000 n=10)
Collect/limit:10000000       38.15k ± 0%     19.07k ± 0%  -50.02% (p=0.000 n=10)
geomean                      7.304k          3.661k       -49.88%
```

main (with early stop) vs fast collector (no early stop)

```text
goos: darwin
goarch: arm64
pkg: github.com/grafana/tempo/pkg/collector
                        │ BenchmarkCollect_main_early_stop.txt │ BenchmarkCollect_fast_collect_no_early_stop.txt │
                        │                sec/op                │        sec/op     vs base                       │
Collect/limit:0              452.0m ± 1%     236.1m ± 2%   -47.76% (p=0.000 n=10)
Collect/limit:100000         1.160m ± 0%     6.143m ± 0%  +429.57% (p=0.000 n=10)
Collect/limit:1000000        15.06m ± 0%     12.70m ± 1%   -15.68% (p=0.000 n=10)
Collect/limit:10000000      201.0m ± 13%     107.6m ± 1%   -46.48% (p=0.000 n=10)
geomean                      35.50m          37.52m         +5.71%

                        │ BenchmarkCollect_main_early_stop.txt │ BenchmarkCollect_fast_collect_no_early_stop.txt │
                        │                 B/op                 │        B/op     vs base                         │
Collect/limit:0             310.1Mi ± 0%    155.0Mi ± 0%  -50.00% (p=0.000 n=10)
Collect/limit:100000       1262.6Ki ± 0%    631.3Ki ± 0%  -50.00% (p=0.000 n=10)
Collect/limit:1000000      19.377Mi ± 0%    9.689Mi ± 0%  -50.00% (p=0.000 n=10)
Collect/limit:10000000     155.04Mi ± 0%    77.52Mi ± 0%  -50.00% (p=0.000 n=10)
geomean                     32.74Mi         16.37Mi       -50.00%

                        │ BenchmarkCollect_main_early_stop.txt │ BenchmarkCollect_fast_collect_no_early_stop.txt │
                        │              allocs/op               │        allocs/op     vs base                    │
Collect/limit:0              76.50k ± 0%     38.25k ± 0%  -50.00% (p=0.000 n=10)
Collect/limit:100000          214.0 ± 0%      108.0 ± 0%  -49.53% (p=0.000 n=10)
Collect/limit:1000000        4.556k ± 0%     2.280k ± 0%  -49.96% (p=0.000 n=10)
Collect/limit:10000000       38.15k ± 0%     19.07k ± 0%  -50.02% (p=0.000 n=10)
geomean                      7.304k          3.661k       -49.88%
```

Checklist