stripe / veneur

A distributed, fault-tolerant pipeline for observability data
MIT License
1.73k stars 174 forks source link

Pool memory allocations #1016

Closed andrewa-stripe closed 1 year ago

andrewa-stripe commented 1 year ago

Summary

Adds a(n opt-in) pool for the serialization of SSFSpan's

Motivation

Saw veneur prevalent in performance profiles

Test plan


# before proto.Marshal
BenchmarkSerialization/UDP_plain_span_with_metrics-10             464810          2403 ns/op         176 B/op          2 allocs/op
BenchmarkSerialization/UDP_plain_span_no_metrics-10               541394          2488 ns/op          80 B/op          2 allocs/op
BenchmarkSerialization/UDP_plain_empty_span_with_metrics-10       536200          2497 ns/op         112 B/op          2 allocs/op

# span.Marshal (if #1018 is accepted)
BenchmarkSerialization/UDP_plain_span_with_metrics-10             511609          2526 ns/op         160 B/op          1 allocs/op
BenchmarkSerialization/UDP_plain_span_no_metrics-10               635616          1916 ns/op          64 B/op          1 allocs/op
BenchmarkSerialization/UDP_plain_empty_span_with_metrics-10       572904          2157 ns/op          96 B/op          1 allocs/op

# use Pool
BenchmarkSerialization/UDP_POOLED_plain_span_with_metrics-10      510517          2371 ns/op           0 B/op          0 allocs/op
BenchmarkSerialization/UDP_POOLED_plain_span_no_metrics-10        543577          1966 ns/op           0 B/op          0 allocs/op
BenchmarkSerialization/UDP_POOLED_plain_empty_span_with_metrics-10            502544          2111 ns/op           0 B/op          0 allocs/op

Rollout/monitoring/revert plan

andrewa-stripe commented 1 year ago

~previous comment removed, it was due to a bad merge~

go test -bench BenchmarkSerialization/UDP -benchmem -benchtime=1000000x -cpu 8

BenchmarkSerialization/UDP_plain_span_with_metrics-8              100000          2761 ns/op         160 B/op          1 allocs/op
BenchmarkSerialization/UDP_plain_span_no_metrics-8                100000          2578 ns/op          64 B/op          1 allocs/op
BenchmarkSerialization/UDP_plain_empty_span_with_metrics-8        100000          3174 ns/op          96 B/op          1 allocs/op
BenchmarkSerialization/UDP_POOLED_plain_span_with_metrics-8       100000          3483 ns/op           0 B/op          0 allocs/op
BenchmarkSerialization/UDP_POOLED_plain_span_no_metrics-8         100000          3229 ns/op           0 B/op          0 allocs/op
BenchmarkSerialization/UDP_POOLED_plain_empty_span_with_metrics-8             100000          3325 ns/op           0 B/op          0 allocs/op

versus the existing tests in master

BenchmarkSerialization/UDP_plain_span_with_metrics-8             1000000          2870 ns/op         160 B/op          1 allocs/op
BenchmarkSerialization/UDP_plain_span_no_metrics-8               1000000          2425 ns/op          64 B/op          1 allocs/op
BenchmarkSerialization/UDP_plain_empty_span_with_metrics-8       1000000          2467 ns/op          96 B/op          1 allocs/op

CPU profile shows that were are limited on UDP, so the now less "time" is not a concern (in the off-path we do an extra JMP on a bool). The memory allocation should be the focus for now.