open-telemetry / opentelemetry-ebpf-profiler

The production-scale datacenter profiler (C/C++, Go, Rust, Python, Java, NodeJS, .NET, PHP, Ruby, Perl, ...)

Apache License 2.0


Benchmarking changes of the wire protocol #110

Open rockdaboot opened 1 month ago

rockdaboot commented 1 month ago

When making changes to the wire protocol, we should take into account the effect on CPU usage, memory usage, and network bandwidth. For this, we need tooling for (nearly) reproducible benchmarks.

Roughly, my thoughts are:

- record data passed to the Reporter
- replay previously recorded data (with the same order and timing!)
- record the uncompressed on-wire messages (protobuf blobs)
- a benchmark Go tool that compresses and decompresses the protobuf messages (Go because we want to measure the Go implementations of the compressors)
- a Python tool to generate diagrams / tables from the results of the Go tool

[Image: Profiling_-_Protocol_Benchmarking5]

The recorded data can be replayed multiple times, e.g. with and without a protocol implementation change, to allow comparisons of the change's effects.

florianl commented 1 month ago

When establishing and creating the OTel Profiling protocol, @petethepig invested considerable time and effort in benchmarks; see https://github.com/petethepig/opentelemetry-collector/pull/1. He also documented changes and potential options in https://docs.google.com/spreadsheets/d/1Q-6MlegV8xLYdz5WD5iPxQU2tsfodX1-CDV1WeGzyQ0/edit?gid=1732807979#gid=1732807979.

It might be worth considering building on this existing work.

athre0z commented 1 month ago

A simpler approach that @christos68k and I have tested previously is to build two profiling agents, one for each protocol you want to compare, then run them at the same time on the same machine under a heavy workload and record the sum of all message sizes. Sampling won't interrupt exactly the same traces in both agents, but if you run it for an hour or so, it should statistically give you a pretty good estimate. From previous experience looking at differential flamegraphs of two agents running on the same machine, I'd expect the error to be in the range of 0.5-1% with that approach. It's arguably more difficult for other reviewers to reproduce than @petethepig's approach or the one you describe in this issue.
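The comparison itself is simple arithmetic. As a rough sketch, assuming each agent logs the size of every message it sends, the totals can be compared as below. The function names, the sample sizes, and the 1% noise floor (taken from the upper end of the error estimate above) are illustrative assumptions, not measured values.

```go
package main

import (
	"errors"
	"fmt"
)

// totalBytes sums the sizes of all messages one agent sent.
func totalBytes(sizes []int) int64 {
	var total int64
	for _, s := range sizes {
		total += int64(s)
	}
	return total
}

// relativeChange returns (candidate - baseline) / baseline.
// A negative value means the candidate protocol sent fewer bytes.
func relativeChange(baseline, candidate int64) (float64, error) {
	if baseline <= 0 {
		return 0, errors.New("baseline must be positive")
	}
	return float64(candidate-baseline) / float64(baseline), nil
}

// exceedsNoise reports whether the magnitude of the change is larger than
// the expected measurement error of the side-by-side setup.
func exceedsNoise(change, noiseFloor float64) bool {
	if change < 0 {
		change = -change
	}
	return change > noiseFloor
}

func main() {
	// Made-up message sizes in bytes, for illustration only.
	baseline := totalBytes([]int{1000, 2000, 1500})
	candidate := totalBytes([]int{900, 1800, 1350})

	change, err := relativeChange(baseline, candidate)
	if err != nil {
		panic(err)
	}

	const noiseFloor = 0.01 // assumed: upper end of the 0.5-1% error estimate
	fmt.Printf("baseline=%d candidate=%d change=%.2f%% exceedsNoise=%v\n",
		baseline, candidate, change*100, exceedsNoise(change, noiseFloor))
}
```

A difference below the noise floor would not justify a conclusion in either direction with this setup.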

rockdaboot commented 4 weeks ago

#120 is a PoC for the ideas outlined in the issue description.