lightstep / lightstep-tracer-go

The Lightstep distributed tracing library for Go
https://lightstep.com
MIT License

Only emit spans dropped within a status interval in the status event report #265

MatthewDolan closed 3 years ago

MatthewDolan commented 3 years ago

R: @kayousterhout @codeboten

CC: @neena @akehlenbeck

Summary of Change

I discovered a bug in reporting dropped and errored spans. Specifically, if span reports are being rejected on the wire, we merge those reports (including the dropped & errored spans) back into the next report. This makes sense because we assume that the server hasn't successfully processed the report and we try again with the totality of the report.

Unfortunately, we also emit a status event with each report attempt, successful or not. On an unsuccessful attempt, we emit a status event with the dropped span count and then merge that counter back into the next report attempt. When the next attempt completes (whether successfully or unsuccessfully), that counter is reported again, so it is double counted. This is exacerbated during an outage because each failed report adds another multiplier to the counts (2x, then 3x, then 4x, and so on). In certain outage situations, I observed the reported count of dropped spans reach 30x the actual count.
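To make the multiplier concrete, here is a minimal sketch of the behavior described above. It is not the library's code: `simulateOutage` and its parameters are hypothetical names, and it assumes every attempt fails except the last one.

```go
package main

// simulateOutage is a hypothetical model of the pre-fix behavior, not code
// from lightstep-tracer-go. drops[i] is the number of spans newly dropped
// before report attempt i. Every attempt is assumed to fail except the last.
// It returns the true number of dropped spans and the total a caller would
// compute by summing the counts carried in the status events.
func simulateOutage(drops []int64) (actual, summed int64) {
	var pending int64 // dropped-span counter carried by the pending report
	for i, d := range drops {
		pending += d
		actual += d

		// A status event is emitted on every attempt and carries the
		// full pending counter, including drops already emitted earlier.
		summed += pending

		if i == len(drops)-1 {
			// Success: the report was accepted, so the counter resets.
			pending = 0
		}
		// On failure the counter stays in pending, which models merging
		// the rejected report back into the next one.
	}
	return actual, summed
}
```

Under these assumptions, 10 spans dropped before the first attempt followed by three more attempts with no new drops (`[]int64{10, 0, 0, 0}`) gives an actual count of 10 but a summed count of 40, which is the 4x case.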

This pull request resolves the issue by tracking what has already been reported and including only newly dropped spans in each status event. That makes it much easier for the caller to simply increment a counter with each update. It shouldn't impact how these counts are reported to Lightstep, because the code still preserves the full value and serializes it into the report. (There is still a chance that these counts will be double counted on the receiving side, but resolving that is more difficult because of typical distributed-system fault tolerance issues, so it isn't part of this pull request.)