AsyncReporter/SpanHandler: make `queuedMaxBytes=0` disable pre-flight size checks

codefromthecrypt commented 5 months ago

Feature

When reporter was first created, it had a rare feature that allows you to limit both the count (queuedMaxSpans) and total size of spans (queuedMaxBytes) in the backlog. This isn't perfectly correct with regards to memory size, but can be adjusted to limit memory, nevertheless.

Some sites will not be terribly concerned about the max pending bytes and by avoiding that can save time in reporting.

Let's allow a setting queuedMaxBytes=0 to disable guards in AsyncReporter and AsyncSpanHandler. When this is in-place, a ByteBoundedQueue is no longer used.

Instead, a normal bounded queue is, that drops when the count is higher than queuedMaxSpans. All other work is done on the AsyncSpanReporter thread, including dropping if the size of a queued span is larger than messageMaxBytes

Rationale

This should solve https://github.com/openzipkin/brave/issues/1426 by reducing synchronized time when end is called in brave to negligible.

Now, a large span will take microsecond to size (in json or proto). This size function was very carefully done to be performant.

MutableSpanBytesEncoderBenchmarks.encode_bigClientSpan_proto:p0.95                    sample             0.916            us/op
MutableSpanBytesEncoderBenchmarks.encode_bigClientSpan_proto:p0.99                    sample             1.042            us/op
MutableSpanBytesEncoderBenchmarks.encode_bigClientSpan_proto:p0.999                   sample            10.496            us/op
MutableSpanBytesEncoderBenchmarks.encode_bigClientSpan_proto:p0.9999                  sample            19.380            us/op
MutableSpanBytesEncoderBenchmarks.encode_bigClientSpan_proto:p1.00                    sample           488.960            us/op

More time would be in edge cases, or if someone is reporting 100pct (despite that being very much unadvised), or otherwise creating a large amount of data at a high rate.

The spanhandler.end function is under a synchronized lock in brave, someone using virtual threads and -Djdk.tracePinnedThreads=full may see a warning. It is unknown what amount of time triggers this.

Even if when this log has no affect on real-world performance, the topic is distracting and costs maintainer time to answer. Effort spent to minimize this kind of time, is also worth it.

Solely looking at a count backlog isn't as useful in memory limiting, but it is very cheap (faster to check). Also, I don't think any other reporter does memory bounds anyway, so it doesn't limit competitiveness of the reporter library.

The main thing is turning off size guards results in far less time under a lock, hopefully so minuscule pinned thread watchers can't detect it even under load.

Example Scenario

Mainly, this is all about not distracting those people experimenting or using VirtualThreads. Brave has synchronized code for a good reason, and it isn't harmful, except possibly due to work done when using async reporting here.

openzipkin / zipkin-reporter-java