openzipkin / brave

Java distributed tracing implementation compatible with Zipkin backend services.
Apache License 2.0

Excessive allocations ExtraFactory.decorate #1421

Open kilink opened 9 months ago

kilink commented 9 months ago

I frequently see ExtraFactory.decorate at the top of the list of allocation hotspots when I profile our services. Here is an example from a 1 minute profile:

[image: 1-minute allocation profile with ExtraFactory.decorate at the top of the hotspot list]

It looks to me like this is caused by excessive resizing of the extras ArrayList. If we look at ExtraFactory.decorate, I believe this happens because of this line:

TraceContext.Builder builder = context.toBuilder().clearExtra().addExtra(claimed);

And then the subsequent loop adds Extra instances one at a time. It's an interaction between the above code and Lists.ensureMutable: we always pass Collections.emptyList() to Lists.ensureMutable, which causes the ArrayList to be initialized with an initialCapacity of 0. Due to how Java's ArrayList calculates the new capacity when growing, for sizes this small we end up reallocating A LOT. Here is a simple example that demonstrates this:

TraceContext.Builder builder = TraceContext.newBuilder().clearExtra();
builder.addExtra("foo"); // ArrayList.grow called, capacity = 0, minCapacity = 1, newCapacity = 1
builder.addExtra("bar"); // ArrayList.grow called, capacity = 1, minCapacity = 2, newCapacity = 2
builder.addExtra("baz"); // ArrayList.grow called, capacity = 2, minCapacity = 3, newCapacity = 3
builder.addExtra("quux"); // ArrayList.grow called, capacity = 3, minCapacity = 4, newCapacity = 4
builder.addExtra("blah"); // ArrayList.grow called, capacity = 4, minCapacity = 5, newCapacity = 6
builder.addExtra("blahblah"); // ArrayList.grow not called, capacity = 6
builder.addExtra("blahblahblah"); // ArrayList.grow called, capacity=6, minCapacity = 7, newCapacity = 9
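
For reference, this growth sequence can be reproduced outside of Brave with a tiny standalone simulation of ArrayList's growth rule (newCapacity = oldCapacity + oldCapacity/2, raised to the requested minimum); the class and method names here are illustrative only, not part of Brave:

```java
public class GrowthSim {
  // Simulates ArrayList's growth policy when elements are added one at a time,
  // starting from a backing array of capacity 0. Returns {arraysAllocated,
  // totalSlotsAllocated} summed across all resizes.
  static int[] simulate(int adds) {
    int capacity = 0, allocations = 0, totalSlots = 0;
    for (int size = 1; size <= adds; size++) {
      if (size > capacity) { // ArrayList.grow: new = old + old/2, at least `size`
        capacity = Math.max(capacity + (capacity >> 1), size);
        allocations++;
        totalSlots += capacity;
      }
    }
    return new int[] {allocations, totalSlots};
  }

  public static void main(String[] args) {
    // 5 adds from capacity 0: every insertion resizes (capacities 1, 2, 3, 4, 6)
    int[] r = simulate(5);
    System.out.println(r[0] + " arrays allocated, " + r[1] + " slots in total");
  }
}
```

With an initial capacity of 6 instead of 0, the same five insertions would trigger no resizes at all.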

Basically, initializing an ArrayList with a small initial capacity (such as zero) and then adding items one by one triggers excessive copying, since each of the first 5 insertions triggers a resize. At that point we've allocated 5 arrays in total, with a combined capacity of 16; we would have been better off initializing the original ArrayList with more headroom. One approach is to do just that in Lists.ensureMutable:

public static <E> List<E> ensureMutable(List<E> list) {
  if (list instanceof ArrayList) return list;
  List<E> mutable = new ArrayList<>(Math.max(6, list.size())); // some suitable default; ArrayList itself uses 10
  mutable.addAll(list);
  return mutable;
}

Another option is to try to size the list more accurately in ExtraFactory.decorate, by using extraSize or something similar.
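
As a sketch of that second option (hypothetical code, not Brave's current API): an overload of ensureMutable could accept the expected size, so a caller like ExtraFactory.decorate that knows extraSize can allocate the backing array once up front. The class below is a self-contained stand-in for illustration, not the internal Lists class:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Lists {
  // Hypothetical overload: callers that know how many elements they are about
  // to add (e.g. ExtraFactory.decorate via extraSize) pass it as expectedSize,
  // so the backing array is allocated once instead of growing from capacity 0.
  public static <E> List<E> ensureMutable(List<E> list, int expectedSize) {
    if (list instanceof ArrayList) return list; // already mutable: no copy
    List<E> mutable = new ArrayList<>(Math.max(expectedSize, list.size()));
    mutable.addAll(list);
    return mutable;
  }

  public static void main(String[] args) {
    // Pre-sized for a handful of extras, plus headroom: no resizes on add.
    List<String> extras = Lists.<String>ensureMutable(Collections.emptyList(), 4);
    extras.add("claimed");
    System.out.println(extras); // [claimed]
  }
}
```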

codefromthecrypt commented 9 months ago

hmm, so it sounds like there are a lot of modifications to "extra" after the initial request, huh.. what's causing that?

do you feel up to optimizing this? maybe improve or write an alternative to BaggagePropagationBenchmarks and show the before and after?

note: we'll need to stay at bytecode level 6. Also, because use of extra is often one item, let's try not to default to a larger initial array. Basically, try your best to keep those using static or rarely modified baggage on the fast path.

kilink commented 8 months ago

@codefromthecrypt when I looked into it, we aren't adding new extra instances here; it's typically 3 items from the root span. The issue is that when creating a child span, the data has to be copied over. It's possible we could just disallow mutations in child spans; I would need to explore whether that's viable, as I'm not sure mutation is even required. However, this still seems to be a problem for the case where mutation is needed.

codefromthecrypt commented 8 months ago

ok, let's step back from the internal type. The only public type is BaggagePropagation, so that's what we would optimize against. Let's see how this can create the overhead and work from there in a benchmark.

Can you add a benchmark that represents what you are doing to BaggagePropagationBenchmarks? As you'll notice, BaggagePropagation is carefully designed to discourage and limit dynamic fields. It wasn't necessarily designed for what you are doing.

While it's possible to do .addExtra(x).addExtra(y), this so far doesn't seem like the expected use. For example, extra isn't like a baggage field; it is used to hold "all baggage fields". The expected use, in other words, is as a container for all data in one plugin. We have a sophisticated example of this: https://github.com/openzipkin-contrib/zipkin-secondary-sampling/blob/b28af82625712589137d92ee0978cac65d25eabd/brave/src/main/java/brave/secondary_sampling/SecondarySamplingDecisions.java#L20

In other words, it isn't typical to change the number of "extras" in a context. Maybe you can be more specific about why this is done (what framework or exact use case, and why it ends up using BaggagePropagation), as possibly the call site itself could be optimized to contain its values rather than adding extras ad hoc.

In summary, to progress this, please show exactly what is going on: the ultimate use case and what code is doing this (is it in Spring?). We have to optimize for the majority of people, and sometimes that means changing how something is used instead of making everyone allocate, or adding risk to one of the easiest-to-break parts of tracing (propagation).

codefromthecrypt commented 8 months ago

meanwhile, I improved the docs, as I noticed the implications of "extra" were not easy to find in one place, so I consolidated and clarified the existing notes: https://github.com/openzipkin/brave/pull/1424

codefromthecrypt commented 8 months ago

So far, what I am guessing is happening is that something is using our internal MapExtraFactory directly, not via BaggagePropagation, which is the only non-internal type. In doing so, it may be behaving in a way we didn't write the code to handle. That, or something is making many instances of baggage propagation when only one should be there. Possibly something is excessively decorating the trace? Everyone here is a volunteer, so the hours I have spent so far come out of free time or time away from work. So, we have to be very precise about next steps to keep this affordable enough to keep hunting together.

NEXT STEP:

Identify what code is in the stack that ends up calling decorate many times. If you are using spring, please fork an example to reproduce what's going on.

kilink commented 8 months ago

While it's possible to do .addExtra(x).addExtra(y), this so far doesn't seem like the expected use.

We aren't doing that directly, that is happening in ExtraFactory decorate, because of how the code is written to copy the extra list.

In other words, it isn't typical to change the number of "extras" in a context. Maybe you can be more specific about why this is done (what framework or exact use case, and why it ends up using BaggagePropagation), as possibly the call site itself could be optimized to contain its values rather than adding extras ad hoc.

This is internal code using Spring Boot + Micrometer, with a custom Propagation implementation similar to the secondary sampling example you linked to. Just like that one, it wraps a delegate Propagation.Factory, in this case BaggagePropagation. Here is what the extras list looks like in our app after decorate is called:

[image: the three entries of the extras list, including a BaggageFields.Extra instance]

That BaggageFields.Extra instance gets added because we are using remote propagation for the baggage field, e.g. BaggagePropagationConfig.SingleBaggageField.remote(BaggageField.create("foo")). We aren't randomly adding extras; it's the copying of the above three instances that shows up in allocation profiles, in part due to the number of spans. It may well be that we are doing something suboptimal, and obviously, if possible, I will push to get that fixed at the call site. However, my approach is to chase down such issues from both ends: if the library code can produce this behavior, and it's easy to fix or avoid in the library, it seems to do little harm to fix it there as well, even if the exact use case wasn't envisioned or anticipated.

Since that secondary sampling example you linked to most closely resembles what our internal code is doing, here is an example that reproduces the issue:

Propagation.Factory baggagePropagation = BaggagePropagation.newFactoryBuilder(B3Propagation.newFactoryBuilder().build())
    .add(BaggagePropagationConfig.SingleBaggageField.remote(BaggageField.create("foo")))
    .build();
SecondarySampling sampler = SecondarySampling.newBuilder()
    .propagationFactory(baggagePropagation)
    .secondarySampler((o, req) -> true)
    .build();
Tracing tracing = Tracing.newBuilder()
    .propagationFactory(sampler)
    .build();
HttpServerRequest request = new HttpServerRequest() {
  @Override public String method() {
    return "POST";
  }

  @Override public String path() {
    return "/foo";
  }

  @Override public String url() {
    return "http://example.com/foo";
  }

  @Override public String header(String name) {
    return null;
  }

  @Override public Object unwrap() {
    return this;
  }
};
TraceContextOrSamplingFlags flags = tracing.propagation().extractor(HttpServerRequest::header)
    .extract(request);
Span span = tracing.tracer().nextSpan(flags);
tracing.tracer().newChild(span.context()); // extras copied ..
tracing.tracer().newChild(span.context()); // extras copied ..

codefromthecrypt commented 8 months ago

thanks for the context. I think the best next step is to choose the webflux example or similar, with the exact versions of micrometer you are using, because I know there is some inefficient code involved. We need to put things into perspective, and that's why I suggest eventually getting to JMH: you may find these allocations don't dominate at all, because other instrumentation causes excessive overhead. For example, our B3 doesn't allocate as much as most; when we are talking about propagation allocations, fixing a small thing while ignoring a big thing makes less sense.

NEXT STEP:

Can you make a fork or copy of brave-example that reproduces what you are doing, with the versions of Boot and Micrometer involved? We need to be precise if we care about this, and I personally will only help if this is done.

codefromthecrypt commented 8 months ago

FWIW, what I am concerned with is that it has something to do with this https://github.com/spring-projects/spring-boot/blob/1a907bf47e1dbcc3b5c8e41b47b9b40bbc42740b/spring-boot-project/spring-boot-actuator-autoconfigure/src/main/java/org/springframework/boot/actuate/autoconfigure/tracing/CompositePropagationFactory.java#L78

Spring has paid employees; rather than relying on volunteer time, we should put together the actual scenario. That's the whole reason the brave-example project was made (to conserve volunteer time).

codefromthecrypt commented 8 months ago

fyi, for your own experiments, here's how we have measured this in the past.

All of this combines with other things like propagation, and the cardinality of "extra" is also very important. For example, if we measure creation of an array outside the context of everything else, it is hard to prioritize a change. It is also hard to make the case for adding overhead for everyone. Finally, it may be that what's being done could be scoped differently such that no extras are copied at all!

To get a gist of how we conserved time with benchmarks, you can do something like this in EndToEndBenchmarks

-        + ".*(tracedCorrelatedServer_get|tracedServer_get)$")
+        + ".*(tracedBaggageServer_get|tracedServer_get)$")

then run it. I usually pick p99 and gc.alloc.rate.norm.

Then, for example, you can see that using baggage the way we did does indeed have overhead (we have lots of warnings about this); in this controlled example, allocation goes up by a bit over 5%.

Benchmark                                                        Mode     Cnt      Score     Error   Units
EndToEndBenchmarks.tracedBaggageServer_get:gc.alloc.rate.norm  sample      15  30960.100 ± 213.076    B/op
EndToEndBenchmarks.tracedBaggageServer_get:p0.99               sample            178.688             us/op
EndToEndBenchmarks.tracedServer_get:gc.alloc.rate.norm         sample      15  29354.070 ±  43.676    B/op
EndToEndBenchmarks.tracedServer_get:p0.99                      sample            180.480             us/op

This will be extremely different in Boot, for reasons mentioned, including that it operates a lot differently. Once things are put together realistically, the first thing I would ask is for you to supply your own bean of the propagation factory. Then, I would ask whether your example is scoped exactly how it is used (is the data added to your custom extra indeed mutated, vs being a constant for the whole request?).

Finally, similar to the secondary-sampling example, check whether anything produces a cardinality change in extra, especially if that is what dominates. At most this should be a couple-element array, and again, if there is no reason to copy, we can talk about alternatives to doing any copy at all. This is why we need to pin down exactly what you are doing, in context.

I have lost too much time shooting in the dark in the past, and that's why I'm asking you not to model this abstractly, but to implement exactly what you are doing so we can work from there. If your propagation code is a trade secret, replicate its interactions without exposing your business logic, in such a way that it creates the same amount of data in the same way per request.

hope this helps!

codefromthecrypt commented 8 months ago

also, I added this so you can see how changes might affect others. There may be a way to improve this without affecting others, and as always, a change is welcome in that regard. Sacrificing others just requires a lot of rationale, particularly in places where we have put in a lot of benchmark work. https://github.com/openzipkin/brave/pull/1425