openzipkin / brave

Java distributed tracing implementation compatible with Zipkin backend services.
Apache License 2.0
2.36k stars 713 forks source link

Support multiple child spans from the same thread, aka variable trace shapes #166

Closed codefromthecrypt closed 7 years ago

codefromthecrypt commented 8 years ago

Currently, Brave's supports fixed trace shapes defined below. The impact is generally found when people try to create more than one child span, and notice it overwrites the previous child.

https://github.com/openzipkin/brave/blob/3.6.0/brave-core/src/main/java/com/github/kristofa/brave/ServerSpanState.java https://github.com/openzipkin/brave/blob/3.6.0/brave-core/src/main/java/com/github/kristofa/brave/LocalSpanState.java https://github.com/openzipkin/brave/blob/3.6.0/brave-core/src/main/java/com/github/kristofa/brave/ClientSpanState.java

Support for variable span shapes has come up, usually in questions. Ex @galderz trying to trace infinispan, @michaelsembwever trying to make an open tracing bridge, and several gitter questions on https://gitter.im/openzipkin/zipkin.

Several options exist, and not all are either/or.

codefromthecrypt commented 8 years ago

Note it may be possible to write a custom state manager with the existing interfaces (eventhough I've not tried it). https://github.com/openzipkin/brave/blob/master/brave-core/src/test/java/com/github/kristofa/brave/example/TestServerClientAndLocalSpanStateCompilation.java

It may also be possible to work around this issue by creating scoped brave instances. Ex. if you are going to make multiple child spans, create a new brave in a try-finally, and for each of them set a current server span before running your task.

codefromthecrypt commented 8 years ago

ps not sure what the capabilities of this are, but maybe @hyleung has addresses this?

https://github.com/hyleung/ratpack-zipkin/blob/master/ratpack-zipkin/src/main/java/ratpack/zipkin/internal/RatpackServerClientLocalSpanState.java

hyleung commented 8 years ago

@adriancole that class is just a ServiceClientAndLocalSpanState implementation backed by Ratpack's registry (the default ThreadLocal implementation won't work for Ratpack). I don't believe it would address this issue - at least if i'm understanding it correctly.

codefromthecrypt commented 8 years ago

@hyleung ok just checking

kristofa commented 8 years ago

Thanks for creating this issue. This is definitely something that needs to be tackled. I'm in favor of changing the underlying model. Continuing the work on the 'Continuation Local Storage' and using that sounds like a good plan.

eirslett commented 8 years ago

Brave 4 might be based on a common, Java-based continuation local storage/context API that could hopefully be shared by Finagle? I envision we could port Finagle's tracer API to Java, and then build Brave's API as porcelain on top of that (for backwards/forwards-compatibility)

uschi2000 commented 8 years ago

Any more movement on this? Would love to see support for multiple local spans. @markelliot @jkozlowski

jkozlowski commented 8 years ago

Is anyone working on this? If not might have a crack this week?

codefromthecrypt commented 8 years ago

looks like you are working on it :) Personally, I'll be out for 2 weeks starting tonight, but others here can likely help with the review.

codefromthecrypt commented 8 years ago

fyi @jkozlowski is out on holiday now, but is planning to give this a crack when he gets back in a couple weeks

codefromthecrypt commented 8 years ago

may have an answer here! after a while of nagging, looks like grpc will extract their propagatable context as a dep-free library https://github.com/grpc/grpc-java/pull/2226 cc @ewhauser

Jazzepi commented 7 years ago

@adriancole I really need this for a project. We have long running processes that do several deep dives internally without calling out to a microservice. Creating nested local spans is a must for us. Is there any way I can help? Or any advice you can give on how I can pull this in? Is there a commit on a branch that I could cherry pick and build locally that makes this work? I'm okay with working with a beta implementation.

codefromthecrypt commented 7 years ago

Can you try latest snapshot and this commit? Release is blocked, but we can unblock it if you find this works. https://github.com/openzipkin/brave/commit/c97129e70168762665ff3ced3cfdd022536b600c

Above is a great workaround, but we will keep this issue open until theres a long-term supportable api.

Jazzepi commented 7 years ago

@adriancole I actually did just that earlier today. I hacked on a small Jetty instance (outside of our main app) so I could test how it's behaving. Looks good to me!

Rebuilt our main application using that version and I'm starting to integrate it. Haven't run into anything bad so far.

This was the result of the main use case I was interested in which shows a web tier application with several nestings, and nested siblings. By the way the sibling with nested 1 is mislabeled. It should be sibling with nested 2.

image

codefromthecrypt commented 7 years ago

InheritableServerClientAndLocalSpanState going out in Brave 3.11 now

Jazzepi commented 7 years ago

@adriancole Danke! Working like a charm.

<3 <3 <3

uschi2000 commented 7 years ago

I'm struggling to understand how to use this thing. Trying to do something like this:

        MyService client = createBravifiedFeignClient(...); // internally uses a ClientTracer with a BraveOkHttpRequestResponseInterceptor

        ServerClientAndLocalSpanState state = new InheritableServerClientAndLocalSpanState(Endpoint.create("test", 1, 0));
        Brave brave = new Brave.Builder(state)
                .spanCollector(spanCollector)
                .traceSampler(Sampler.ALWAYS_SAMPLE)
                .build();
        LocalTracer tracer = brave.localTracer();
        tracer.startNewSpan("component", "operation");
        try {
            client.someMethod();
        } finally {
            tracer.finishSpan(42);
        }

I am seeing to traces logged, but they have different trace ids. I'd expect the local span to wrap the "client" span, no? How do I connect the two?

codefromthecrypt commented 7 years ago

hi robert, I think it is fair enough we should open a different issue on the topic you mentioned, as this one is getting sprawly. fair?

On Tue, Sep 13, 2016 at 6:43 AM, Robert Fink notifications@github.com wrote:

I'm struggling to understand how to use this thing. Trying to do something like this:

    MyService client = createBravifiedFeignClient(...); // internally uses a ClientTracer with a BraveOkHttpRequestResponseInterceptor

    ServerClientAndLocalSpanState state = new InheritableServerClientAndLocalSpanState(Endpoint.create("test", 1, 0));
    Brave brave = new Brave.Builder(state)
            .spanCollector(spanCollector)
            .traceSampler(Sampler.ALWAYS_SAMPLE)
            .build();
    LocalTracer tracer = brave.localTracer();
    tracer.startNewSpan("component", "operation");
    try {
        client.someMethod();
    } finally {
        tracer.finishSpan(42);
    }

I am seeing to traces logged, but they have different trace ids. I'd expect the local span to wrap the "client" span, no? How do I connect the two?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/brave/issues/166#issuecomment-246518901, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD61_jR1vs5kHCmTkyN9kPZeGIllGloks5qpdV-gaJpZM4IV2U9 .

uschi2000 commented 7 years ago

yep

On Mon, Sep 12, 2016 at 4:27 PM Adrian Cole notifications@github.com wrote:

hi robert, I think it is fair enough we should open a different issue on the topic you mentioned, as this one is getting sprawly. fair?

On Tue, Sep 13, 2016 at 6:43 AM, Robert Fink notifications@github.com wrote:

I'm struggling to understand how to use this thing. Trying to do something like this:

MyService client = createBravifiedFeignClient(...); // internally uses a ClientTracer with a BraveOkHttpRequestResponseInterceptor

ServerClientAndLocalSpanState state = new InheritableServerClientAndLocalSpanState(Endpoint.create("test", 1, 0)); Brave brave = new Brave.Builder(state) .spanCollector(spanCollector) .traceSampler(Sampler.ALWAYS_SAMPLE) .build(); LocalTracer tracer = brave.localTracer(); tracer.startNewSpan("component", "operation"); try { client.someMethod(); } finally { tracer.finishSpan(42); }

I am seeing to traces logged, but they have different trace ids. I'd expect the local span to wrap the "client" span, no? How do I connect the two?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/brave/issues/166#issuecomment-246518901, or mute the thread < https://github.com/notifications/unsubscribe-auth/AAD61_jR1vs5kHCmTkyN9kPZeGIllGloks5qpdV-gaJpZM4IV2U9

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/openzipkin/brave/issues/166#issuecomment-246527247, or mute the thread https://github.com/notifications/unsubscribe-auth/AGOdwcWijl4Yfh3Rw-4HOht9kymiXM2wks5qpd_FgaJpZM4IV2U9 .

codefromthecrypt commented 7 years ago

The design should make commonly requested things easy. For example, a single-method call to get the span id (regardless of whether it is a client or server or otherwise span). This implies there's a single object holding the "current span" as opposed to slots by type.

https://github.com/openzipkin/brave/issues/150#issuecomment-247883226

codefromthecrypt commented 7 years ago

aside for some interested in the general space of context design. In gRPC, they've noticed that the storage aspect of the context should be pluggable (similar to how SLF4J or log4j2 work). https://github.com/grpc/grpc-java/pull/2461

codefromthecrypt commented 7 years ago

fixing the root problem should be possible in brave 4 #180

codefromthecrypt commented 7 years ago

fyi all. This issue is still open because Brave 4 does not yet have a replacement api for in-process propagation of thread state. I summarized next steps here https://github.com/openzipkin/brave/issues/150#issuecomment-274387954

codefromthecrypt commented 7 years ago

I looked at all the tools I could find around this, not just our code and discussions, but also instrumentation projects like htrace, spring cloud sleuth, opentracing and google cloud trace. I also looked at the generic baggage spec and the underlying implementation of log propagation libraries like log4j2. I've also interviewed many folks, most recently @bogdandrutu

Here's the working premises...

To split things further into responsibilities..

Key deliverables from there are..

What's the role of the instrumentation developer?

The instrumentation developer is responsible for starting and stopping spans, calling tracer.withCurrentContext whenever code leaves theirs. If it is an asynchronous library, their job may also imply bespoke hooks to extract the current context from someplace else, such as a stream or request object.

For example, a simple synchronous call can look like this:

// this is the api that makes sure spans in the same process are connected
TraceContext parent = tracer.currentContext();
Span span = parent == null ? tracer.newTrace() : tracer.newChild(parent);
addDetails(span, request).start();
// the "current" local here isn't actually used, it only facilitates try-with-resources
try (CloseableWithoutExceptions  current : tracer.withContext(span.context())) {
  return request.invoke();
} catch (Exception e ){
  addError(span, e);
  throw e;
} finally {
  span.finish();
}

Asynchronous code is problematic as work is often scheduled on a different thread than the one that created the span. The instrumentation developer is responsible for carrying the context until it can be appropriately attached to a thread doing work. They are also responsible for making this easy (and compatible with utilities that look for tracer.currentContext());

TraceContext currentContext = getFromExchange();
try (CloseableWithoutExceptions  current : tracer.withCurrentContext(context)) {
  return executor.submit(callable); // assumes the executor looks for Tracer.currentContext()
}

What's the role of the Tracer?

The tracer provides a default implementation of thread-local storage, and manages attaching and detaching a thread context to that in response to calls to Tracer.withCurrentContext. This may end up extensible. For example, the outer attach/detach applies the trace id, and a inner could copy that to a logging MVC context.

What's the role of the instrumentation user?

Assuming the developer of instrumentation did a good job, the user will have limited responsibility. They will configure thread pools or schedulers with instructions given to them.

In rare cases where they contribute directly to spans, they could use tracer.currentContext() (or possibly tracer.tag("foo") if we add it.

why not lambdas?!

The above pattern works with java 6 (even without try-with-resources, as you can do a close on finally). It is also more efficient than using a lambda (or an anonymous class) to attach and detach a span from a thread. It is also easier to debug than lambdas as it does not cause artificial elements in the trace.

cschneider commented 7 years ago

Try with resources sounds like a good replacement for lambdas. It is also easier to debug than lambdas as it does not cause artificial elements in the trace.

codefromthecrypt commented 7 years ago

@cschneider thx.. I'll add your note to the rationale!

codefromthecrypt commented 7 years ago

Here is the rationale of implementing Tracer.withCurrentContext as attach and revert on close. There are some notes about RxJava here, but they should be similar for other async libraries like Reactor (cc'ing @smaldini @simonbasle who might be interested)

attach/detach

attaching and detaching contexts have the current problems..

using inherited thread locals can result in arbitrary state being attached. For example, when a pool grows, state is copied from the thread that caused it to grow [1].

This can be tricky in async patterns where thread usage might not be well understood by the caller. For example java parallel streams or rxjava.

In many cases, the solution is a decorator which saves state at the time a task is scheduled, and applies that state when the task is invoked. One concern with this is how to handle state when the task exists normally or from an error.

apply state

In simplest case, you do not reset state when a task exits. However, other tasks could see that state and misinterpret it.

This works fine for a single-threaded worker executing non-reentrant tasks, as there's no risk of two tasks running simultaneously and overwriting eachother's state [2].

apply state and clear

You could unconditionally clear the state. However, if the decorator was called directly (ex on the main thread), as opposed to via a scheduler, it would corrupt the caller's state.

One way to handle this could be to check the thread id, and if the same thread created the decorator and invoked it.. do nothing.

apply revert

You could also save prior state and restore it following the task's execution. This would work regardless of which thread schedules the task. For example, if the schedule and invoke happen on the same calling thread, the prior state would be the same as the new state. This approach is the most expensive, but the most safe for general usage.

[1] http://blog.mabn.pl/2014/11/rxjava-logback-and-mdc-threadlocal.html [2] http://akarnokd.blogspot.my/2015/05/schedulers-part-1.html

smaldini commented 7 years ago

Adding @akarnokd. Its an interesting topic, it might be difficult especially with RxJava2 and Reactor3 that heavily favor event colocation execution, e.g. a merge operator can easily run the draining of concurrent source emissions, namely Work Stealing. Would that inherit the same context ?

The problem being that these reactive libraries use in full the "location-transparent" trait of functional lambdas. They also tend to materialize their state only in a deferred fashion (akka streams materialization, subscribe/lift rx and reactor). Ultimately this is in conflict with a practice that actually does bind the execution model with its thread local state to the operators and prevent these libs to make end-to-end chains of operator fully non blocking.

So we're left with 2 options, modifying reactive libraries to accept a generic context passing orthogonal to the event passing, or pass the context as an event. Interestingly thats the same issue than application servers that maintain little to no state for a request, but a micro, intra-jvm level.

Another thing to keep in mind, operators callbacks are encouraged to be non blocking unless you use an explicit subscribeOn(Scheduler)/publishOn(Scheduler) around it, so by definition, e.g. in a Spring 5 app, if everything is non blocking we don't even need to hand off processing from Netty or Tomcat (non blocking) because at first we are non blocking besides moving data from a queue to a callback to another queue. Spring 5 doesn't offer by default to run on a scheduler controller requests, so if we want to monitor lambda spring apps later we need to address the problem at Netty level (and its event loops). Then if we do that, would that make sense at this level, decorating event loops doesn't seem to be a great plan given the overhead and their non blocking nature (if anything blocks Netty loops, then you loose your client/server).

So in conclusion, if we still go the ThreadLocal route we need Spring apps to configure a Scheduler by default to apply it and consider the mentioned issues above. If we go the spring-session like pattern where we transport some key in the event (maybe in he Subscription itself that we could try to cast instead of the OnNext(T)), then frameworks like Spring or directly reactor/rxjava could auto add custom operator that intercepts the start and stop of a given flow.

codefromthecrypt commented 7 years ago

Adding @akarnokd https://github.com/akarnokd. Its an interesting topic, it might be difficult especially with RxJava2 and Reactor3 that heavily favor event colocation execution, e.g. a merge operator can easily run the draining of concurrent source emissions, namely Work Stealing. Would that inherit the same context ?

At the end of the day, the specific state we are talking about is trace identifiers, and their goal is to attribute cause. In reality cause is perhaps not cleanly split between the intent of the developer and scheduling nuance. I think the goal is primarily to model the developer's intent. So, to your question, I doubt that one would intenionally model a merge operator as the same span as an unrelated concurrent source emission. Since there's a one-to-one relationship with span and context, no I don't think they would inherit that context, and if they did it would be a leak in the instrumentation of that library, not a feature :P

The problem being that these reactive libraries use in full the "location-transparent" trait of functional lambdas. They also tend to materialize their state only in a deferred fashion (akka streams materialization, subscribe/lift rx and reactor).

You may not be able to take the same propagation approach used for one asynchronous library and use it for another. I have seen several similar RxJava scheduler hooks for zipkin, but I don't pretend to know if that applies equally to other libraries or later versions.. Simply not something I know well.

At Brave's level of abstraction, it can suggest patterns for well-understood things like synchronous code, executors, interceptors callbacks. I don't think it can yet suggest a pattern for reactive stuff, yet. In other words, there's an implicit prereq of connecting user code to its execution. On the way there, the link between a span (whenever that ends up created), and the execution of work cannot be broken.

So we're left with 2 options, modifying reactive libraries to accept a generic context passing orthogonal to the event passing, or pass the context as an event. Interestingly thats the same issue than application servers that maintain little to no state for a request, but a micro, intra-jvm level.

Maybe 3 (some sort of mapping approach which achieves the same), but honestly I have no idea all the options here. Would be nice if there was some sharing between the various reactive libraries such that even if the implementations are not the same, the patterns are understood.

Really appreciate your feedback (so quickly too!)

reta commented 7 years ago

@adriancole Finally found a time to read all the comments and digest your ideas on how to make the usage of tracer more straightforward. I am wondering if you already considered the idea to make Span or TraceContext to be AutoCloseable by default?

So to illustrate the idea based on your example, I may just write a bit more compact code (Span is returned directly by tracer or could be wrapped into contextual object as well, depending what you think is better):

final TraceContext parent = ...
try (final Span span: tracer.startTrace(parent)) {
  addDetails(span, request);
  return request.invoke();
} catch (Exception e ){
  addError(span, e);
  throw e;
} 

The assumptions are that tracer is smart enough to understand if new or child trace should be started (depending on parent argument which could be even omitted and taken from the current context, no need to ask the developer to do it explicitly). Once span is returned, the explicit closing may not be needed as try-with-resource is going to call the close function anyway.

There are certainly pros and cons for this approach but from developer's standpoint it may cover quite a lot of typical use cases. What do you think?

codefromthecrypt commented 7 years ago

@adriancole https://github.com/adriancole Finally found a time to read all the comments and digest your ideas on how to make the usage of tracer more straightforward.

hehe that was a lot of reading!

I am wondering if you already considered the idea to make Span or TraceContext to be AutoCloseable by default?

yes, it was, but was reverted as while making Span closeable looks idiomatic, it only helps if you are instrumenting a synchronous api (which is increasingly not the case). The more generic try/finally use case is attaching the span to a thread. This is needed regardless of synchronous or asynchronous code. here's the change that unhooks closeable from span in favor of nudging people towards attach/detach https://github.com/openzipkin/brave/pull/352

reta commented 7 years ago

Thanks @adriancole , indeed span's lifecycle goes well beyond the point when it was created, make sense to stick with more generic approach. I like your idea though to add finish (and probably attach / detach or similar) to spans themselves, instead of using tracer's finishSpan, I think it makes intention and context clear enough.

Do you think some simplifications around starting spans, picking parent from the current context, etc would be useful?

codefromthecrypt commented 7 years ago

@reta wrt simplicity, almost certainly! Can you phrase simplifications in terms of before/after in the examples put "What's the role of the instrumentation developer?" in this comment? https://github.com/openzipkin/brave/issues/166#issuecomment-277634798

for implementation details of concerns like implicit span, MDC etc, let's comment on #356

Some of this might need to be tentative as we need to apply it to all instrumentation (ex #300). Changes applied to all instrumentation should hopefully quantify ease and re-use a bit.

smaldini commented 7 years ago

Just to let folks know, we are working on a context passing protocol in Reactor specifically with Zipkin support in mind and as an alternative to ThreadLocal context passing. In essence we will be able to pass in an immutable context a potentially shared reference of a Span you can interact with to enrich (and soon in other lifecycle events such as termination of a flow for close span). We are introducing specific operators and API, mainly contextualize to inject attributes like: https://gist.github.com/smaldini/d407cc9f9a1d13bbde6dcd93166df8c7

codefromthecrypt commented 7 years ago

Update. I'm starting to blend this in. one thing I noticed important is the relationship of Noop to the "current span" concept. The relationship of propagating "just the context" and noop is that noop is span.sampled = false. I will add this to implementation notes as I know some (ex sleuth) add trace IDs to logging context even when the span isn't sampled for reporting to zipkin.

codefromthecrypt commented 7 years ago

here's a first cut for the impatient (thanks to @bogdandrutu for lots of conversation and design sessions)

https://github.com/openzipkin/brave/pull/369

I'll be fleshing out the description and rebasing the instrumentation PR over this in the next day or two

codefromthecrypt commented 7 years ago

I made some notes about "current span" practice here in an old google doc which we started a couple years ago. Mainly because many of the points are not brave specific. If you have any insights or additions please add to it. https://docs.google.com/document/d/16byriP7jCi2xmLf8IveTTy5ttoGFgd9hn8hg_QkfMTU/edit#bookmark=id.l0k5cgprdeh5

codefromthecrypt commented 7 years ago

update: I've rebased just about all instrumentation and did a lot of test backfilling. There's a couple more days work as I want to try a couple async frameworks we haven't used yet. plus I want to tighten the bolts on the "strict" impl which will complain during tests if you do things like leak contexts or try to close a scope from a thread you didn't open it on. I don't see any major changes, just being a bit defensive before merging it in.

codefromthecrypt commented 7 years ago

final lap:

so one of the last mile issues is dealing with j.u.concurrent types like ExecutorService needed by existing instrumentation (sadly yes.. a lot of code needs ExecutorService not Executor)

I've thought a good bit on this, and think the following is the best course of action.

So, in code, you'd do stuff like..

CurrentTraceContext currentTraceContext = new GrpcCurrentTraceContext();

// when configuring something being instrumented...
restClient.setExecutorService(currentTraceContext.wrap(realExecutorService));

// or if you need to make your own callable for whatever reason
tracedCallable = currentTraceContext.wrap(callable);

The above would have the same behavior as brave 3's code which is to save off the current context prior to scheduling a task. When the task runs, the saved context is placed in scope and that scope is closed on exit of the task. Most of the code and tests will come directly from brave 3's ones, ensuring we end up with parity along with the new extensibility. cc @cschneider @schlosna @reta

reta commented 7 years ago

@adriancole I think this is very convenient and extensible approach, the context should be responsible for wrapping Callable / Runnable into something traceable. I am wondering, why do you need to wrap Executor / ExecutorService? Just to simplify parent span propagation?

codefromthecrypt commented 7 years ago

Thanks! yeah I just added this and it seems to work fine in the instrumentation pull request. If sounds good I will probably just address any nits and merge after backporting tests from brave 3.

https://github.com/openzipkin/brave/pull/369/commits/dde3b76c4d1caa7b7e790a4230ea52b1fadc4ede

On Sat, Apr 1, 2017 at 9:01 PM, Andriy Redko notifications@github.com wrote:

@adriancole https://github.com/adriancole I think this is very convenient and extensible approach, the context should be responsible for wrapping Callable / Runnable into something traceable. I am wondering, why do you need to wrap Executor / ExecutorService? Just to simplify parent span propagation?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/brave/issues/166#issuecomment-290918756, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD61yIiRr2bx4i9TRnD_7JpG7qiA58Tks5rrkqdgaJpZM4IV2U9 .

codefromthecrypt commented 7 years ago

unit tests, readme etc all ready to go https://github.com/openzipkin/brave/pull/369

This issue will close shortly, and be released as Brave 4.1 ahead of backporting all instrumentation to use it. The reason for this is that eventhough all instrumentation can use the new api, there are other concerns beyond propagation to look at when redoing instrumentation. For example, we haven't looked closely at how the auto-tagging works.

By releasing the "current span" api independently, others like CXF, ratpack etc can start using the api while we tighten up details on #350 which relies on this Api, but also relies on other things.

codefromthecrypt commented 7 years ago

This is out in Brave 4.1 https://github.com/openzipkin/brave/tree/master/brave#current-span