Closed codefromthecrypt closed 7 years ago
Note it may be possible to write a custom state manager with the existing interfaces (eventhough I've not tried it). https://github.com/openzipkin/brave/blob/master/brave-core/src/test/java/com/github/kristofa/brave/example/TestServerClientAndLocalSpanStateCompilation.java
It may also be possible to work around this issue by creating scoped brave instances. Ex. if you are going to make multiple child spans, create a new brave in a try-finally, and for each of them set a current server span before running your task.
ps not sure what the capabilities of this are, but maybe @hyleung has addresses this?
@adriancole that class is just a ServiceClientAndLocalSpanState
implementation backed by Ratpack's registry (the default ThreadLocal implementation won't work for Ratpack). I don't believe it would address this issue - at least if i'm understanding it correctly.
@hyleung ok just checking
Thanks for creating this issue. This is definitely something that needs to be tackled. I'm in favor of changing the underlying model. Continuing the work on the 'Continuation Local Storage' and using that sounds like a good plan.
Brave 4 might be based on a common, Java-based continuation local storage/context API that could hopefully be shared by Finagle? I envision we could port Finagle's tracer API to Java, and then build Brave's API as porcelain on top of that (for backwards/forwards-compatibility)
Any more movement on this? Would love to see support for multiple local spans. @markelliot @jkozlowski
Is anyone working on this? If not might have a crack this week?
looks like you are working on it :) Personally, I'll be out for 2 weeks starting tonight, but others here can likely help with the review.
fyi @jkozlowski is out on holiday now, but is planning to give this a crack when he gets back in a couple weeks
may have an answer here! after a while of nagging, looks like grpc will extract their propagatable context as a dep-free library https://github.com/grpc/grpc-java/pull/2226 cc @ewhauser
@adriancole I really need this for a project. We have long running processes that do several deep dives internally without calling out to a microservice. Creating nested local spans is a must for us. Is there any way I can help? Or any advice you can give on how I can pull this in? Is there a commit on a branch that I could cherry pick and build locally that makes this work? I'm okay with working with a beta implementation.
Can you try latest snapshot and this commit? Release is blocked, but we can unblock it if you find this works. https://github.com/openzipkin/brave/commit/c97129e70168762665ff3ced3cfdd022536b600c
Above is a great workaround, but we will keep this issue open until theres a long-term supportable api.
@adriancole I actually did just that earlier today. I hacked on a small Jetty instance (outside of our main app) so I could test how it's behaving. Looks good to me!
Rebuilt our main application using that version and I'm starting to integrate it. Haven't run into anything bad so far.
This was the result of the main use case I was interested in which shows a web tier application with several nestings, and nested siblings. By the way the sibling with nested 1 is mislabeled. It should be sibling with nested 2.
InheritableServerClientAndLocalSpanState going out in Brave 3.11 now
@adriancole Danke! Working like a charm.
<3 <3 <3
I'm struggling to understand how to use this thing. Trying to do something like this:
MyService client = createBravifiedFeignClient(...); // internally uses a ClientTracer with a BraveOkHttpRequestResponseInterceptor
ServerClientAndLocalSpanState state = new InheritableServerClientAndLocalSpanState(Endpoint.create("test", 1, 0));
Brave brave = new Brave.Builder(state)
.spanCollector(spanCollector)
.traceSampler(Sampler.ALWAYS_SAMPLE)
.build();
LocalTracer tracer = brave.localTracer();
tracer.startNewSpan("component", "operation");
try {
client.someMethod();
} finally {
tracer.finishSpan(42);
}
I am seeing to traces logged, but they have different trace ids. I'd expect the local span to wrap the "client" span, no? How do I connect the two?
hi robert, I think it is fair enough we should open a different issue on the topic you mentioned, as this one is getting sprawly. fair?
On Tue, Sep 13, 2016 at 6:43 AM, Robert Fink notifications@github.com wrote:
I'm struggling to understand how to use this thing. Trying to do something like this:
MyService client = createBravifiedFeignClient(...); // internally uses a ClientTracer with a BraveOkHttpRequestResponseInterceptor ServerClientAndLocalSpanState state = new InheritableServerClientAndLocalSpanState(Endpoint.create("test", 1, 0)); Brave brave = new Brave.Builder(state) .spanCollector(spanCollector) .traceSampler(Sampler.ALWAYS_SAMPLE) .build(); LocalTracer tracer = brave.localTracer(); tracer.startNewSpan("component", "operation"); try { client.someMethod(); } finally { tracer.finishSpan(42); }
I am seeing to traces logged, but they have different trace ids. I'd expect the local span to wrap the "client" span, no? How do I connect the two?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/brave/issues/166#issuecomment-246518901, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD61_jR1vs5kHCmTkyN9kPZeGIllGloks5qpdV-gaJpZM4IV2U9 .
yep
On Mon, Sep 12, 2016 at 4:27 PM Adrian Cole notifications@github.com wrote:
hi robert, I think it is fair enough we should open a different issue on the topic you mentioned, as this one is getting sprawly. fair?
On Tue, Sep 13, 2016 at 6:43 AM, Robert Fink notifications@github.com wrote:
I'm struggling to understand how to use this thing. Trying to do something like this:
MyService client = createBravifiedFeignClient(...); // internally uses a ClientTracer with a BraveOkHttpRequestResponseInterceptor
ServerClientAndLocalSpanState state = new InheritableServerClientAndLocalSpanState(Endpoint.create("test", 1, 0)); Brave brave = new Brave.Builder(state) .spanCollector(spanCollector) .traceSampler(Sampler.ALWAYS_SAMPLE) .build(); LocalTracer tracer = brave.localTracer(); tracer.startNewSpan("component", "operation"); try { client.someMethod(); } finally { tracer.finishSpan(42); }
I am seeing to traces logged, but they have different trace ids. I'd expect the local span to wrap the "client" span, no? How do I connect the two?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/brave/issues/166#issuecomment-246518901, or mute the thread < https://github.com/notifications/unsubscribe-auth/AAD61_jR1vs5kHCmTkyN9kPZeGIllGloks5qpdV-gaJpZM4IV2U9
.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/openzipkin/brave/issues/166#issuecomment-246527247, or mute the thread https://github.com/notifications/unsubscribe-auth/AGOdwcWijl4Yfh3Rw-4HOht9kymiXM2wks5qpd_FgaJpZM4IV2U9 .
The design should make commonly requested things easy. For example, a single-method call to get the span id (regardless of whether it is a client or server or otherwise span). This implies there's a single object holding the "current span" as opposed to slots by type.
https://github.com/openzipkin/brave/issues/150#issuecomment-247883226
aside for some interested in the general space of context design. In gRPC, they've noticed that the storage aspect of the context should be pluggable (similar to how SLF4J or log4j2 work). https://github.com/grpc/grpc-java/pull/2461
fixing the root problem should be possible in brave 4 #180
fyi all. This issue is still open because Brave 4 does not yet have a replacement api for in-process propagation of thread state. I summarized next steps here https://github.com/openzipkin/brave/issues/150#issuecomment-274387954
I looked at all the tools I could find around this, not just our code and discussions, but also instrumentation projects like htrace, spring cloud sleuth, opentracing and google cloud trace. I also looked at the generic baggage spec and the underlying implementation of log propagation libraries like log4j2. I've also interviewed many folks, most recently @bogdandrutu
Here's the working premises...
To split things further into responsibilities..
Key deliverables from there are..
brave.Tracer.currentContext()
(similar to com.github.kristofa.brave.CurrentSpan.get())brave.Tracer.tag(String, String)
and brave.Tracer.annotate(String)
tracer.toSpan(tracer.currentContext()).tag("foo")
CloseableWithoutExceptions brave.Tracer.withCurrentContext(TraceContext context)
The instrumentation developer is responsible for starting and stopping spans, calling tracer.withCurrentContext
whenever code leaves theirs. If it is an asynchronous library,
their job may also imply bespoke hooks to extract the current context from someplace else,
such as a stream or request object.
For example, a simple synchronous call can look like this:
// this is the api that makes sure spans in the same process are connected
TraceContext parent = tracer.currentContext();
Span span = parent == null ? tracer.newTrace() : tracer.newChild(parent);
addDetails(span, request).start();
// the "current" local here isn't actually used, it only facilitates try-with-resources
try (CloseableWithoutExceptions current : tracer.withContext(span.context())) {
return request.invoke();
} catch (Exception e ){
addError(span, e);
throw e;
} finally {
span.finish();
}
Asynchronous code is problematic as work is often scheduled on a different thread than the one that created the span. The instrumentation developer is responsible for carrying the context until it can be appropriately attached to a thread doing work. They are also responsible for making this easy (and compatible with utilities that look for tracer.currentContext()
);
TraceContext currentContext = getFromExchange();
try (CloseableWithoutExceptions current : tracer.withCurrentContext(context)) {
return executor.submit(callable); // assumes the executor looks for Tracer.currentContext()
}
The tracer provides a default implementation of thread-local storage, and manages attaching and detaching a thread context to that in response to calls to Tracer.withCurrentContext
. This may end up extensible. For example, the outer attach/detach applies the trace id, and a inner could copy that to a logging MVC context.
Assuming the developer of instrumentation did a good job, the user will have limited responsibility. They will configure thread pools or schedulers with instructions given to them.
In rare cases where they contribute directly to spans, they could use tracer.currentContext()
(or possibly tracer.tag("foo")
if we add it.
The above pattern works with java 6 (even without try-with-resources, as you can do a close on finally). It is also more efficient than using a lambda (or an anonymous class) to attach and detach a span from a thread. It is also easier to debug than lambdas as it does not cause artificial elements in the trace.
Try with resources sounds like a good replacement for lambdas. It is also easier to debug than lambdas as it does not cause artificial elements in the trace.
@cschneider thx.. I'll add your note to the rationale!
Here is the rationale of implementing Tracer.withCurrentContext as attach and revert on close. There are some notes about RxJava here, but they should be similar for other async libraries like Reactor (cc'ing @smaldini @simonbasle who might be interested)
attaching and detaching contexts have the current problems..
using inherited thread locals can result in arbitrary state being attached. For example, when a pool grows, state is copied from the thread that caused it to grow [1].
This can be tricky in async patterns where thread usage might not be well understood by the caller. For example java parallel streams or rxjava.
In many cases, the solution is a decorator which saves state at the time a task is scheduled, and applies that state when the task is invoked. One concern with this is how to handle state when the task exists normally or from an error.
In simplest case, you do not reset state when a task exits. However, other tasks could see that state and misinterpret it.
This works fine for a single-threaded worker executing non-reentrant tasks, as there's no risk of two tasks running simultaneously and overwriting eachother's state [2].
You could unconditionally clear the state. However, if the decorator was called directly (ex on the main thread), as opposed to via a scheduler, it would corrupt the caller's state.
One way to handle this could be to check the thread id, and if the same thread created the decorator and invoked it.. do nothing.
You could also save prior state and restore it following the task's execution. This would work regardless of which thread schedules the task. For example, if the schedule and invoke happen on the same calling thread, the prior state would be the same as the new state. This approach is the most expensive, but the most safe for general usage.
[1] http://blog.mabn.pl/2014/11/rxjava-logback-and-mdc-threadlocal.html [2] http://akarnokd.blogspot.my/2015/05/schedulers-part-1.html
Adding @akarnokd. Its an interesting topic, it might be difficult especially with RxJava2 and Reactor3 that heavily favor event colocation execution, e.g. a merge operator can easily run the draining of concurrent source emissions, namely Work Stealing. Would that inherit the same context ?
The problem being that these reactive libraries use in full the "location-transparent" trait of functional lambdas. They also tend to materialize their state only in a deferred fashion (akka streams materialization, subscribe/lift rx and reactor). Ultimately this is in conflict with a practice that actually does bind the execution model with its thread local state to the operators and prevent these libs to make end-to-end chains of operator fully non blocking.
So we're left with 2 options, modifying reactive libraries to accept a generic context passing orthogonal to the event passing, or pass the context as an event. Interestingly thats the same issue than application servers that maintain little to no state for a request, but a micro, intra-jvm level.
Another thing to keep in mind, operators callbacks are encouraged to be non blocking unless you use an explicit subscribeOn(Scheduler)/publishOn(Scheduler) around it, so by definition, e.g. in a Spring 5 app, if everything is non blocking we don't even need to hand off processing from Netty or Tomcat (non blocking) because at first we are non blocking besides moving data from a queue to a callback to another queue. Spring 5 doesn't offer by default to run on a scheduler controller requests, so if we want to monitor lambda spring apps later we need to address the problem at Netty level (and its event loops). Then if we do that, would that make sense at this level, decorating event loops doesn't seem to be a great plan given the overhead and their non blocking nature (if anything blocks Netty loops, then you loose your client/server).
So in conclusion, if we still go the ThreadLocal route we need Spring apps to configure a Scheduler by default to apply it and consider the mentioned issues above. If we go the spring-session like pattern where we transport some key in the event (maybe in he Subscription itself that we could try to cast instead of the OnNext(T)), then frameworks like Spring or directly reactor/rxjava could auto add custom operator that intercepts the start and stop of a given flow.
Adding @akarnokd https://github.com/akarnokd. Its an interesting topic, it might be difficult especially with RxJava2 and Reactor3 that heavily favor event colocation execution, e.g. a merge operator can easily run the draining of concurrent source emissions, namely Work Stealing. Would that inherit the same context ?
At the end of the day, the specific state we are talking about is trace identifiers, and their goal is to attribute cause. In reality cause is perhaps not cleanly split between the intent of the developer and scheduling nuance. I think the goal is primarily to model the developer's intent. So, to your question, I doubt that one would intenionally model a merge operator as the same span as an unrelated concurrent source emission. Since there's a one-to-one relationship with span and context, no I don't think they would inherit that context, and if they did it would be a leak in the instrumentation of that library, not a feature :P
The problem being that these reactive libraries use in full the "location-transparent" trait of functional lambdas. They also tend to materialize their state only in a deferred fashion (akka streams materialization, subscribe/lift rx and reactor).
You may not be able to take the same propagation approach used for one asynchronous library and use it for another. I have seen several similar RxJava scheduler hooks for zipkin, but I don't pretend to know if that applies equally to other libraries or later versions.. Simply not something I know well.
At Brave's level of abstraction, it can suggest patterns for well-understood things like synchronous code, executors, interceptors callbacks. I don't think it can yet suggest a pattern for reactive stuff, yet. In other words, there's an implicit prereq of connecting user code to its execution. On the way there, the link between a span (whenever that ends up created), and the execution of work cannot be broken.
So we're left with 2 options, modifying reactive libraries to accept a generic context passing orthogonal to the event passing, or pass the context as an event. Interestingly thats the same issue than application servers that maintain little to no state for a request, but a micro, intra-jvm level.
Maybe 3 (some sort of mapping approach which achieves the same), but honestly I have no idea all the options here. Would be nice if there was some sharing between the various reactive libraries such that even if the implementations are not the same, the patterns are understood.
Really appreciate your feedback (so quickly too!)
@adriancole Finally found a time to read all the comments and digest your ideas on how to make the usage of tracer more straightforward. I am wondering if you already considered the idea to make Span
or TraceContext
to be AutoCloseable
by default?
So to illustrate the idea based on your example, I may just write a bit more compact code (Span
is returned directly by tracer
or could be wrapped into contextual object as well, depending what you think is better):
final TraceContext parent = ...
try (final Span span: tracer.startTrace(parent)) {
addDetails(span, request);
return request.invoke();
} catch (Exception e ){
addError(span, e);
throw e;
}
The assumptions are that tracer
is smart enough to understand if new or child trace should be started (depending on parent
argument which could be even omitted and taken from the current context, no need to ask the developer to do it explicitly). Once span
is returned, the explicit closing may not be needed as try-with-resource is going to call the close
function anyway.
There are certainly pros and cons for this approach but from developer's standpoint it may cover quite a lot of typical use cases. What do you think?
@adriancole https://github.com/adriancole Finally found a time to read all the comments and digest your ideas on how to make the usage of tracer more straightforward.
hehe that was a lot of reading!
I am wondering if you already considered the idea to make Span or TraceContext to be AutoCloseable by default?
yes, it was, but was reverted as while making Span closeable looks idiomatic, it only helps if you are instrumenting a synchronous api (which is increasingly not the case). The more generic try/finally use case is attaching the span to a thread. This is needed regardless of synchronous or asynchronous code. here's the change that unhooks closeable from span in favor of nudging people towards attach/detach https://github.com/openzipkin/brave/pull/352
Thanks @adriancole , indeed span's lifecycle goes well beyond the point when it was created, make sense to stick with more generic approach. I like your idea though to add finish
(and probably attach
/ detach
or similar) to spans themselves, instead of using tracer's finishSpan
, I think it makes intention and context clear enough.
Do you think some simplifications around starting spans, picking parent from the current context, etc would be useful?
@reta wrt simplicity, almost certainly! Can you phrase simplifications in terms of before/after in the examples put "What's the role of the instrumentation developer?" in this comment? https://github.com/openzipkin/brave/issues/166#issuecomment-277634798
for implementation details of concerns like implicit span, MDC etc, let's comment on #356
Some of this might need to be tentative as we need to apply it to all instrumentation (ex #300). Changes applied to all instrumentation should hopefully quantify ease and re-use a bit.
Just to let folks know, we are working on a context passing protocol in Reactor specifically with Zipkin support in mind and as an alternative to ThreadLocal context passing. In essence we will be able to pass in an immutable context a potentially shared reference of a Span you can interact with to enrich (and soon in other lifecycle events such as termination of a flow for close span). We are introducing specific operators and API, mainly contextualize to inject attributes like: https://gist.github.com/smaldini/d407cc9f9a1d13bbde6dcd93166df8c7
Update. I'm starting to blend this in. one thing I noticed important is the relationship of Noop to the "current span" concept. The relationship of propagating "just the context" and noop is that noop is span.sampled = false. I will add this to implementation notes as I know some (ex sleuth) add trace IDs to logging context even when the span isn't sampled for reporting to zipkin.
here's a first cut for the impatient (thanks to @bogdandrutu for lots of conversation and design sessions)
https://github.com/openzipkin/brave/pull/369
I'll be fleshing out the description and rebasing the instrumentation PR over this in the next day or two
I made some notes about "current span" practice here in an old google doc which we started a couple years ago. Mainly because many of the points are not brave specific. If you have any insights or additions please add to it. https://docs.google.com/document/d/16byriP7jCi2xmLf8IveTTy5ttoGFgd9hn8hg_QkfMTU/edit#bookmark=id.l0k5cgprdeh5
update: I've rebased just about all instrumentation and did a lot of test backfilling. There's a couple more days work as I want to try a couple async frameworks we haven't used yet. plus I want to tighten the bolts on the "strict" impl which will complain during tests if you do things like leak contexts or try to close a scope from a thread you didn't open it on. I don't see any major changes, just being a bit defensive before merging it in.
final lap:
so one of the last mile issues is dealing with j.u.concurrent types like ExecutorService needed by existing instrumentation (sadly yes.. a lot of code needs ExecutorService not Executor)
I've thought a good bit on this, and think the following is the best course of action.
BraveCallable
, add overridable support methods like wrap(Callable)
and wrap(ExecutorService)
So, in code, you'd do stuff like..
CurrentTraceContext currentTraceContext = new GrpcCurrentTraceContext();
// when configuring something being instrumented...
restClient.setExecutorService(currentTraceContext.wrap(realExecutorService));
// or if you need to make your own callable for whatever reason
tracedCallable = currentTraceContext.wrap(callable);
The above would have the same behavior as brave 3's code which is to save off the current context prior to scheduling a task. When the task runs, the saved context is placed in scope and that scope is closed on exit of the task. Most of the code and tests will come directly from brave 3's ones, ensuring we end up with parity along with the new extensibility. cc @cschneider @schlosna @reta
@adriancole I think this is very convenient and extensible approach, the context should be responsible for wrapping Callable / Runnable into something traceable. I am wondering, why do you need to wrap Executor
/ ExecutorService
? Just to simplify parent span propagation?
Thanks! yeah I just added this and it seems to work fine in the instrumentation pull request. If sounds good I will probably just address any nits and merge after backporting tests from brave 3.
https://github.com/openzipkin/brave/pull/369/commits/dde3b76c4d1caa7b7e790a4230ea52b1fadc4ede
On Sat, Apr 1, 2017 at 9:01 PM, Andriy Redko notifications@github.com wrote:
@adriancole https://github.com/adriancole I think this is very convenient and extensible approach, the context should be responsible for wrapping Callable / Runnable into something traceable. I am wondering, why do you need to wrap Executor / ExecutorService? Just to simplify parent span propagation?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/brave/issues/166#issuecomment-290918756, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD61yIiRr2bx4i9TRnD_7JpG7qiA58Tks5rrkqdgaJpZM4IV2U9 .
unit tests, readme etc all ready to go https://github.com/openzipkin/brave/pull/369
This issue will close shortly, and be released as Brave 4.1 ahead of backporting all instrumentation to use it. The reason for this is that eventhough all instrumentation can use the new api, there are other concerns beyond propagation to look at when redoing instrumentation. For example, we haven't looked closely at how the auto-tagging works.
By releasing the "current span" api independently, others like CXF, ratpack etc can start using the api while we tighten up details on #350 which relies on this Api, but also relies on other things.
This is out in Brave 4.1 https://github.com/openzipkin/brave/tree/master/brave#current-span
Currently, Brave's supports fixed trace shapes defined below. The impact is generally found when people try to create more than one child span, and notice it overwrites the previous child.
https://github.com/openzipkin/brave/blob/3.6.0/brave-core/src/main/java/com/github/kristofa/brave/ServerSpanState.java https://github.com/openzipkin/brave/blob/3.6.0/brave-core/src/main/java/com/github/kristofa/brave/LocalSpanState.java https://github.com/openzipkin/brave/blob/3.6.0/brave-core/src/main/java/com/github/kristofa/brave/ClientSpanState.java
Support for variable span shapes has come up, usually in questions. Ex @galderz trying to trace infinispan, @michaelsembwever trying to make an open tracing bridge, and several gitter questions on https://gitter.im/openzipkin/zipkin.
Several options exist, and not all are either/or.