openzipkin / brave

Java distributed tracing implementation compatible with Zipkin backend services.
Apache License 2.0

Feature request: a way to get traceId into an SLF4j MDC #150

Closed: ryangardner closed this issue 7 years ago

ryangardner commented 8 years ago

We've got a lot of log aggregation software already, and it'd be very powerful if we could correlate traceIds in our logs to specific requests in Zipkin (see a slow request in Zipkin? go search for the traceId in our log aggregation and see any messages that tie into it)

It'd be nice to be able to put the traceId into the MDC every time it is set, and clear it out whenever the request is over (or the current one is set to nothing).

I think in order for this to work some kind of hook needs to be exposed.
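
To illustrate, the hook would essentially need to do something like this around each request (a minimal sketch, not Brave API; the key name is made up):

import org.slf4j.MDC;

public class MdcSketch {
    static final String TRACE_ID_KEY = "traceId"; // illustrative key name

    void handleRequest(String traceId, Runnable request) {
        MDC.put(TRACE_ID_KEY, traceId); // set when the span becomes current
        try {
            request.run();              // log statements here can reference %X{traceId}
        } finally {
            MDC.remove(TRACE_ID_KEY);   // clear when the request is over
        }
    }
}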

I can do it myself with AOP and intercepting calls to the ServerClientAndLocalSpanState - but I think that other people would like this functionality as well (It's one of the first things people ask me about in my company when I show brave / zipkin to them)

If you can think of the approach you would prefer for this, I can take a first stab at implementing it and submit a pull request with the changes

codefromthecrypt commented 8 years ago

relates to https://github.com/openzipkin/brave/pull/110 if you don't mind reading up on that.

ryangardner commented 8 years ago

Switching out the spanId generator would give another place to hook into, but it wouldn't necessarily let you know when spans are created / ended. A couple of other ways to handle this come to mind.

codefromthecrypt commented 8 years ago

ahh I remember this. Yeah, it is a bit tricky as the lifecycle of a span isn't abstracted 100% (ex: we have 3 tracers at the moment). One thing to add to the toolbox: SpanCollector.collect() is definitely the end.

spring-cloud-sleuth has an implementation like what you're mentioning. I think its scope is a bit broader, as it requires ids even when the trace isn't sampled. Anyway, prior art is often helpful: https://github.com/spring-cloud/spring-cloud-sleuth/blob/master/spring-cloud-sleuth-core/src/main/java/org/springframework/cloud/sleuth/log/Slf4jSpanLogger.java

Also of interest: opentracing's span is like a tracer, and seems to have end callbacks. How this works across threads etc is as yet undefined. In case you'd like a look: https://github.com/opentracing/opentracing-java/pull/25

ryangardner commented 8 years ago

I came up with this thing that seems to work for now - I made a decorator for the ServerClientAndLocalSpanState:


import com.github.kristofa.brave.IdConversion;
import com.github.kristofa.brave.ServerClientAndLocalSpanState;
import com.github.kristofa.brave.ServerSpan;
import com.twitter.zipkin.gen.Endpoint;
import com.twitter.zipkin.gen.Span;
import org.slf4j.MDC;

/**
 * Exposes the trace / span ids to the MDC and delegates all operations to the provided ServerClientAndLocalSpanState
 */
public class MdcDecoratingServerClientAndLocalSpanState implements ServerClientAndLocalSpanState {
    public static final String ZIPKIN_TRACEID_MDC = "zipkinTraceId";
    public static final String ZIPKIN_SPANID_MDC = "zipkinSpanId";
    public static final String ZIPKIN_PARENTID_MDC = "zipkinParentId";

    private final ServerClientAndLocalSpanState serverClientAndLocalSpanState;

    public MdcDecoratingServerClientAndLocalSpanState(ServerClientAndLocalSpanState serverClientAndLocalSpanState) {
        this.serverClientAndLocalSpanState = serverClientAndLocalSpanState;
    }

    @Override
    public ServerSpan getCurrentServerSpan() {
        return serverClientAndLocalSpanState.getCurrentServerSpan();
    }

    @Override
    public Endpoint getServerEndpoint() {
        return serverClientAndLocalSpanState.getServerEndpoint();
    }

    // clear the MDC when the CurrentServerSpan is cleared, store info in the MDC when the span state is valid
    @Override
    public void setCurrentServerSpan(ServerSpan span) {
        if (span == null || span.getSpan() == null || Boolean.FALSE.equals(span.getSample())) {
            MDC.remove(ZIPKIN_TRACEID_MDC);
            MDC.remove(ZIPKIN_SPANID_MDC);
            MDC.remove(ZIPKIN_PARENTID_MDC);
        } else {
            MDC.put(ZIPKIN_TRACEID_MDC, IdConversion.convertToString(span.getSpan().getTrace_id()));
            MDC.put(ZIPKIN_SPANID_MDC, IdConversion.convertToString(span.getSpan().getId()));
            if (span.getSpan().getParent_id() != null) {
                MDC.put(ZIPKIN_PARENTID_MDC, IdConversion.convertToString(span.getSpan().getParent_id()));
            } else {
                MDC.remove(ZIPKIN_PARENTID_MDC);
            }
        }
        serverClientAndLocalSpanState.setCurrentServerSpan(span);
    }

    @Override
    public Endpoint getClientEndpoint() {
        return serverClientAndLocalSpanState.getClientEndpoint();
    }

    @Override
    public Span getCurrentClientSpan() {
        return serverClientAndLocalSpanState.getCurrentClientSpan();
    }

    @Override
    public void setCurrentClientSpan(Span span) {
        serverClientAndLocalSpanState.setCurrentClientSpan(span);
    }

    @Override
    public void setCurrentClientServiceName(String serviceName) {
        serverClientAndLocalSpanState.setCurrentClientServiceName(serviceName);
    }

    @Override
    public Boolean sample() {
        return serverClientAndLocalSpanState.sample();
    }

    @Override
    public Span getCurrentLocalSpan() {
        return serverClientAndLocalSpanState.getCurrentLocalSpan();
    }

    @Override
    public void setCurrentLocalSpan(Span span) {
        serverClientAndLocalSpanState.setCurrentLocalSpan(span);
    }
}
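
To use it, I just wrap the normal state when building Brave. Here's roughly the wiring (a sketch; I'm assuming the Brave.Builder constructor that accepts a ServerClientAndLocalSpanState and the ThreadLocalServerClientAndLocalSpanState delegate, so adjust for the version in use):

import com.github.kristofa.brave.Brave;
import com.github.kristofa.brave.ServerClientAndLocalSpanState;
import com.github.kristofa.brave.ThreadLocalServerClientAndLocalSpanState;

public class BraveMdcWiring {
    public static Brave build(int ip, int port, String serviceName) {
        // the usual thread-local state, wrapped so the MDC stays in sync
        ServerClientAndLocalSpanState state = new MdcDecoratingServerClientAndLocalSpanState(
            new ThreadLocalServerClientAndLocalSpanState(ip, port, serviceName));
        return new Brave.Builder(state).build();
    }
}

With that in place, log patterns can reference the keys, e.g. %X{zipkinTraceId} in a logback pattern.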

I'll take a look at the spring cloud sleuth stuff too

codefromthecrypt commented 8 years ago

Clever! Thanks for sharing

kristofa commented 8 years ago

Yes, clever. I didn't know about MDC so I learned something new :)

One of the interesting side effects of enabling distributed tracing and trace context propagation is that it should be straightforward to enable log correlation for requests across services. Unfortunately the current brave api doesn't make it easy to do, but it looks like what you have should work.

eirslett commented 8 years ago

If Brave standardises on something like SLF4J (instead of jul), we could do it as part of Brave core?

kristofa commented 8 years ago

The switch to JUL for the main code was done to get rid of dependencies for brave-core.

Unfortunately this code will only work for traces that are actually being sampled which can be a small percentage. For log correlation it would be nicer if we can generate span ids even if we decide we don't sample the trace.

eirslett commented 8 years ago

I was going to bring up exactly that - unlike Finagle, Brave simply stops tracing behaviour altogether if a trace is decided not to be sampled. It would be better if that were changed, so that tracing occurs anyway but the span consumer ignores unsampled spans.

eirslett commented 8 years ago

Regarding jul vs slf4j, maybe it's possible to make some kind of hook in Brave, where we can register an MDC "interceptor"? Then Brave wouldn't need a hard dependency on slf4j.
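
Something along these lines (purely hypothetical names, just to show the shape of such a hook; the slf4j piece would live outside brave-core):

// hypothetical hook brave-core could expose, with no slf4j dependency
public interface SpanStateListener {
    void onSpanActivated(long traceId, long spanId, Long parentId);
    void onSpanCleared();
}

// an slf4j implementation registered by the application
class Slf4jSpanStateListener implements SpanStateListener {
    @Override public void onSpanActivated(long traceId, long spanId, Long parentId) {
        org.slf4j.MDC.put("zipkinTraceId", com.github.kristofa.brave.IdConversion.convertToString(traceId));
        org.slf4j.MDC.put("zipkinSpanId", com.github.kristofa.brave.IdConversion.convertToString(spanId));
    }
    @Override public void onSpanCleared() {
        org.slf4j.MDC.remove("zipkinTraceId");
        org.slf4j.MDC.remove("zipkinSpanId");
    }
}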

codefromthecrypt commented 8 years ago

A problem could be that you don't reduce application overhead as a side-effect of sampling, but that may be ok for some folks. Also, if we ever do after-the-fact sampling, we'll run into this anyway. There are ways to avoid some overhead here; for example, Google has a string-concatenation api where annotation values aren't joined unless/until they are sent out of process.

You'll notice there's a lot of places we early-exit code if something wasn't sampled.

ryangardner commented 8 years ago

Indeed this does only work for requests that are already sampled. The scenario people wanted to use it for was "If I find a really long request in zipkin, and now I want to go to our log aggregation framework and look it up - if I could just paste in that traceId and then find any messages that were logged along the way that would help me gather additional information"

spring-cloud-sleuth seems to have implemented a layer around brave that always generates traceIds and spanIds even if a request isn't sampled. This doesn't seem like it should be too hard to do in brave itself, since the "sampled" header is already being passed along with the rest of the headers... there might be a few places that need to be tweaked, but I don't think it would be a major change.

SLF4j is an extremely widespread logging API and doesn't dictate a specific logging framework (someone could use slf4j over java.util.logging if they wanted, by including the slf4j JUL binding). Removing certain dependencies from brave-core seems prudent, but removing all dependencies reaches a point of diminishing returns. The logging framework is one of those rare cases where a lot of places bring in slf4j simply because JUL is considered one of the least desirable logging frameworks around.

As far as request overhead is concerned, you could still have the traceIds get passed along but have the short-circuiting if a request isn't sampled based on the "sampled" header being set to false - then you could still pass the traceId and spanIds along but only do additional things like adding annotations if the request is actually being sampled. (A lot of the short circuits are based around null checks now it seems, switching it to be around a false check seems like it wouldn't be too hard)

codefromthecrypt commented 8 years ago

There are other opinions on the matter of logging, including log4j2 etc. I don't really want us to escalate this feature into that kind of bikeshed, considering it can be non-invasively accomplished without changing the hard deps. I'm a no on adding slf4j as a hard-dependency, but we should be able to set it as an optional one to do the job.

One way is, instead of passing the trace id, to pass the propagated context around, similar to finagle's TraceId (which includes propagated context and also the sampled flag iirc). So it would be api-breaking, but a way to implement what Ryan said.

eirslett commented 8 years ago

I think the bulk of the overhead in Zipkin instrumentation is about sending span data via the transport to the collectors and storing them in the storage engine. The extra CPU cycles aren't that expensive, and the HTTP headers aren't that big. (And, it works for Twitter, apparently)

codefromthecrypt commented 8 years ago

I think the bulk of the overhead in Zipkin instrumentation is about sending span data via the transport to the collectors and storing them in the storage engine. The extra CPU cycles aren't that expensive, and the HTTP headers aren't that big. (And, it works for Twitter, apparently)

To be clear, I think we are on the same page. Use twitter's approach which is still before-the-fact sampling and early exit logic.

Ex. we collect and pass around the trace id (which is cheap and could be used for MDC etc), but still guard before locally collecting annotations via "if sampled or debug, then do more expensive things like trace a function or concat strings". That way, we can permit SLF4J or whatever, but also honor early-exit logic. Make sense?

TraceId(_, _, _, Some(false), flags) if !flags.isDebug

https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/tracing/Trace.scala#L254
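
In Java, that guard might look roughly like this (illustrative names only, not current Brave API):

class SamplingGuardSketch {
    // trace ids always exist and are cheap to pass along or put in the MDC;
    // only the expensive recording is skipped for unsampled, non-debug traces
    void recordAnnotation(Boolean sampled, boolean debug, String key, String value) {
        if (Boolean.FALSE.equals(sampled) && !debug) {
            return; // early exit, mirroring the Finagle guard above
        }
        String annotation = key + "=" + value; // the string concat we avoided above
        // ... collect the annotation for reporting
    }
}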

eirslett commented 8 years ago

Ex. we collect and pass around the trace id

Yes, that's what we should do.

Brave's current implementation (https://github.com/openzipkin/brave/blob/2d19a27bc917e679b3527483c114d0a86949e440/brave-core/src/main/java/com/github/kristofa/brave/ClientTracer.java#L100) is to ditch the trace id altogether, and reset, as if no tracing had ever happened. I think we should change the way Brave works, so that it acts more like Finagle.

codefromthecrypt commented 8 years ago

ok, unless someone else says otherwise, I'll raise a separate PR to rejig the ids. Afterwards, we can use this to implement MDC logging even when we aren't "actively tracing".

kristofa commented 8 years ago

@adriancole Great that you want to implement this!

With Finagle we create a logger which decorates a slf4j logger and which prepends the trace + span + parent span id to every log message (by doing Trace.idOption.getOrElse("unknown")).
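
A rough Java equivalent would be a delegating logger (a sketch with only one method shown; a real one would cover the whole Logger interface, and the MDC keys are whatever your state decorator uses):

import org.slf4j.Logger;
import org.slf4j.MDC;

public class TracePrefixingLogger {
    private final Logger delegate;

    public TracePrefixingLogger(Logger delegate) {
        this.delegate = delegate;
    }

    private String prefix() {
        String traceId = MDC.get("zipkinTraceId");
        String spanId = MDC.get("zipkinSpanId");
        return "[" + (traceId != null ? traceId : "unknown") + "/"
            + (spanId != null ? spanId : "unknown") + "] ";
    }

    public void info(String msg) {
        delegate.info(prefix() + msg); // same idea for debug/warn/error...
    }
}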

testark commented 8 years ago

Has any progress been made on this lately?

codefromthecrypt commented 8 years ago

No progress; I got displaced by server work and then holiday. Regrouping now.

eirslett commented 8 years ago

One thing that worked well in zipkin-js was to make the context api pluggable. In Brave, it's hard-coded to be a ThreadLocal. I'm sure we should be able to make something similar in Brave. Then, people could add their own context, or we could have an interface with onChange handlers that people can hook into to build SLF4J support in a loosely coupled fashion.

kuroneko25 commented 7 years ago

With Finagle we create a logger which decorates a slf4j logger and which prepends the trace + span + parent span id to every log message (by doing Trace.idOption.getOrElse("unknown")).

This is exactly what we are trying to do in our application. Has any progress been made on this? If not what is the closest work around that we could implement today?

codefromthecrypt commented 7 years ago

No progress, but to Eirik's point, we should make sure any new Context api has a means to do this.

Closest would be to make a wrapping ServerClientAndLocalSpanState, and associate/disassociate the trace id with the MDC there.

@Override
public void setCurrentServerSpan(ServerSpan span) {
    if (span == null) {
        MDC.remove("your-trace-id-key");
    } else {
        MDC.put("your-trace-id-key", IdConversion.convertToString(span.getSpan().getTrace_id()));
    }
    delegate.setCurrentServerSpan(span);
}

// repeat for similar things...


tmack8001 commented 7 years ago

This is very similar to the behavior I want in an application of mine that uses brave. Any further progress? How can I help make this a reality for us all?

codefromthecrypt commented 7 years ago

FWIW @eirslett's suggestion of a Context (basically a trace id holder) which is decoupled from thread locals is the way to go. This is used by zipkin-js, and also in finagle I believe. Even log4j2 recently started supporting a way to control whether thread locals are used for its context.

TL;DR; we should decouple brave in some generic fashion from thread locals, but also keep in mind the use-case of synchronizing with logging contexts. If someone wants to spike an impl please do! In absence, I'm pretty close to being interested enough to take it on :)

kuroneko25 commented 7 years ago

Anything new for this in Brave 4?

codefromthecrypt commented 7 years ago

not yet, as brave 4 doesn't have any code yet for in-process propagation of any sort. pull requests welcome!

codefromthecrypt commented 7 years ago

Here's a recap of where this is (and practical knowledge about things).

Brave 4 was designed to unhook brave from a specific thread model. This was direly needed in order to properly address topics of in-process propagation and retrieval of the "current span". For example, we definitely don't need ServerSpan anymore (since all spans are now the same and they propagate sampled state). This simplifies the problem space such that things like MDC integration are easier to deal with.

Brave 4 is similar to Finagle, now. For example, there's an internal division between a trace context (which is the only thing used by MDC), and the span which does heavy lifting (ex adding tags). Moreover, you now always have a trace context, even when not sampled. This means MDC can rely on identifiers being present even if they aren't reported to Zipkin... a big deal.

So, how do we get to MDC integration (slf4j, log4j2 etc), then?!

Well, firstly we need a "current span" api, which will have a default implementation with possibly pluggable storage. For example, zipkin-js has this, and we might want to re-use grpc's propagation model, when on their platform. Taking lessons learned from Finagle, this almost certainly should be a reference to a TraceContext (which could be converted to a span via Tracer.newSpan as needed).

Internally, we need hooks to manage the attachment and re-attachment of span context to that api (note that the mutation might actually be internal or an SPI; in other words, users may never see how a span is attached). The lifecycle will typically end up routing through brave's recorder (on joinSpan, newTrace, newSpan and the corresponding finish/flush statements).

In the process of implementing, we can decide how much of this is practical to expose, as complexity is easy here. For example, should SLF4J be special cased (as it is in spring-cloud-sleuth)? or should we allow an event model which can synchronize multiple state stores on activation or deactivation of a span (ex SLF4J and also Brave 3 or gRPC context?)

On a practical note, I have no problem with and in fact am actively interested in doing this work. However, I'd love someone to offer to help. I round-robin through zipkin projects and am currently working on server improvements. This means probably a week or two before I can start, maybe longer. Regardless, I hope this stuff helps in case someone wants to know more or help.
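
To make the "current span" idea concrete, here's the rough shape I have in mind (hypothetical names, not a final API; TraceContext stands in for brave 4's trace context type):

import brave.propagation.TraceContext;

// hypothetical: a pluggable holder for the current trace context
public interface CurrentContext {
    TraceContext get();                          // null when nothing is attached
    AutoCloseable attach(TraceContext context);  // closing detaches / restores the previous one
}

// an MDC-synchronizing wrapper around whatever the default (thread-local) impl is
class MdcCurrentContext implements CurrentContext {
    private final CurrentContext delegate;

    MdcCurrentContext(CurrentContext delegate) {
        this.delegate = delegate;
    }

    @Override public TraceContext get() {
        return delegate.get();
    }

    @Override public AutoCloseable attach(TraceContext context) {
        org.slf4j.MDC.put("traceId", Long.toHexString(context.traceId()));
        AutoCloseable scope = delegate.attach(context);
        return () -> {
            scope.close();
            org.slf4j.MDC.remove("traceId");
        };
    }
}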

codefromthecrypt commented 7 years ago

@kmaci3k fyi here's the issue about whatever api we end up with that supports SLF4J integration.

codefromthecrypt commented 7 years ago

summarized a design I have in mind for the current span problem here https://github.com/openzipkin/brave/issues/166#issuecomment-277634798

codefromthecrypt commented 7 years ago

https://github.com/openzipkin/brave/pull/389 here's the impl

kuroneko25 commented 7 years ago

How do we get this in log4j2?
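
For reference, log4j2's analogue of the MDC is ThreadContext, so I'd guess the same decorator idea translates to something like this sketch:

import org.apache.logging.log4j.ThreadContext;

class Log4j2Sketch {
    static void onSpanAttached(String traceId) {
        ThreadContext.put("traceId", traceId);   // readable via %X{traceId} in log4j2 patterns
    }

    static void onSpanDetached() {
        ThreadContext.remove("traceId");
    }
}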