openzipkin / brave

Java distributed tracing implementation compatible with Zipkin backend services.
Apache License 2.0

Tracing information in log4j2 with gRPC #425

Closed: kuroneko25 closed this issue 7 years ago

kuroneko25 commented 7 years ago

Short summary of what I want to accomplish: We are building gRPC services and want to automatically include traceId and spanId in every log entry we emit from the service so we can correlate them. The logging library we use is Log4j2.

Brave libraries we use:

<dependency>
  <groupId>io.zipkin.brave</groupId>
  <artifactId>brave-instrumentation-grpc</artifactId>
  <version>4.3.0</version>
</dependency>
<dependency>
  <groupId>io.zipkin.brave</groupId>
  <artifactId>brave-context-log4j2</artifactId>
  <version>4.3.0</version>
</dependency>

Here is how I set up the gRPC server to use Brave:

...
GrpcTracing tracing = GrpcTracing.create(
    Tracing.newBuilder().localServiceName(serviceName)
        .reporter(Reporter.NOOP)
        .sampler(Sampler.ALWAYS_SAMPLE)
        .currentTraceContext(ThreadContextCurrentTraceContext.create())
        .build());
serverBuilder.addService(ServerInterceptors.intercept(
    serviceDefinition,
    tracing.newServerInterceptor()));
serverBuilder.build();
...

We have a custom log4j2 layout class that emits JSON. The relevant piece of code looks something like this:

...
private JSONObject getTrace() {
  JSONObject json = new JSONObject();
  addValue(json, "traceId", ThreadContext.get(TRACE_ID));
  addValue(json, "spanId", ThreadContext.get(SPAN_ID));
  return json;
}
...

However, ThreadContext.get("traceId") always seems to be null. I've stepped through the debugger, and it looks like ThreadContextCurrentTraceContext is populating the TraceContext with the correct trace and span in the server interceptor. Maybe this is due to the internal threading model of gRPC, since the thread that runs the interceptor is different from the thread that runs the API code. They mention something along these lines at the bottom of this page: https://logging.apache.org/log4j/2.x/manual/thread-context.html. What is the right way to achieve this? Should I implement my own CurrentTraceContext that puts tracing information into a gRPC context instead?
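
For reference, a minimal sketch of that last idea, assuming Brave 4.x's CurrentTraceContext/Scope API; the class and key names here are illustrative, not a vetted implementation:

import brave.propagation.CurrentTraceContext;
import brave.propagation.TraceContext;
import io.grpc.Context;

// Hypothetical: keep the trace context in gRPC's Context, which gRPC itself
// propagates across its internal thread hops, instead of in a plain ThreadLocal.
final class GrpcContextCurrentTraceContext extends CurrentTraceContext {
  static final Context.Key<TraceContext> TRACE_CONTEXT_KEY = Context.key("brave-trace-context");

  @Override public TraceContext get() {
    return TRACE_CONTEXT_KEY.get(); // reads the gRPC context attached to the current thread
  }

  @Override public Scope newScope(TraceContext currentSpan) {
    // Attach a child gRPC context carrying the trace context; detach on close.
    final Context newContext = Context.current().withValue(TRACE_CONTEXT_KEY, currentSpan);
    final Context previous = newContext.attach();
    return new Scope() {
      @Override public void close() {
        newContext.detach(previous);
      }
    };
  }
}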

codefromthecrypt commented 7 years ago

I raised a PR about this. If that doesn't work OOB, it might be a case where...

kuroneko25 commented 7 years ago

I wrapped the log4j2 context and that didn't work. But just to make sure I'm accessing it correctly: is ThreadContext.get("traceId") the expected usage?

kuroneko25 commented 7 years ago

I was able to work around this issue using OpenTracing and the Brave bridge. Instead of Brave's gRPC instrumentation, I'm using these two libraries:

    <dependency>
      <groupId>io.opentracing.contrib</groupId>
      <artifactId>grpc-opentracing</artifactId>
    </dependency>
    <dependency>
      <groupId>io.opentracing.brave</groupId>
      <artifactId>brave-opentracing</artifactId>
    </dependency>

To set it up:

      Tracer tracer = BraveTracer.create(Tracing.newBuilder()
          .localServiceName(serviceName)
          .reporter(Reporter.NOOP)
          .sampler(Sampler.ALWAYS_SAMPLE)
          .build());
      ServerTracingInterceptor tracingInterceptor = new ServerTracingInterceptor(tracer);
      serverBuilder.addService(ServerInterceptors.intercept(
          serviceDefinition,
          tracingInterceptor));

And I was able to access span information in my gRPC request handling code:

      if (OpenTracingContextKey.activeSpan() != null) {
        SpanContext spanContext = OpenTracingContextKey.activeSpan().context();
        if (spanContext instanceof BraveSpanContext) {
          TraceContext traceContext = ((BraveSpanContext) spanContext).unwrap();
          System.out.println(traceContext.traceId());
          System.out.println(traceContext.spanId());
          System.out.println(traceContext.parentId());
        }
      }

It looks like the grpc-opentracing library took care of injecting the span into gRPC context. Perhaps brave could do something similar as well?
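
As a stopgap, the IDs exposed this way can be copied into log4j2's ThreadContext at the top of a handler so the JSON layout picks them up. A hedged sketch: the helper name and zero-padding are mine, and the OpenTracingContextKey package is assumed from the grpc-opentracing library used above:

import brave.opentracing.BraveSpanContext;
import brave.propagation.TraceContext;
import io.opentracing.SpanContext;
import io.opentracing.contrib.OpenTracingContextKey; // package assumed from grpc-opentracing
import org.apache.logging.log4j.ThreadContext;

class TraceToMdc {
  // Hypothetical helper: call at the start of a request handler so log entries
  // emitted on this thread carry the trace identifiers.
  static void copyTraceToThreadContext() {
    if (OpenTracingContextKey.activeSpan() == null) return;
    SpanContext spanContext = OpenTracingContextKey.activeSpan().context();
    if (spanContext instanceof BraveSpanContext) {
      TraceContext traceContext = ((BraveSpanContext) spanContext).unwrap();
      // Zero-pad to the 16-hex-char form Zipkin uses on the wire.
      ThreadContext.put("traceId", String.format("%016x", traceContext.traceId()));
      ThreadContext.put("spanId", String.format("%016x", traceContext.spanId()));
    }
  }
}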

codefromthecrypt commented 7 years ago

sorry, I haven't resumed looking at this yet today. I suspect it is only a matter of a missing callback in our code. I intend to make a test case proving this works.

codefromthecrypt commented 7 years ago

ok focused on this now (and feel free to chat on https://gitter.im/openzipkin/zipkin if you like). regardless I'll update when I get a PR with the fix. Thanks for trying the thing I posted last night even though I didn't expect it would fix it... it was just something quick to try before going to sleep.

codefromthecrypt commented 7 years ago

so looks like we had a test for the client, but not the server. looking at the server now

codefromthecrypt commented 7 years ago

https://github.com/openzipkin/brave/pull/428 for work in progress. should be fixed today and whether or not you use gRPC's context doesn't matter.

codefromthecrypt commented 7 years ago

ok found the bug. it was indeed a missing scope concern. will cut a patch release shortly

codefromthecrypt commented 7 years ago

4.3.2 on the way. let me know if it doesn't fix the issue

kuroneko25 commented 7 years ago

It doesn't look like the issue is fully resolved. I'm still not getting data in ThreadContext from my request handling code. I stepped through the debugger, and it looks like ThreadContextCurrentTraceContext.newScope gets called once in the interceptor chain on the main event loop thread, and a second time on the request handling thread, but only at onCompleted. The scope is then closed immediately.

    System.out.println("before complete: " + ThreadContext.get("traceId")); // null
    responseObserver.onNext(responseForRequest(request));
    responseObserver.onCompleted();
    System.out.println("after complete: " + ThreadContext.get("traceId")); // null

Maybe I'm missing something on the correct usage pattern?

devinsba commented 7 years ago

I don't know gRPC well enough to fix this, but I can provide an example that I know works for spring-sleuth:

import io.grpc.ForwardingServerCallListener;
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.sleuth.Span;
import org.springframework.cloud.sleuth.SpanExtractor;
import org.springframework.cloud.sleuth.Tracer;

public class GrpcTracingServerInterceptor implements ServerInterceptor {

    private final Tracer tracer;
    private final SpanExtractor<Metadata> spanExtractor;

    @Autowired
    public GrpcTracingServerInterceptor(Tracer tracer, SpanExtractor<Metadata> spanExtractor) {
        this.tracer = tracer;
        this.spanExtractor = spanExtractor;
    }

    @Override
    public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(ServerCall<ReqT, RespT> call, Metadata headers, ServerCallHandler<ReqT, RespT> next) {
        // Join the trace propagated in the gRPC metadata and mark the server-receive event.
        final Span span = spanExtractor.joinTrace(headers);
        tracer.continueSpan(span);
        span.logEvent(Span.SERVER_RECV);

        final ServerCall.Listener<ReqT> original = next.startCall(call, headers);
        return new ForwardingServerCallListener.SimpleForwardingServerCallListener<ReqT>(original) {

            @Override
            public void onComplete() {
                // Re-attach the span on the completing thread, mark server-send, and close it.
                tracer.continueSpan(span);
                span.logEvent(Span.SERVER_SEND);
                tracer.close(span);
                super.onComplete();
            }
        };
    }
}

codefromthecrypt commented 7 years ago

can you make a test for what isn't working as a pull request? here's a good place to start.

https://github.com/openzipkin/brave/blob/master/instrumentation/grpc/src/test/java/brave/grpc/ITTracingServerInterceptor.java#L146

Much quicker to understand, and this is how we'd fix it anyway

codefromthecrypt commented 7 years ago

heh, thanks Brian. In the meantime I can poke around to try and reproduce... there are multiple interaction patterns in gRPC, maybe we are missing one.

kuroneko25 commented 7 years ago

Basically, if you had a line like LOG.info("some stuff") in the implementation of io.grpc.examples.helloworld.GreeterGrpc.sayHello(), the resulting log entries would not include any trace information. I'll try to come up with a test case for that.
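
For anyone wanting to reproduce this outside the project's IT harness, here is a self-contained sketch against the in-process transport. The class name, wiring, and use of the helloworld stubs from the gRPC examples are illustrative assumptions, not the project's actual test code; before the fix, the println should show null:

import brave.Tracing;
import brave.context.log4j2.ThreadContextCurrentTraceContext;
import brave.grpc.GrpcTracing;
import io.grpc.ManagedChannel;
import io.grpc.Server;
import io.grpc.ServerInterceptors;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloReply;
import io.grpc.examples.helloworld.HelloRequest;
import io.grpc.inprocess.InProcessChannelBuilder;
import io.grpc.inprocess.InProcessServerBuilder;
import io.grpc.stub.StreamObserver;
import java.util.concurrent.atomic.AtomicReference;
import org.apache.logging.log4j.ThreadContext;

public class HandlerTraceIdCheck {
  public static void main(String[] args) throws Exception {
    AtomicReference<String> traceIdSeenInHandler = new AtomicReference<>();

    GrpcTracing grpcTracing = GrpcTracing.create(Tracing.newBuilder()
        .currentTraceContext(ThreadContextCurrentTraceContext.create())
        .build());

    Server server = InProcessServerBuilder.forName("trace-test")
        .addService(ServerInterceptors.intercept(new GreeterGrpc.GreeterImplBase() {
          @Override public void sayHello(HelloRequest req, StreamObserver<HelloReply> obs) {
            // The reported bug: this read returned null even though the interceptor saw the trace.
            traceIdSeenInHandler.set(ThreadContext.get("traceId"));
            obs.onNext(HelloReply.getDefaultInstance());
            obs.onCompleted();
          }
        }, grpcTracing.newServerInterceptor()))
        .build().start();

    ManagedChannel channel = InProcessChannelBuilder.forName("trace-test").build();
    GreeterGrpc.newBlockingStub(channel)
        .withInterceptors(grpcTracing.newClientInterceptor())
        .sayHello(HelloRequest.getDefaultInstance());

    System.out.println("traceId visible in handler: " + traceIdSeenInHandler.get());
    channel.shutdownNow();
    server.shutdownNow();
  }
}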

wdittmer-mp commented 7 years ago

Hi,

I was creating a test project and ran into this issue as well. I also see that the MDC values are cleared just before execution reaches the implementation of the service code.

What I found (in 4.3.2): the scope is set in TracingServerInterceptor.interceptCall() with the values from the gRPC metadata. Next, the scope is set in TracingServerInterceptor.request().

It closes the scope in TracingServerInterceptor when exiting the request scope (line 60); here previousTraceId and previousSpanId are set, so the MDC values are kept.

Then it closes the scope in TracingServerInterceptor when done with interceptCall(); here previousTraceId and previousSpanId are both null, so this clears the MDC values.

And then the implementation of the gRPC service is called, and the MDC values are gone.

Then at responseObserver.onCompleted() it creates a newScope again, and the MDC values are back.
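
That walkthrough matches the save/restore pattern such MDC scopes use. A simplified illustration, not Brave's actual code (class and field names are mine):

import org.apache.logging.log4j.ThreadContext;

// Each scope remembers the previous MDC values and restores them on close.
// This is why closing the interceptor's scope before the handler runs clears
// traceId/spanId: the outermost scope had nothing to restore.
final class MdcScope implements AutoCloseable {
  final String previousTraceId = ThreadContext.get("traceId");
  final String previousSpanId = ThreadContext.get("spanId");

  MdcScope(String traceId, String spanId) {
    ThreadContext.put("traceId", traceId);
    ThreadContext.put("spanId", spanId);
  }

  @Override public void close() {
    // Nested scope: previous* are non-null, outer values survive.
    // Outermost scope: previous* are null, entries are removed.
    restore("traceId", previousTraceId);
    restore("spanId", previousSpanId);
  }

  static void restore(String key, String value) {
    if (value != null) ThreadContext.put(key, value);
    else ThreadContext.remove(key);
  }
}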

Just upgraded to 4.3.3 and still see it does not work.

codefromthecrypt commented 7 years ago

thanks for the ping. this ball was dropped (I went on holiday and no-one picked it up)

wdittmer-mp commented 7 years ago

Follow up: In NettyServerStream.java a new Runnable is created (see request() line 102). This runnable runs on the grpc-default-worker-ELG thread pool and then via the CurrentTraceContextExecutor puts the task on my own thread pool. But at this point it has no context and cannot get to the traceId/spanId or MDC values. I also tried the (experimental) GrpcCurrentTraceContext, but this does not work either. Does this make sense?
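
For reference, this is the kind of executor wrapping being described; a sketch assuming Brave 4.x's wrapping helpers on CurrentTraceContext (the factory class and pool size are illustrative). It only helps if a context is attached at submission time, which is exactly what the thread hop above breaks:

import brave.Tracing;
import brave.propagation.CurrentTraceContext;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class TracingPoolFactory {
  // Wrap the application pool so each submitted task runs with the trace
  // context that was current when the task was submitted.
  static ExecutorService wrap(Tracing tracing) {
    CurrentTraceContext currentTraceContext = tracing.currentTraceContext();
    return currentTraceContext.executorService(Executors.newFixedThreadPool(4));
  }
}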

codefromthecrypt commented 7 years ago

Great details. Got distracted today, but I can give it a swing tomorrow.

codefromthecrypt commented 7 years ago

https://github.com/openzipkin/brave/pull/436

wdittmer-mp commented 7 years ago

I checked the code change and can see it works now, including with a custom thread pool and MDCCurrentTraceContext. I will try to incorporate this in my test project and check that when the gRPC server implementation calls another service, the trace_id continues to be propagated.
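
For that propagation check, the client-side wiring is symmetric; a hedged sketch (the factory class, address, and plaintext setting are placeholders):

import brave.grpc.GrpcTracing;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

class TracedChannelFactory {
  // Attach Brave's client interceptor so the outgoing call carries x-b3-*
  // headers for the current trace to the next service.
  static ManagedChannel forServiceC(GrpcTracing grpcTracing) {
    return ManagedChannelBuilder.forAddress("localhost", 6600)
        .usePlaintext(true)
        .intercept(grpcTracing.newClientInterceptor())
        .build();
  }
}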

codefromthecrypt commented 7 years ago

@wdittmer-mp great thanks. will release a patch once confirmed

wdittmer-mp commented 7 years ago

Quickly tested with the 4.3.4-SNAPSHOT build, and I can see the traceId and spanId in the logging when I log from the server implementation. Next, calling another service from the server implementation to check propagation. (Still need to write that, because I was stuck on the logging issue :)

----------------INBOUND--------------------
[id: 0xee4e7afc, L:/127.0.0.1:6600 - R:/127.0.0.1:49529] HEADERS: streamId=3, headers=GrpcHttp2RequestHeaders[:path: /serviceb.ServiceB/ShortRunning, :authority: localhost:6600, :method: POST, :scheme: http, te: trailers, content-type: application/grpc, user-agent: grpc-java-netty/1.3.0, x-b3-traceid: 2fc70b202538a54a, x-b3-spanid: 6c537a6139d78bbe, x-b3-parentspanid: 2fc70b202538a54a, x-b3-sampled: 1, x-b3-traceid: **2fc70b202538a54a**, x-b3-spanid: **a091f1f9fb4e9e1b**, x-b3-parentspanid: 6c537a6139d78bbe, x-b3-sampled: 1, grpc-accept-encoding: gzip, grpc-timeout: 59989392u], streamDependency=0, weight=16, exclusive=false, padding=0, endStream=false
------------------------------------
2017-06-09 14:39:04,374 [/] [grpc-default-worker-ELG-3-1] DEBUG io.grpc.netty.NettyServerHandler - 
----------------INBOUND--------------------
[id: 0xee4e7afc, L:/127.0.0.1:6600 - R:/127.0.0.1:49529] DATA: streamId=3, padding=0, endStream=true, length=5, bytes=0000000000
------------------------------------
2017-06-09 14:39:04,384 [**2fc70b202538a54a/a091f1f9fb4e9e1b**] [my-test-thread-pool] INFO  z.s.grpc.serviceb.GrpcServiceBImpl - TEST B called with shortRunning
2017-06-09 14:39:04,384 [**2fc70b202538a54a/a091f1f9fb4e9e1b**] [my-test-thread-pool] INFO  zipkintrace.serviceb.ServiceB - shortRunning was called: 

codefromthecrypt commented 7 years ago

thanks for the early test. I've merged, so you can test the published snapshot version; http://oss.jfrog.org/artifactory/oss-snapshot-local is the mvn snapshot repo

once ready let us know and we'll cut a patch version

wdittmer-mp commented 7 years ago

TL;DR: works with my test project. I can see the MDC values propagate HTTP -> HTTP, HTTP -> gRPC, gRPC -> gRPC, and gRPC -> HTTP.

So I wrote a test project that has serviceA (HTTP only), serviceB (gRPC only), and serviceC (both gRPC and HTTP endpoints). The test was run with the 4.3.4 release containing the fix. My goal was to get the MDC values filled with the traceId and spanId, to ultimately be able to trace the logging back to the Zipkin traces. Below are three tests and their logging, using the pattern %d [%X{traceId}/%X{spanId}] [%thread] %-5level %logger{36} - %msg%n. For each test, the traceId is expected to remain the same and the spanId to change when crossing a service boundary.

Test 1: A:HTTP -> B:gRPC -> C:HTTP

Service A:

2017-06-15 14:04:58,185 [e9d5e22a72ef17de/e9d5e22a72ef17de] [http-nio-8080-exec-3] INFO zipkintrace.servicea.web.AController - A:HTTP -> B:gRPC -> C:HTTP
2017-06-15 14:04:58,185 [e9d5e22a72ef17de/e9d5e22a72ef17de] [http-nio-8080-exec-3] INFO zipkintrace.servicea.ServiceA - Running process A in service A
2017-06-15 14:04:58,185 [e9d5e22a72ef17de/e9d5e22a72ef17de] [http-nio-8080-exec-3] INFO z.servicea.services.ServiceB - Calling shortRunning on Service B
2017-06-15 14:04:58,867 [e9d5e22a72ef17de/e9d5e22a72ef17de] [http-nio-8080-exec-3] INFO z.servicea.services.ServiceB - Result of shortRunning on Service B: 100
2017-06-15 14:04:58,867 [e9d5e22a72ef17de/e9d5e22a72ef17de] [http-nio-8080-exec-3] INFO z.servicea.services.ServiceB - Calling longRunning on Service B
2017-06-15 14:04:59,424 [e9d5e22a72ef17de/e9d5e22a72ef17de] [http-nio-8080-exec-3] INFO z.servicea.services.ServiceB - Result of longRunning on Service B: 300

Service B:

2017-06-15 14:04:58,188 [e9d5e22a72ef17de/243002ca06e5289c] [my-test-thread-pool] INFO z.s.grpc.serviceb.GrpcServiceBImpl - Service B called with shortRunning
2017-06-15 14:04:58,188 [e9d5e22a72ef17de/243002ca06e5289c] [my-test-thread-pool] INFO zipkintrace.serviceb.ServiceB - shortRunning was called
2017-06-15 14:04:58,761 [e9d5e22a72ef17de/243002ca06e5289c] [my-test-thread-pool] INFO z.serviceb.services.ServiceC - Got HTTP response from service C: test call C
2017-06-15 14:04:58,869 [e9d5e22a72ef17de/1ef903de4d474d60] [my-test-thread-pool] INFO z.s.grpc.serviceb.GrpcServiceBImpl - Service B called with longRunning
2017-06-15 14:04:58,869 [e9d5e22a72ef17de/1ef903de4d474d60] [my-test-thread-pool] INFO zipkintrace.serviceb.ServiceB - longRunning was called
2017-06-15 14:04:59,120 [e9d5e22a72ef17de/1ef903de4d474d60] [my-test-thread-pool] INFO z.serviceb.services.ServiceC - Got HTTP response from service C: test call C

Service C:

2017-06-15 14:04:58,382 [e9d5e22a72ef17de/510914d4406c34b8] [http-nio-8100-exec-1] INFO zipkintrace.servicec.ServiceC - Processing in Service C
2017-06-15 14:04:58,873 [e9d5e22a72ef17de/64a797e95220f094] [http-nio-8100-exec-2] INFO zipkintrace.servicec.ServiceC - Processing in Service C

Test 2: A:HTTP->B:gRPC -> C:gRPC

Service A:

2017-06-15 14:03:44,452 [3ccbf053cdafa01c/3ccbf053cdafa01c] [http-nio-8080-exec-1] INFO zipkintrace.servicea.web.AController - A:HTTP -> B:gRPC -> C:gRPC
2017-06-15 14:03:44,452 [3ccbf053cdafa01c/3ccbf053cdafa01c] [http-nio-8080-exec-1] INFO zipkintrace.servicea.ServiceA - Running process B in service A
2017-06-15 14:03:44,452 [3ccbf053cdafa01c/3ccbf053cdafa01c] [http-nio-8080-exec-1] INFO z.servicea.services.ServiceB - Calling viaC on Service B
2017-06-15 14:03:45,005 [3ccbf053cdafa01c/3ccbf053cdafa01c] [http-nio-8080-exec-1] INFO z.servicea.services.ServiceB - Result of viaC on Service B: 500

Service B:

2017-06-15 14:03:44,693 [3ccbf053cdafa01c/3fd4c81866c7a092] [my-test-thread-pool] INFO z.s.grpc.serviceb.GrpcServiceBImpl - Service B called with viaC
2017-06-15 14:03:44,693 [3ccbf053cdafa01c/3fd4c81866c7a092] [my-test-thread-pool] INFO zipkintrace.serviceb.ServiceB - viaC was called
2017-06-15 14:03:44,704 [3ccbf053cdafa01c/811a1ec81b6ef68c] [my-test-thread-pool] INFO io.grpc.internal.ClientCallImpl - Call timeout set to '59968182502' ns, due to context deadline. Explicit call timeout was '59991494758' ns.
2017-06-15 14:03:44,989 [3ccbf053cdafa01c/3fd4c81866c7a092] [my-test-thread-pool] INFO z.serviceb.services.ServiceC - Got gRPC response from service C: 789

Service C:

2017-06-15 14:03:44,831 [3ccbf053cdafa01c/811a1ec81b6ef68c] [my-test-thread-pool] INFO z.s.grpc.servicec.GrpcServiceCImpl - Service C called with callC
2017-06-15 14:03:44,831 [3ccbf053cdafa01c/811a1ec81b6ef68c] [my-test-thread-pool] INFO zipkintrace.servicec.ServiceC - Service C: Call C

Test 3: A:HTTP->C:HTTP

Service A:

2017-06-15 14:05:02,012 [8b4a42c23ff8ca24/8b4a42c23ff8ca24] [http-nio-8080-exec-4] INFO zipkintrace.servicea.web.AController - A:HTTP -> C:HTTP
2017-06-15 14:05:02,012 [8b4a42c23ff8ca24/8b4a42c23ff8ca24] [http-nio-8080-exec-4] INFO zipkintrace.servicea.ServiceA - Running process C in service A
2017-06-15 14:05:02,294 [8b4a42c23ff8ca24/8b4a42c23ff8ca24] [http-nio-8080-exec-4] INFO z.servicea.services.ServiceC - Got response from service C: test call C

Service C:

2017-06-15 14:05:02,037 [8b4a42c23ff8ca24/cd56ee39fde2e096] [http-nio-8100-exec-3] INFO zipkintrace.servicec.ServiceC - Processing in Service C

codefromthecrypt commented 7 years ago

nice report!