I raised a PR about this. If that doesn't work OOB, it might be a case where...
I wrapped the log4j2 context and that didn't work. But just to make sure I'm accessing it correctly: is ThreadContext.get("traceId") the expected usage?
I was able to work around this issue using opentracing and the brave bridge. Instead of using gRPC instrumentation from brave I'm using these two libraries:
<dependency>
  <groupId>io.opentracing.contrib</groupId>
  <artifactId>grpc-opentracing</artifactId>
</dependency>
<dependency>
  <groupId>io.opentracing.brave</groupId>
  <artifactId>brave-opentracing</artifactId>
</dependency>
To set it up:
Tracer tracer = BraveTracer.create(Tracing.newBuilder()
    .localServiceName(serviceName)
    .reporter(Reporter.NOOP)
    .sampler(Sampler.ALWAYS_SAMPLE)
    .build());
ServerTracingInterceptor tracingInterceptor = new ServerTracingInterceptor(tracer);
serverBuilder.addService(ServerInterceptors.intercept(
    serviceDefinition,
    tracingInterceptor));
And I was able to access span information in my gRPC request handling code:
if (OpenTracingContextKey.activeSpan() != null) {
  SpanContext spanContext = OpenTracingContextKey.activeSpan().context();
  if (spanContext instanceof BraveSpanContext) {
    TraceContext traceContext = ((BraveSpanContext) spanContext).unwrap();
    System.out.println(traceContext.traceId());
    System.out.println(traceContext.spanId());
    System.out.println(traceContext.parentId());
  }
}
It looks like the grpc-opentracing library took care of injecting the span into the gRPC context. Perhaps brave could do something similar as well?
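For comparison, a rough brave-native sketch of the same lookup without the OpenTracing bridge (assuming tracing is the brave.Tracing instance built at startup and that brave's interceptor has a scope open at this point):
// hypothetical equivalent: read the trace context that brave's interceptor put in scope
TraceContext context = tracing.currentTraceContext().get();
if (context != null) {
  System.out.println(context.traceId());
  System.out.println(context.spanId());
  System.out.println(context.parentId());
}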
sorry I haven't resumed looking at this yet today. I suspect it is only a matter of a missing callback in our code. I intend to make a test case proving this works.
ok focused on this now (and feel free to chat on https://gitter.im/openzipkin/zipkin if you like). regardless I'll update when I get a PR with a fix. Thanks for trying the thing I posted last night even though I didn't expect it would fix it... it was just something quick to try before going to sleep.
so looks like we had a test for the client, but not the server. looking at the server now
https://github.com/openzipkin/brave/pull/428 for work in progress. should be fixed today and whether or not you use gRPC's context doesn't matter.
ok found the bug. it was indeed a missing scope concern. will cut a patch release shortly
4.3.2 on the way. let me know if it doesn't fix the issue
It doesn't look like the issue is fully resolved. I'm still not getting data in ThreadContext from my request handling code. I stepped through the debugger and it looks like ThreadContextCurrentTraceContext.newScope gets called once in the interceptor chain on the main event loop thread, and a second time on the request handling thread but only at onCompleted; the scope is then closed immediately.
System.out.println("before complete: " + ThreadContext.get("traceId")); // null
responseObserver.onNext(responseForRequest(request));
responseObserver.onCompleted();
System.out.println("after complete: " + ThreadContext.get("traceId")); // null
Maybe I'm missing something on the correct usage pattern?
I don't know gRPC well enough to fix this, but I can provide an example that I know works for spring-sleuth:
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.sleuth.Span;
import org.springframework.cloud.sleuth.SpanExtractor;
import org.springframework.cloud.sleuth.Tracer;

import io.grpc.ForwardingServerCallListener;
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;

public class GrpcTracingServerInterceptor implements ServerInterceptor {

  private Tracer tracer;
  private SpanExtractor<Metadata> spanExtractor;

  @Autowired
  public GrpcTracingServerInterceptor(Tracer tracer, SpanExtractor<Metadata> spanExtractor) {
    this.tracer = tracer;
    this.spanExtractor = spanExtractor;
  }

  @Override
  public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
      ServerCall<ReqT, RespT> call, Metadata headers, ServerCallHandler<ReqT, RespT> next) {
    // join the trace carried in the gRPC metadata and mark the server-receive event
    final Span span = spanExtractor.joinTrace(headers);
    tracer.continueSpan(span);
    span.logEvent(Span.SERVER_RECV);
    final ServerCall.Listener<ReqT> original = next.startCall(call, headers);
    return new ForwardingServerCallListener.SimpleForwardingServerCallListener<ReqT>(original) {
      @Override
      public void onComplete() {
        // re-attach the span on the completing thread, mark server-send, then close it
        tracer.continueSpan(span);
        span.logEvent(Span.SERVER_SEND);
        tracer.close(span);
        super.onComplete();
      }
    };
  }
}
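It can then be registered like any other server interceptor when building the gRPC server (names here mirror the earlier snippets, so purely illustrative):
// hypothetical wiring of the sleuth-based interceptor
serverBuilder.addService(ServerInterceptors.intercept(
    serviceDefinition,
    new GrpcTracingServerInterceptor(tracer, spanExtractor)));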
can you make a test for what isn't working as a pull request? here's a good place to start.
Much quicker to understand, and this is how we'd fix it anyway
heh thanks brian. In the meantime I can poke around to try and reproduce... there are multiple interaction patterns in grpc, maybe we are missing one.
Basically if you had a line like this LOG.info("some stuff") in the implementation of io.grpc.examples.helloworld.GreeterGrpc.sayHello(), it would not include any trace information. I'll try to come up with a test case for that.
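A minimal sketch of that repro (hypothetical class name, using the stock helloworld service from grpc-java and log4j2):
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloReply;
import io.grpc.examples.helloworld.HelloRequest;
import io.grpc.stub.StreamObserver;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// the log line inside sayHello() should carry %X{traceId}/%X{spanId},
// but they come out empty because no scope is open on the handler thread
public class GreeterImpl extends GreeterGrpc.GreeterImplBase {
  private static final Logger LOG = LogManager.getLogger(GreeterImpl.class);

  @Override
  public void sayHello(HelloRequest request, StreamObserver<HelloReply> responseObserver) {
    LOG.info("some stuff"); // traceId/spanId missing here
    responseObserver.onNext(HelloReply.newBuilder().setMessage("Hello " + request.getName()).build());
    responseObserver.onCompleted();
  }
}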
Hi,
I was creating a test project and ran into this issue as well. I also see that the MDC values are cleared just before it gets to the implementation of the service code.
What I found (in 4.3.2) was: the scope is set in TracingServerInterceptor.interceptCall() with the values from the gRPC metadata. Next the scope is set again in TracingServerInterceptor.request().
It then closes the scope in TracingServerInterceptor when exiting the request scope (line 60); here previousTraceId and previousSpanId are set, so the MDC keeps its values.
Then it closes the scope in TracingServerInterceptor when done with the interceptRequest; here previousTraceId and previousSpanId are both null, so this clears the MDC values.
And then the implementation of the gRPC service is called, and the MDC values are gone.
Then at responseObserver.onCompleted() it creates a newScope again and the MDC values are back.
Just upgraded to 4.3.3 and still see it does not work.
thanks for the ping. this ball was dropped (I went on holiday and no-one picked it up)
Follow up: In NettyServerStream.java a new Runnable is created (see request() line 102). This runnable runs on the grpc-default-worker-ELG thread pool and then via the CurrentTraceContextExecutor puts the task on my own thread pool. But at this point it has no context and cannot get to the traceId/spanId or MDC values. I also tried the (experimental) GrpcCurrentTraceContext, but this does not work either. Does this make sense?
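The executor wrapping mentioned above is roughly the sketch below (names assumed); the problem is that the grpc worker thread has no scope open when it hands the task over, so there is nothing for the wrapper to capture:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import brave.context.log4j2.ThreadContextCurrentTraceContext;
import brave.propagation.CurrentTraceContext;

// wrap the application's own pool so that submitted tasks run with whatever
// trace context is current on the submitting thread
CurrentTraceContext currentTraceContext = ThreadContextCurrentTraceContext.create();
ExecutorService myPool =
    currentTraceContext.executorService(Executors.newFixedThreadPool(4));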
Great details. Got distracted today, but can take a swing at it tomorrow.
I checked the code change and see it works now, also with a custom thread pool and MDCCurrentTraceContext. I will try to incorporate this in my test project and check that when the gRPC server implementation calls another service, the trace_id continues to be propagated.
@wdittmer-mp great thanks. will release a patch once confirmed
Quickly tested with the 4.3.4-SNAPSHOT build and I can see the traceId and spanId in the logging when I log from the server implementation. Next, calling another service from the server implementation to check propagation. (Still need to write that 'cause I was stuck on the logging issue :)
----------------INBOUND--------------------
[id: 0xee4e7afc, L:/127.0.0.1:6600 - R:/127.0.0.1:49529] HEADERS: streamId=3, headers=GrpcHttp2RequestHeaders[:path: /serviceb.ServiceB/ShortRunning, :authority: localhost:6600, :method: POST, :scheme: http, te: trailers, content-type: application/grpc, user-agent: grpc-java-netty/1.3.0, x-b3-traceid: 2fc70b202538a54a, x-b3-spanid: 6c537a6139d78bbe, x-b3-parentspanid: 2fc70b202538a54a, x-b3-sampled: 1, x-b3-traceid: **2fc70b202538a54a**, x-b3-spanid: **a091f1f9fb4e9e1b**, x-b3-parentspanid: 6c537a6139d78bbe, x-b3-sampled: 1, grpc-accept-encoding: gzip, grpc-timeout: 59989392u], streamDependency=0, weight=16, exclusive=false, padding=0, endStream=false
------------------------------------
2017-06-09 14:39:04,374 [/] [grpc-default-worker-ELG-3-1] DEBUG io.grpc.netty.NettyServerHandler -
----------------INBOUND--------------------
[id: 0xee4e7afc, L:/127.0.0.1:6600 - R:/127.0.0.1:49529] DATA: streamId=3, padding=0, endStream=true, length=5, bytes=0000000000
------------------------------------
2017-06-09 14:39:04,384 [**2fc70b202538a54a/a091f1f9fb4e9e1b**] [my-test-thread-pool] INFO z.s.grpc.serviceb.GrpcServiceBImpl - TEST B called with shortRunning
2017-06-09 14:39:04,384 [**2fc70b202538a54a/a091f1f9fb4e9e1b**] [my-test-thread-pool] INFO zipkintrace.serviceb.ServiceB - shortRunning was called:
thanks for the early test. I've merged, so you can test the version published to jfrog; http://oss.jfrog.org/artifactory/oss-snapshot-local is the mvn snapshot repo
once ready let us know and we'll cut a patch version
TL;DR: works with my test project. I can see the MDC values for HTTP -> HTTP, HTTP -> gRPC, gRPC -> gRPC and gRPC -> HTTP.
So I wrote a test project that has serviceA which is HTTP only, serviceB which is gRPC only and serviceC which has both gRPC and HTTP end points. The test was run with the 4.3.4 release containing the fix.
My goal was to get the MDC values filled with the traceId and spanId to ultimately be able to trace the logging back to the Zipkin traces. Below are three tests and the logging, using the pattern %d [%X{traceId}/%X{spanId}] [%thread] %-5level %logger{36} - %msg%n. For each test the traceId is expected to remain the same and the spanId to change when crossing a service boundary.
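For reference, a minimal log4j2 configuration using that pattern could look like the following (appender and logger setup are placeholders, not the exact config of the test project):
<Configuration>
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d [%X{traceId}/%X{spanId}] [%thread] %-5level %logger{36} - %msg%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>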
Service A:
2017-06-15 14:04:58,185 [e9d5e22a72ef17de/e9d5e22a72ef17de] [http-nio-8080-exec-3] INFO zipkintrace.servicea.web.AController - A:HTTP -> B:gRPC -> C:HTTP
2017-06-15 14:04:58,185 [e9d5e22a72ef17de/e9d5e22a72ef17de] [http-nio-8080-exec-3] INFO zipkintrace.servicea.ServiceA - Running process A in service A
2017-06-15 14:04:58,185 [e9d5e22a72ef17de/e9d5e22a72ef17de] [http-nio-8080-exec-3] INFO z.servicea.services.ServiceB - Calling shortRunning on Service B
2017-06-15 14:04:58,867 [e9d5e22a72ef17de/e9d5e22a72ef17de] [http-nio-8080-exec-3] INFO z.servicea.services.ServiceB - Result of shortRunning on Service B: 100
2017-06-15 14:04:58,867 [e9d5e22a72ef17de/e9d5e22a72ef17de] [http-nio-8080-exec-3] INFO z.servicea.services.ServiceB - Calling longRunning on Service B
2017-06-15 14:04:59,424 [e9d5e22a72ef17de/e9d5e22a72ef17de] [http-nio-8080-exec-3] INFO z.servicea.services.ServiceB - Result of longRunning on Service B: 300
Service B:
2017-06-15 14:04:58,188 [e9d5e22a72ef17de/243002ca06e5289c] [my-test-thread-pool] INFO z.s.grpc.serviceb.GrpcServiceBImpl - Service B called with shortRunning
2017-06-15 14:04:58,188 [e9d5e22a72ef17de/243002ca06e5289c] [my-test-thread-pool] INFO zipkintrace.serviceb.ServiceB - shortRunning was called
2017-06-15 14:04:58,761 [e9d5e22a72ef17de/243002ca06e5289c] [my-test-thread-pool] INFO z.serviceb.services.ServiceC - Got HTTP response from service C: test call C
2017-06-15 14:04:58,869 [e9d5e22a72ef17de/1ef903de4d474d60] [my-test-thread-pool] INFO z.s.grpc.serviceb.GrpcServiceBImpl - Service B called with longRunning
2017-06-15 14:04:58,869 [e9d5e22a72ef17de/1ef903de4d474d60] [my-test-thread-pool] INFO zipkintrace.serviceb.ServiceB - longRunning was called
2017-06-15 14:04:59,120 [e9d5e22a72ef17de/1ef903de4d474d60] [my-test-thread-pool] INFO z.serviceb.services.ServiceC - Got HTTP response from service C: test call C
Service C:
2017-06-15 14:04:58,382 [e9d5e22a72ef17de/510914d4406c34b8] [http-nio-8100-exec-1] INFO zipkintrace.servicec.ServiceC - Processing in Service C
2017-06-15 14:04:58,873 [e9d5e22a72ef17de/64a797e95220f094] [http-nio-8100-exec-2] INFO zipkintrace.servicec.ServiceC - Processing in Service C
Service A:
2017-06-15 14:03:44,452 [3ccbf053cdafa01c/3ccbf053cdafa01c] [http-nio-8080-exec-1] INFO zipkintrace.servicea.web.AController - A:HTTP -> B:gRPC -> C:gRPC
2017-06-15 14:03:44,452 [3ccbf053cdafa01c/3ccbf053cdafa01c] [http-nio-8080-exec-1] INFO zipkintrace.servicea.ServiceA - Running process B in service A
2017-06-15 14:03:44,452 [3ccbf053cdafa01c/3ccbf053cdafa01c] [http-nio-8080-exec-1] INFO z.servicea.services.ServiceB - Calling viaC on Service B
2017-06-15 14:03:45,005 [3ccbf053cdafa01c/3ccbf053cdafa01c] [http-nio-8080-exec-1] INFO z.servicea.services.ServiceB - Result of viaC on Service B: 500
Service B:
2017-06-15 14:03:44,693 [3ccbf053cdafa01c/3fd4c81866c7a092] [my-test-thread-pool] INFO z.s.grpc.serviceb.GrpcServiceBImpl - Service B called with viaC
2017-06-15 14:03:44,693 [3ccbf053cdafa01c/3fd4c81866c7a092] [my-test-thread-pool] INFO zipkintrace.serviceb.ServiceB - viaC was called
2017-06-15 14:03:44,704 [3ccbf053cdafa01c/811a1ec81b6ef68c] [my-test-thread-pool] INFO io.grpc.internal.ClientCallImpl - Call timeout set to '59968182502' ns, due to context deadline. Explicit call timeout was '59991494758' ns.
2017-06-15 14:03:44,989 [3ccbf053cdafa01c/3fd4c81866c7a092] [my-test-thread-pool] INFO z.serviceb.services.ServiceC - Got gRPC response from service C: 789
Service C:
2017-06-15 14:03:44,831 [3ccbf053cdafa01c/811a1ec81b6ef68c] [my-test-thread-pool] INFO z.s.grpc.servicec.GrpcServiceCImpl - Service C called with callC
2017-06-15 14:03:44,831 [3ccbf053cdafa01c/811a1ec81b6ef68c] [my-test-thread-pool] INFO zipkintrace.servicec.ServiceC - Service C: Call C
Service A:
2017-06-15 14:05:02,012 [8b4a42c23ff8ca24/8b4a42c23ff8ca24] [http-nio-8080-exec-4] INFO zipkintrace.servicea.web.AController - A:HTTP -> C:HTTP
2017-06-15 14:05:02,012 [8b4a42c23ff8ca24/8b4a42c23ff8ca24] [http-nio-8080-exec-4] INFO zipkintrace.servicea.ServiceA - Running process C in service A
2017-06-15 14:05:02,294 [8b4a42c23ff8ca24/8b4a42c23ff8ca24] [http-nio-8080-exec-4] INFO z.servicea.services.ServiceC - Got response from service C: test call C
Service C:
2017-06-15 14:05:02,037 [8b4a42c23ff8ca24/cd56ee39fde2e096] [http-nio-8100-exec-3] INFO zipkintrace.servicec.ServiceC - Processing in Service C
nice report!
Short summary of what I want to accomplish: We are building gRPC services and want to automatically include traceId and spanId in every log entry we emit from the service so we can correlate them. The logging library we use is Log4j2.
Brave libraries we use:
Here is how I set up the gRPC server to use Brave:
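(The exact snippet is omitted here; roughly it is along these lines, with serviceName, serviceDefinition and reporter standing in for the real values, and the server interceptor coming from brave-instrumentation-grpc.)
// approximate setup: brave Tracing wired to log4j2's ThreadContext, plus the
// brave gRPC server interceptor
Tracing tracing = Tracing.newBuilder()
    .localServiceName(serviceName)
    .currentTraceContext(ThreadContextCurrentTraceContext.create())
    .reporter(reporter)
    .build();
GrpcTracing grpcTracing = GrpcTracing.create(tracing);
serverBuilder.addService(ServerInterceptors.intercept(
    serviceDefinition, grpcTracing.newServerInterceptor()));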
We have a custom log4j layout class that spits out JSON. The relevant piece of code is something like this:
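(Again only a rough sketch rather than the original code: the layout reads the fields from the log4j2 ThreadContext while formatting each event, with jsonFields standing in for whatever map the layout writes into.)
// hypothetical extract of the custom JSON layout
jsonFields.put("traceId", ThreadContext.get("traceId"));
jsonFields.put("spanId", ThreadContext.get("spanId"));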
However ThreadContext.get("traceId") always seems to be null. I've stepped through the debugger and it looks like ThreadContextCurrentTraceContext is populating TraceContext with the correct trace and span in the server interceptor. Maybe this is due to the internal threading model of gRPC, since the thread that runs the interceptor is different from the thread that runs the API code? They mention something along these lines at the bottom of this page: https://logging.apache.org/log4j/2.x/manual/thread-context.html. What is the right way to achieve this? Should I implement my own CurrentTraceContext that puts tracing information into a gRPC context instead?