microsoft / ApplicationInsights-Java

Application Insights for Java
http://aka.ms/application-insights
Other
297 stars 199 forks source link

W3C tracing correlating only 99% of the time using webclients under agent 3.0 #1453

Closed petergphillips closed 3 years ago

petergphillips commented 3 years ago

Expected behaviour

When one spring boot application calls out to another spring boot application using WebClient the java agent will set headers so that request can be correlated. Under 2.6.2 there was no support for WebClient so requests weren't being correlated at all. We expect all requests to be correlated correctly, so dependencies shows the outgoing operation_Id and then requests shows the incoming request with the same operation_Id.

Actual behaviour

About 1% of the requests aren't being correlated correctly. In the first application I can see that the request has generated a new operation_Id which is duplicated in the operation_ParentId column too. I can see the outgoing request in dependencies with the same operation_Id. However in the second application about 1% of the time the request gets a different operation_Id that doesn't match the first application request and can't be matched with any other requests going on at the same time. The operation_ParentId then has a completely different value too and doesn't even match the operation_Id, indicating to me that the traceparent is coming through, but with incorrect values.

To Reproduce

Make lots of calls to the first application and then check app insights. I have been doing

for i in {1..1000}; do curl http://localhost:8080/prisons/id/MDI;echo;done

and then checking app insights to see if there are any requests that aren't correlated:

requests
| where cloud_RoleName == 'prison-estate'
| project cloud_RoleName, operation_Id, operation_ParentId, name, timestamp, cloud_RoleInstance, sdkVersion, application_Version
| where name == "/prisons/id/{prisonId}"
| join kind=rightanti (requests 
| where cloud_RoleName == "prison-estate-2"
| where url == "http:/localhost:8081/health/diskSpace"
) on operation_Id

I don't seem to be able to reproduce the behaviour with two servers running locally, but can reproduce if the second server is in the cloud - tried both AWS & Azure with same results.

Sample Application

Project to replicate the issue is https://github.com/ministryofjustice/prison-estate.git. Need Java 11. Replace with your key for app insights.

The first application can run locally:

git checkout pgp-DT-1442-insights-v3.0.2
./gradlew clean assemble && cp applicationinsights.json build/libs
APPLICATIONINSIGHTS_CONNECTION_STRING=InstrumentationKey=<your instrumentation key> java -javaagent:build/libs/applicationinsights-agent-3.0.2.jar -Dspring.profiles.active=dev -Dapi.base.url=<url of spring boot application in cloud> -jar build/libs/prison-estate-*.jar

The second application can be any spring boot app connected to the same app insights running in the cloud, the first app needs to be started with that url.

System information

SDK Version: 2.6.2 Java Agent Version: 3.0.2 OS type and version: Debian 14 Application Server type and version (if applicable): Spring Boot 2.4.2 Using spring-boot? yes

petergphillips commented 3 years ago

Just realised that I could dump the headers on the second application to see more what is going on. So, for the requests that don't have a corresponding first application match we have headers such as

request-id=|77329a55eca3cb659df62f13350830ba.60634bf93c0ffe1b.,
request-context=appId=cid-v1:0d5d6f09-1d07-47f5-9d3a-f9f473544e78,
tracestate=az=cid-v1:0d5d6f09-1d07-47f5-9d3a-f9f473544e78,
traceparent=00-77329a55eca3cb659df62f13350830ba-60634bf93c0ffe1b-01,

This indicates to me that the second application is receiving all the correct W3C headers. Looking at requests for that operation_Id or operation_ParentId we only find the second application one, there is not one for the first application. Nor is there a dependencies entry for the first application with that operation_Id. It therefore appears that due to the async nature of the webclient the first application is generating a new operation_Id for the request to the second application sometime after logging the correct dependencies entry though.

trask commented 3 years ago

hey @petergphillips, are there any warnings or errors in the applicationinsights.log file on the first application? thanks!

petergphillips commented 3 years ago

hi @trask, no there aren't any of either of them. Just reproduced the bug with level set to DEBUG. The ones that aren't matched in application two have an extra 3 lines of debug loggingabout wrapping runnable task in the middle of their six log entries:

2021-01-25 08:10:15.942Z DEBUG c.m.a.agent.Exporter - exporting span: SpanWrapper{delegate=RecordEventsReadableSpan{traceId=14f19ef05a17ddca0a644bbef647fcb2, spanId=6d5581ff8fd120f7, parentSpanContext=ImmutableSpanContext{traceIdAsHexString=14f19ef05a17ddca0a644bbef647fcb2, spanIdAsHexString=222ea0b60b94111c, traceFlags=1, traceState=ArrayBasedTraceState{entries=[]}, remote=false}, name=Headers are: {host=localhost:8080, user-agent=curl/7.58.0, accept=*/*}, kind=INTERNAL, attributes=AttributesMap{data={AttributeKeyImpl{getType=BOOLEAN, key=applicationinsights.internal.log}=true, AttributeKeyImpl{getType=STRING, key=applicationinsights.internal.log_level}=INFO, AttributeKeyImpl{getType=STRING, key=applicationinsights.internal.logger_name}=uk.gov.justice.digital.hmpps.prisonestate.resource.PrisonResource$Companion, AttributeKeyImpl{getType=LONG, key=thread.id}=140, AttributeKeyImpl{getType=STRING, key=thread.name}=http-nio-8080-exec-8}, capacity=1000, totalAddedValues=5}, status=ImmutableStatus{statusCode=UNSET, description=null}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1611562215942173113, endEpochNanos=1611562215942175232}, resolvedLinks=[], resolvedEvents=[], attributes=AttributesMap{data={AttributeKeyImpl{getType=BOOLEAN, key=applicationinsights.internal.log}=true, AttributeKeyImpl{getType=STRING, key=applicationinsights.internal.log_level}=INFO, AttributeKeyImpl{getType=STRING, key=applicationinsights.internal.logger_name}=uk.gov.justice.digital.hmpps.prisonestate.resource.PrisonResource$Companion, AttributeKeyImpl{getType=LONG, key=thread.id}=140, AttributeKeyImpl{getType=STRING, key=thread.name}=http-nio-8080-exec-8}, capacity=1000, totalAddedValues=5}, totalAttributeCount=5, totalRecordedEvents=0, status=ImmutableStatus{statusCode=UNSET, description=null}, name=Headers are: {host=localhost:8080, user-agent=curl/7.58.0, accept=*/*}, endEpochNanos=1611562215942175232, internalHasEnded=true}
2021-01-25 08:10:15.943Z DEBUG c.m.a.agent.Exporter - exporting span: SpanWrapper{delegate=RecordEventsReadableSpan{traceId=14f19ef05a17ddca0a644bbef647fcb2, spanId=5bfa1fc25abb2aa9, parentSpanContext=ImmutableSpanContext{traceIdAsHexString=14f19ef05a17ddca0a644bbef647fcb2, spanIdAsHexString=222ea0b60b94111c, traceFlags=1, traceState=ArrayBasedTraceState{entries=[]}, remote=false}, name=SELECT prison-estate-db.prison, kind=CLIENT, attributes=AttributesMap{data={AttributeKeyImpl{getType=STRING, key=db.system}=h2, AttributeKeyImpl{getType=LONG, key=thread.id}=140, AttributeKeyImpl{getType=STRING, key=thread.name}=http-nio-8080-exec-8, AttributeKeyImpl{getType=STRING, key=db.user}=sa, AttributeKeyImpl{getType=STRING, key=db.name}=prison-estate-db, AttributeKeyImpl{getType=STRING, key=db.connection_string}=h2:mem:, AttributeKeyImpl{getType=STRING, key=db.statement}=select prison0_.prison_id as prison_i1_0_0_, prison0_.active as active2_0_0_, prison0_.name as name3_0_0_ from prison prison0_ where prison0_.prison_id=?}, capacity=1000, totalAddedValues=7}, status=ImmutableStatus{statusCode=UNSET, description=null}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1611562215943209739, endEpochNanos=1611562215943286675}, resolvedLinks=[], resolvedEvents=[], attributes=AttributesMap{data={AttributeKeyImpl{getType=STRING, key=db.system}=h2, AttributeKeyImpl{getType=LONG, key=thread.id}=140, AttributeKeyImpl{getType=STRING, key=thread.name}=http-nio-8080-exec-8, AttributeKeyImpl{getType=STRING, key=db.user}=sa, AttributeKeyImpl{getType=STRING, key=db.name}=prison-estate-db, AttributeKeyImpl{getType=STRING, key=db.connection_string}=h2:mem:, AttributeKeyImpl{getType=STRING, key=db.statement}=select prison0_.prison_id as prison_i1_0_0_, prison0_.active as active2_0_0_, prison0_.name as name3_0_0_ from prison prison0_ where prison0_.prison_id=?}, capacity=1000, totalAddedValues=7}, totalAttributeCount=7, totalRecordedEvents=0, status=ImmutableStatus{statusCode=UNSET, description=null}, name=SELECT prison-estate-db.prison, endEpochNanos=1611562215943286675, internalHasEnded=true}
2021-01-25 08:10:15.944Z DEBUG c.m.a.agent.Exporter - exporting span: SpanWrapper{delegate=RecordEventsReadableSpan{traceId=14f19ef05a17ddca0a644bbef647fcb2, spanId=46fdf6dde958442d, parentSpanContext=ImmutableSpanContext{traceIdAsHexString=14f19ef05a17ddca0a644bbef647fcb2, spanIdAsHexString=222ea0b60b94111c, traceFlags=1, traceState=ArrayBasedTraceState{entries=[]}, remote=false}, name=SELECT prison-estate-db.prison_gp_practice, kind=CLIENT, attributes=AttributesMap{data={AttributeKeyImpl{getType=STRING, key=db.system}=h2, AttributeKeyImpl{getType=LONG, key=thread.id}=140, AttributeKeyImpl{getType=STRING, key=thread.name}=http-nio-8080-exec-8, AttributeKeyImpl{getType=STRING, key=db.user}=sa, AttributeKeyImpl{getType=STRING, key=db.name}=prison-estate-db, AttributeKeyImpl{getType=STRING, key=db.connection_string}=h2:mem:, AttributeKeyImpl{getType=STRING, key=db.statement}=select prisongppr0_.prison_id as prison_i1_1_0_, prisongppr0_.gp_practice_code as gp_pract2_1_0_ from prison_gp_practice prisongppr0_ where prisongppr0_.prison_id=?}, capacity=1000, totalAddedValues=7}, status=ImmutableStatus{statusCode=UNSET, description=null}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1611562215944077235, endEpochNanos=1611562215944132387}, resolvedLinks=[], resolvedEvents=[], attributes=AttributesMap{data={AttributeKeyImpl{getType=STRING, key=db.system}=h2, AttributeKeyImpl{getType=LONG, key=thread.id}=140, AttributeKeyImpl{getType=STRING, key=thread.name}=http-nio-8080-exec-8, AttributeKeyImpl{getType=STRING, key=db.user}=sa, AttributeKeyImpl{getType=STRING, key=db.name}=prison-estate-db, AttributeKeyImpl{getType=STRING, key=db.connection_string}=h2:mem:, AttributeKeyImpl{getType=STRING, key=db.statement}=select prisongppr0_.prison_id as prison_i1_1_0_, prisongppr0_.gp_practice_code as gp_pract2_1_0_ from prison_gp_practice prisongppr0_ where prisongppr0_.prison_id=?}, capacity=1000, totalAddedValues=7}, totalAttributeCount=7, totalRecordedEvents=0, status=ImmutableStatus{statusCode=UNSET, description=null}, name=SELECT prison-estate-db.prison_gp_practice, endEpochNanos=1611562215944132387, internalHasEnded=true}
2021-01-25 08:10:15.945Z DEBUG i.o.j.i.a.concurrent.RunnableWrapper - Wrapping runnable task reactor.netty.transport.TransportConnector$MonoChannelPromise$$Lambda$1572/0x0000000840eaec40@3d7ce303
2021-01-25 08:10:15.948Z DEBUG i.o.j.i.a.concurrent.RunnableWrapper - Wrapping runnable task reactor.netty.transport.TransportConnector$$Lambda$1576/0x0000000840ead040@2904c06a
2021-01-25 08:10:15.948Z DEBUG i.o.j.i.a.concurrent.RunnableWrapper - Wrapping runnable task reactor.netty.transport.TransportConnector$$Lambda$1576/0x0000000840ead040@2904c06a
2021-01-25 08:10:16.051Z DEBUG c.m.a.agent.Exporter - exporting span: SpanWrapper{delegate=RecordEventsReadableSpan{traceId=14f19ef05a17ddca0a644bbef647fcb2, spanId=357900a6cc6c9325, parentSpanContext=ImmutableSpanContext{traceIdAsHexString=14f19ef05a17ddca0a644bbef647fcb2, spanIdAsHexString=222ea0b60b94111c, traceFlags=1, traceState=ArrayBasedTraceState{entries=[]}, remote=false}, name=HTTP GET, kind=CLIENT, attributes=AttributesMap{data={AttributeKeyImpl{getType=LONG, key=thread.id}=140, AttributeKeyImpl{getType=STRING, key=thread.name}=http-nio-8080-exec-8, AttributeKeyImpl{getType=STRING, key=net.transport}=IP.TCP, AttributeKeyImpl{getType=STRING, key=http.method}=GET, AttributeKeyImpl{getType=STRING, key=http.flavor}=1.1, AttributeKeyImpl{getType=STRING, key=net.peer.name}=sign-in-dev.hmpps.service.justice.gov.uk, AttributeKeyImpl{getType=STRING, key=http.url}=https://sign-in-dev.hmpps.service.justice.gov.uk/auth/health/diskSpace, AttributeKeyImpl{getType=LONG, key=http.status_code}=200, AttributeKeyImpl{getType=STRING, key=applicationinsights.internal.target_app_id}=cid-v1:0d5d6f09-1d07-47f5-9d3a-f9f473544e78}, capacity=1000, totalAddedValues=9}, status=ImmutableStatus{statusCode=UNSET, description=null}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1611562215944931481, endEpochNanos=1611562216051418416}, resolvedLinks=[], resolvedEvents=[], attributes=AttributesMap{data={AttributeKeyImpl{getType=LONG, key=thread.id}=140, AttributeKeyImpl{getType=STRING, key=thread.name}=http-nio-8080-exec-8, AttributeKeyImpl{getType=STRING, key=net.transport}=IP.TCP, AttributeKeyImpl{getType=STRING, key=http.method}=GET, AttributeKeyImpl{getType=STRING, key=http.flavor}=1.1, AttributeKeyImpl{getType=STRING, key=net.peer.name}=sign-in-dev.hmpps.service.justice.gov.uk, AttributeKeyImpl{getType=STRING, key=http.url}=https://sign-in-dev.hmpps.service.justice.gov.uk/auth/health/diskSpace, AttributeKeyImpl{getType=LONG, key=http.status_code}=200, AttributeKeyImpl{getType=STRING, key=applicationinsights.internal.target_app_id}=cid-v1:0d5d6f09-1d07-47f5-9d3a-f9f473544e78}, capacity=1000, totalAddedValues=9}, totalAttributeCount=9, totalRecordedEvents=0, status=ImmutableStatus{statusCode=UNSET, description=null}, name=HTTP GET, endEpochNanos=1611562216051418416, internalHasEnded=true}
2021-01-25 08:10:16.053Z DEBUG c.m.a.agent.Exporter - exporting span: SpanWrapper{delegate=RecordEventsReadableSpan{traceId=14f19ef05a17ddca0a644bbef647fcb2, spanId=222ea0b60b94111c, parentSpanContext=ImmutableSpanContext{traceIdAsHexString=14f19ef05a17ddca0a644bbef647fcb2, spanIdAsHexString=7b07e793cff7d874, traceFlags=1, traceState=ArrayBasedTraceState{entries=[]}, remote=false}, name=PrisonResource.getPrisonFromId, kind=INTERNAL, attributes=AttributesMap{data={AttributeKeyImpl{getType=LONG, key=thread.id}=140, AttributeKeyImpl{getType=STRING, key=thread.name}=http-nio-8080-exec-8}, capacity=1000, totalAddedValues=2}, status=ImmutableStatus{statusCode=UNSET, description=null}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1611562215941693725, endEpochNanos=1611562216053058823}, resolvedLinks=[], resolvedEvents=[], attributes=AttributesMap{data={AttributeKeyImpl{getType=LONG, key=thread.id}=140, AttributeKeyImpl{getType=STRING, key=thread.name}=http-nio-8080-exec-8}, capacity=1000, totalAddedValues=2}, totalAttributeCount=2, totalRecordedEvents=0, status=ImmutableStatus{statusCode=UNSET, description=null}, name=PrisonResource.getPrisonFromId, endEpochNanos=1611562216053058823, internalHasEnded=true}
2021-01-25 08:10:16.053Z DEBUG c.m.a.agent.Exporter - exporting span: SpanWrapper{delegate=RecordEventsReadableSpan{traceId=14f19ef05a17ddca0a644bbef647fcb2, spanId=7b07e793cff7d874, parentSpanContext=ImmutableSpanContext{traceIdAsHexString=00000000000000000000000000000000, spanIdAsHexString=0000000000000000, traceFlags=0, traceState=ArrayBasedTraceState{entries=[]}, remote=false}, name=/prisons/id/{prisonId}, kind=SERVER, attributes=AttributesMap{data={AttributeKeyImpl{getType=LONG, key=thread.id}=140, AttributeKeyImpl{getType=STRING, key=thread.name}=http-nio-8080-exec-8, AttributeKeyImpl{getType=STRING, key=net.peer.ip}=127.0.0.1, AttributeKeyImpl{getType=LONG, key=net.peer.port}=58314, AttributeKeyImpl{getType=STRING, key=http.method}=GET, AttributeKeyImpl{getType=STRING, key=http.user_agent}=curl/7.58.0, AttributeKeyImpl{getType=STRING, key=http.url}=http://localhost:8080/prisons/id/MDI, AttributeKeyImpl{getType=STRING, key=http.flavor}=HTTP/1.1, AttributeKeyImpl{getType=STRING, key=http.client_ip}=127.0.0.1, AttributeKeyImpl{getType=LONG, key=http.status_code}=200}, capacity=1000, totalAddedValues=10}, status=ImmutableStatus{statusCode=UNSET, description=null}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1611562215941001222, endEpochNanos=1611562216053479083}, resolvedLinks=[], resolvedEvents=[], attributes=AttributesMap{data={AttributeKeyImpl{getType=LONG, key=thread.id}=140, AttributeKeyImpl{getType=STRING, key=thread.name}=http-nio-8080-exec-8, AttributeKeyImpl{getType=STRING, key=net.peer.ip}=127.0.0.1, AttributeKeyImpl{getType=LONG, key=net.peer.port}=58314, AttributeKeyImpl{getType=STRING, key=http.method}=GET, AttributeKeyImpl{getType=STRING, key=http.user_agent}=curl/7.58.0, AttributeKeyImpl{getType=STRING, key=http.url}=http://localhost:8080/prisons/id/MDI, AttributeKeyImpl{getType=STRING, key=http.flavor}=HTTP/1.1, AttributeKeyImpl{getType=STRING, key=http.client_ip}=127.0.0.1, AttributeKeyImpl{getType=LONG, key=http.status_code}=200}, capacity=1000, totalAddedValues=10}, totalAttributeCount=10, totalRecordedEvents=0, status=ImmutableStatus{statusCode=UNSET, description=null}, name=/prisons/id/{prisonId}, endEpochNanos=1611562216053479083, internalHasEnded=true}

Looking at the log file the only times that Wrapping runnable task appears is when the correlation doesn't then work correctly. Also none of the operation_Id values from the second application appear in the applicationinsights.log.

trask commented 3 years ago

hey @petergphillips, I think this was resolved very recently upstream (https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/2156), and that fix is in the latest SNAPSHOT: applicationinsights-agent-3.0.3-BETA-SNAPSHOT.jar.zip in case you have a chance to give it a try.

petergphillips commented 3 years ago

Hi @trask, just had a try with that latest snapshot. Still same result I'm afraid in that the same proportion aren't being matched. However with DEBUG now on I can see in the logs:

2021-02-11 13:51:01.041Z DEBUG i.o.j.i.a.concurrent.RunnableWrapper - Wrapping runnable task reactor.netty.transport.TransportConnector$MonoChannelPromise$$Lambda$1594/0x0000000840ea4c40@639050d5
2021-02-11 13:51:01.043Z DEBUG i.o.j.bootstrap.ExceptionLogger - Failed to handle exception in instrumentation for io.netty.channel.DefaultChannelPipeline on TomcatEmbeddedWebappClassLoader
  context: ROOT
  delegate: true
----------> Parent Classloader:
org.springframework.boot.loader.LaunchedURLClassLoader@2cd62003

java.lang.NullPointerException: null
    at io.netty.channel.DefaultChannelPipeline.addFirst(DefaultChannelPipeline.java:182)
    at io.netty.channel.DefaultChannelPipeline.addFirst(DefaultChannelPipeline.java:358)
    at io.netty.channel.DefaultChannelPipeline.addFirst(DefaultChannelPipeline.java:339)
    at reactor.netty.resources.DefaultPooledConnectionProvider$PooledConnectionAllocator$PooledConnectionInitializer.initChannel(DefaultPooledConnectionProvider.java:555)
    at io.netty.channel.ChannelInitializer.initChannel(ChannelInitializer.java:129)
    at io.netty.channel.ChannelInitializer.handlerAdded(ChannelInitializer.java:112)
    at io.netty.channel.AbstractChannelHandlerContext.callHandlerAdded(AbstractChannelHandlerContext.java:938)
    at io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:609)
    at io.netty.channel.DefaultChannelPipeline.access$100(DefaultChannelPipeline.java:46)
    at io.netty.channel.DefaultChannelPipeline$PendingHandlerAddedTask.execute(DefaultChannelPipeline.java:1463)
    at io.netty.channel.DefaultChannelPipeline.callHandlerAddedForAllHandlers(DefaultChannelPipeline.java:1115)
    at io.netty.channel.DefaultChannelPipeline.invokeHandlerAddedIfNeeded(DefaultChannelPipeline.java:650)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:502)

so it looks like something is going wrong with the wrapping and therefore the correlation?

trask commented 3 years ago

hey @petergphillips, there was another fix upstream related to context correlation in netty, and new support for reactor-netty that we pulled in to our distribution.

do you mind giving the latest snapshot another try? applicationinsights-agent-3.0.3-BETA-SNAPSHOT.jar.zip

petergphillips commented 3 years ago

hi @trask, just had a go with the latest snapshot and all seems to be working fine now. All the requests are being correlated correctly :+1:

trask commented 3 years ago

oh, that's awesome! thanks for your patience, planning to release 3.0.3-BETA end of this week or beginning of next week

trask commented 3 years ago

3.0.3-BETA is released!