openzipkin / zipkin-aws

Reporters and collectors for use in Amazon's cloud
Apache License 2.0
69 stars 34 forks source link

NullPointerException in TracingExecutionInterceptor during retries #158

Closed wpan1 closed 4 years ago

wpan1 commented 4 years ago

Hey 👋,

Using TracingExecutionInterceptor with SqsAsyncClient can throw a NPE during retries when there is a HTTP 500 response.

TracingExecutionInterceptor throws NPE in beforeUnmarshalling() when clientSpan is null. This happens during retries as modifyHttpRequest() is only called once initially to setup the clientSpan. Further retry calls do not hit modifyHttpRequest(), which means once beforeUnmarshalling() clears clientSpan, the consequent retry responses have a null clientSpan.

Here is a sandbox repo and test to help you reproduce the error.

Stack trace:

java.util.concurrent.CompletionException: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: null

    at software.amazon.awssdk.utils.CompletableFutureUtils.errorAsCompletionException(CompletableFutureUtils.java:60)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncExecutionFailureExceptionReportingStage.lambda$execute$0(AsyncExecutionFailureExceptionReportingStage.java:51)
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
    at software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:74)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeAttemptExecute(AsyncRetryableStage.java:85)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:144)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:125)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$WrappedErrorForwardingResponseHandler.onError(MakeAsyncHttpRequestStage.java:122)
    at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.lambda$exceptionCaught$0(ResponseHandler.java:137)
    at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.runAndLogError(ResponseHandler.java:171)
    at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.exceptionCaught(ResponseHandler.java:137)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:300)
    at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:825)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
    at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.channelRead(HandlerPublisher.java:396)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
    at software.amazon.awssdk.http.nio.netty.internal.nrs.HttpStreamsHandler.channelRead(HttpStreamsHandler.java:196)
    at software.amazon.awssdk.http.nio.netty.internal.nrs.HttpStreamsClientHandler.channelRead(HttpStreamsClientHandler.java:173)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
    at software.amazon.awssdk.http.nio.netty.internal.LastHttpContentHandler.channelRead(LastHttpContentHandler.java:43)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:227)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:321)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:308)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:422)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
    at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.lang.Thread.run(Thread.java:748)
Caused by: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: null
    at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)
    at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:43)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:201)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:143)
    ... 60 more
Caused by: java.lang.NullPointerException
    at brave.instrumentation.awsv2.TracingExecutionInterceptor.beforeUnmarshalling(TracingExecutionInterceptor.java:126)
    at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.lambda$beforeUnmarshalling$8(ExecutionInterceptorChain.java:155)
    at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.reverseForEach(ExecutionInterceptorChain.java:217)
    at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.beforeUnmarshalling(ExecutionInterceptorChain.java:155)
    at software.amazon.awssdk.core.internal.http.async.AsyncAfterTransmissionInterceptorCallingResponseHandler.beforeUnmarshalling(AsyncAfterTransmissionInterceptorCallingResponseHandler.java:57)
    at software.amazon.awssdk.core.internal.http.async.AsyncAfterTransmissionInterceptorCallingResponseHandler.onHeaders(AsyncAfterTransmissionInterceptorCallingResponseHandler.java:67)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$WrappedErrorForwardingResponseHandler.onHeaders(MakeAsyncHttpRequestStage.java:112)
    at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.channelRead0(ResponseHandler.java:89)
    at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.channelRead0(ResponseHandler.java:66)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
    ... 48 more
anuraaga commented 4 years ago

Thanks for reporting - just for reference, it reminds me of https://github.com/openzipkin/zipkin-aws/issues/147 but now for v2.

anuraaga commented 4 years ago

Though found we do have a comment mentioning retries that may not reflect reality

https://github.com/openzipkin/zipkin-aws/blob/6f9e011661ea88eeacbcd3caf4b71a4ec09aecb4/brave-instrumentation/aws-java-sdk-v2-core/src/main/java/brave/instrumentation/awsv2/TracingExecutionInterceptor.java#L82

@wpan1 By the way what version of the AWS sdk are you using? Want to confirm it's not something that changed in a version update

wpan1 commented 4 years ago

I'm using 2.10.73

sguillope commented 4 years ago

Though found we do have a comment mentioning retries that may not reflect reality

Yeah we thought that was odd since the javadoc on AWS SDK's ExecutionInterceptor#modifyHttpRequest does not indicate potential for multiple invocations (whereas it does for some other methods)

anuraaga commented 4 years ago

Investigated the SDK and I don't think it's currently possible to instrument it such that we can model each retry as a separate span as we currently are trying to do. I filed a feature request to see if the SDK can gain a knob for this

https://github.com/aws/aws-sdk-java-v2/issues/1741

I guess an NPE is worse than not being able to model the retries - @adriancole do you think we can switch to just having a single client span?

codefromthecrypt commented 4 years ago

single client span happens in other instrumentation who are blind to retry handling, so sounds fine to me until a knob becomes available