open-telemetry / opentelemetry-java-instrumentation

OpenTelemetry auto-instrumentation and instrumentation libraries for Java
https://opentelemetry.io
Apache License 2.0

Incoming spring-boot SQS message handler not using trace id from AWSTraceHeader #4788

Open rage-shadowman opened 2 years ago

rage-shadowman commented 2 years ago

Describe the bug: Auto-instrumentation does not take the trace id from the AWSTraceHeader attribute when receiving an SQS message.

Steps to reproduce: Use a Spring @SqsListener method (with or without the @WithSpan annotation) to process an incoming SQS message; a minimal sketch of such a listener follows.
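
For reference, a minimal sketch of the listener described above. The class, queue name, and payload type are placeholders, and @WithSpan is imported from io.opentelemetry.extension.annotations as shipped with agents of that era (it has since moved to io.opentelemetry.instrumentation.annotations):

```java
import io.opentelemetry.extension.annotations.WithSpan;
import org.springframework.cloud.aws.messaging.listener.annotation.SqsListener;
import org.springframework.stereotype.Component;

@Component
public class OrderListener {

    // With @WithSpan the agent creates an INTERNAL span for this method,
    // but under a fresh trace id; without it, no handler span is created.
    @WithSpan
    @SqsListener("my-queue") // placeholder queue name
    public void handle(String payload) {
        // business logic; spans created here end up in the new,
        // unrelated trace instead of the producer's trace
    }
}
```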

What did you expect to see? I expect the span that executes the SQS handler to have the trace id from the AWSTraceHeader attribute of the SQS message.

What did you see instead? The SQS message is handled within a new span with a new trace id, unrelated to the AWSTraceHeader.

What version are you using? v1.7.2

Environment
Compiler: OpenJDK 1.8.0_252
OS: Debian buster
Runtime (if different from JDK above): OpenJDK build 1.8.0_312-b07
OS (if different from OS compiled on): Debian buster

Additional context: Running a spring-boot application in an ECS task. I'm not even sure this is supposed to work, since the documentation is sparse, but from looking at the sources here it looks like there is code in place to handle it, so I'm filing this as a bug report.
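
For anyone who needs to bridge the gap manually in the meantime, a rough workaround is to read the AWSTraceHeader system attribute yourself (it must be requested explicitly on receive, e.g. receiveMessageRequest.withAttributeNames("AWSTraceHeader") with AWS SDK v1) and extract the context with the X-Ray propagator. A sketch, assuming the opentelemetry-extension-aws artifact (its AwsXrayPropagator has since moved to opentelemetry-java-contrib); the SqsTraceContext class name is made up for this sketch:

```java
import java.util.Collections;
import java.util.Map;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapGetter;
import io.opentelemetry.extension.aws.AwsXrayPropagator;

public final class SqsTraceContext {

    // AWSTraceHeader carries an X-Ray formatted header, e.g.
    // "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
    private static final TextMapGetter<Map<String, String>> GETTER =
        new TextMapGetter<Map<String, String>>() {
            @Override
            public Iterable<String> keys(Map<String, String> carrier) {
                return carrier.keySet();
            }

            @Override
            public String get(Map<String, String> carrier, String key) {
                return carrier == null ? null : carrier.get(key);
            }
        };

    // Parses the raw AWSTraceHeader value into an OpenTelemetry Context
    // by presenting it to the X-Ray propagator under its usual header key.
    public static Context extract(String awsTraceHeader) {
        Map<String, String> carrier =
            Collections.singletonMap("X-Amzn-Trace-Id", awsTraceHeader);
        return AwsXrayPropagator.getInstance()
            .extract(Context.current(), carrier, GETTER);
    }
}
```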

tmarszal commented 2 years ago

I'm experiencing a similar situation. My @SqsListener method either gets a span with a different traceId than the one coming with the SQS message, or no span at all, depending on whether the @WithSpan annotation is present.

I created a relatively small application to replicate this: https://github.com/tmarszal/opentelemetry-java-instrumentation-issue-4478-demo

TL;DR: what I think is missing here is auto-instrumentation around SimpleMessageListenerContainer#executeMessage.

When I start a mock SQS (with elasticmq) and my test app, and send a GET request to it (to trigger SQS message creation), I get the following spans:

[otel.javaagent 2022-03-30 22:26:35:061 +0200] [http-nio-8080-exec-1] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'SQS.GetQueueUrl' : 85355d0cef3238ef479010fcfa0a886e a3b402da324bf29d CLIENT [tracer: io.opentelemetry.aws-sdk-1.11:1.12.1] AttributesMap{data={net.peer.port=9324, http.url=http://localhost:9324, rpc.system=aws-api, thread.name=http-nio-8080-exec-1, http.method=POST, http.status_code=200, thread.id=23, net.transport=ip_tcp, net.peer.name=localhost, http.flavor=1.1, rpc.method=GetQueueUrl, rpc.service=AmazonSQS}, capacity=128, totalAddedValues=12}
[otel.javaagent 2022-03-30 22:26:35:108 +0200] [http-nio-8080-exec-1] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'HTTP POST' : 85355d0cef3238ef479010fcfa0a886e 9c4ee914283c8f87 CLIENT [tracer: io.opentelemetry.apache-httpclient-4.0:1.12.1] AttributesMap{data={net.peer.port=9324, http.url=http://localhost:9324/000000000000/test-queue, thread.name=http-nio-8080-exec-1, http.method=POST, http.status_code=200, thread.id=23, net.transport=ip_tcp, net.peer.name=localhost, http.user_agent=aws-sdk-java/1.12.132 Linux/5.13.0-37-generic OpenJDK_64-Bit_Server_VM/17.0.2+8-Ubuntu-120.04 java/17.0.2 vendor/Private_Build cfg/retry-mode/legacy, http.flavor=1.1}, capacity=128, totalAddedValues=10}
[otel.javaagent 2022-03-30 22:26:35:110 +0200] [http-nio-8080-exec-1] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'SQS.SendMessage' : 85355d0cef3238ef479010fcfa0a886e a10f5c883b67393d PRODUCER [tracer: io.opentelemetry.aws-sdk-1.11:1.12.1] AttributesMap{data={net.peer.port=9324, http.url=http://localhost:9324, rpc.system=aws-api, thread.name=http-nio-8080-exec-1, http.method=POST, http.status_code=200, thread.id=23, net.transport=ip_tcp, net.peer.name=localhost, http.flavor=1.1, rpc.method=SendMessage, rpc.service=AmazonSQS}, capacity=128, totalAddedValues=12}
[otel.javaagent 2022-03-30 22:26:35:127 +0200] [http-nio-8080-exec-1] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'Controller.send' : 85355d0cef3238ef479010fcfa0a886e 831aad1a84057b97 INTERNAL [tracer: io.opentelemetry.spring-webmvc-3.1:1.12.1] AttributesMap{data={thread.name=http-nio-8080-exec-1, thread.id=23}, capacity=128, totalAddedValues=2}
[otel.javaagent 2022-03-30 22:26:35:131 +0200] [http-nio-8080-exec-1] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - '/send' : 85355d0cef3238ef479010fcfa0a886e 61b74a5760171311 SERVER [tracer: io.opentelemetry.tomcat-7.0:1.12.1] AttributesMap{data={net.peer.port=56858, http.scheme=http, http.method=GET, http.user_agent=curl/7.68.0, http.route=/send, http.response_content_length=0, http.host=localhost:8080, net.peer.ip=127.0.0.1, thread.name=http-nio-8080-exec-1, http.status_code=200, thread.id=23, net.transport=ip_tcp, http.flavor=1.1, http.target=/send}, capacity=128, totalAddedValues=14}
[otel.javaagent 2022-03-30 22:26:35:139 +0200] [simpleMessageListenerContainer-1] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'SQS.ReceiveMessage' : 85355d0cef3238ef479010fcfa0a886e 6fd7973577d9d34c CONSUMER [tracer: io.opentelemetry.aws-sdk-1.11:1.12.1] AttributesMap{data={net.peer.port=9324, http.method=POST, net.peer.name=localhost, http.user_agent=aws-sdk-java/1.12.132 Linux/5.13.0-37-generic OpenJDK_64-Bit_Server_VM/17.0.2+8-Ubuntu-120.04 java/17.0.2 vendor/Private_Build cfg/retry-mode/legacy, rpc.method=ReceiveMessage, http.url=http://localhost:9324, rpc.system=aws-api, thread.name=simpleMessageListenerContainer-1, http.status_code=200, thread.id=35, net.transport=ip_tcp, http.flavor=1.1, rpc.service=AmazonSQS}, capacity=128, totalAddedValues=13}
[otel.javaagent 2022-03-30 22:26:35:139 +0200] [simpleMessageListenerContainer-1] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'SQS.ReceiveMessage' : c5a607141c025a843d80c60bc22d2949 2959ba6d910a164c CLIENT [tracer: io.opentelemetry.aws-sdk-1.11:1.12.1] AttributesMap{data={net.peer.port=9324, http.url=http://localhost:9324, rpc.system=aws-api, thread.name=simpleMessageListenerContainer-1, http.method=POST, http.status_code=200, thread.id=35, net.transport=ip_tcp, net.peer.name=localhost, http.flavor=1.1, rpc.method=ReceiveMessage, rpc.service=AmazonSQS}, capacity=128, totalAddedValues=12}
[otel.javaagent 2022-03-30 22:26:35:180 +0200] [pool-1-thread-1] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'SQS.DeleteMessage' : 06307b5ee75ee5646bfc374b4457be88 1426a0ec71fe8f5c CLIENT [tracer: io.opentelemetry.aws-sdk-1.11:1.12.1] AttributesMap{data={net.peer.port=9324, http.url=http://localhost:9324, rpc.system=aws-api, thread.name=pool-1-thread-1, http.method=POST, http.status_code=200, thread.id=38, net.transport=ip_tcp, net.peer.name=localhost, http.flavor=1.1, rpc.method=DeleteMessage, rpc.service=AmazonSQS}, capacity=128, totalAddedValues=12}

My observations:

  1. The first set of spans uses the traceId generated by the incoming GET request, as expected.
  2. The client SQS.ReceiveMessage span has a new traceId; it starts before the request and ends after it. It covers only the request to SQS and begins before we know the message (there may even be no message), so it needs to have a different traceId.
  3. The consumer SQS.ReceiveMessage span has the traceId retrieved from the message, but it is started and ended instantly (see: TracingRequestHandler#createConsumerSpan). It looks like a good candidate to be the span covering the whole SQS message handling, but the scope of TracingRequestHandler is limited to the SQS request, so there is no way to make it work that way.
  4. Handling of the received message (past the SQS request) is done by spring-cloud-aws-messaging and is not auto-instrumented; a sketch of what manual instrumentation at that point could look like follows this list.
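
To illustrate observation 4: until there is auto-instrumentation at that point, the listener body can be wrapped in a manually created CONSUMER span parented to the extracted context, so that everything instrumented inside it (JDBC calls, outgoing HTTP, etc.) joins the producer's trace. A sketch, reusing the hypothetical SqsTraceContext helper from the earlier comment; the tracer name and span name are placeholders:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;

public class TracedSqsHandler {

    private static final Tracer TRACER =
        GlobalOpenTelemetry.getTracer("manual-sqs-instrumentation"); // placeholder

    // Runs the actual message processing inside a CONSUMER span parented to
    // the context extracted from AWSTraceHeader (SqsTraceContext is the
    // hypothetical helper sketched earlier in this thread).
    public void handle(String awsTraceHeader, Runnable processMessage) {
        Context upstream = SqsTraceContext.extract(awsTraceHeader);
        Span span = TRACER.spanBuilder("my-queue process") // placeholder name
            .setSpanKind(SpanKind.CONSUMER)
            .setParent(upstream)
            .startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // anything instrumented inside (JDBC, HTTP clients, ...) now
            // joins the producer's trace instead of starting a new one
            processMessage.run();
        } finally {
            span.end();
        }
    }
}
```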
qaisa-everc commented 1 year ago

Hi, I'm facing the exact same issue when using opentelemetry-aws-sdk-1.11 together with org.springframework.cloud:spring-cloud-sleuth-otel-autoconfigure:1.1.3.

Will this issue be tackled in the near future?

yusufcemalcelebi1 commented 1 year ago

I'm having the same issue using the OpenTelemetry auto-instrumentation together with the @SqsListener annotation. It generates different traceIds for ReceiveMessage and the SQS handler method. There are some DB calls inside the handler method, and the tracing backend can't correlate these DB calls with the ReceiveMessage span. Any idea how to resolve this?

trask commented 1 year ago

maybe related? #8331

mateuszrzeszutek commented 1 year ago

I think this is probably caused by the fact that @SqsListener is conceptually the same as a callback that gets the message(s) and processes them, while the SQS (AWS SDK) instrumentation creates zero-length CONSUMER spans for SQS messages because it roughly follows the "Kafka model" of receiving messages.
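
For comparison, a sketch of what the two models look like in the OpenTelemetry API: parenting ties the processing span into the producer's trace (what this issue asks for), while a span link keeps a separate trace that merely points back at the producer span, which is roughly what the zero-length CONSUMER span model relies on. The span names are illustrative:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;

public class ParentVsLink {

    // Option 1: the processing span joins the producer's trace
    // (what this issue asks for).
    static Span asChild(Tracer tracer, Context upstream) {
        return tracer.spanBuilder("sqs process")
            .setSpanKind(SpanKind.CONSUMER)
            .setParent(upstream)
            .startSpan();
    }

    // Option 2: the processing span starts a new trace but carries a link
    // back to the producer span (closer to the "Kafka model" above).
    static Span withLink(Tracer tracer, Context upstream) {
        return tracer.spanBuilder("sqs process")
            .setSpanKind(SpanKind.CONSUMER)
            .addLink(Span.fromContext(upstream).getSpanContext())
            .startSpan();
    }
}
```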