open-telemetry / opentelemetry-java-instrumentation

OpenTelemetry auto-instrumentation and instrumentation libraries for Java
https://opentelemetry.io
Apache License 2.0
1.9k stars 829 forks source link

Context propagation via SQS not working #7120

Open syr opened 1 year ago

syr commented 1 year ago

Describe the bug Even though context propagation via SQS should have been supported since https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/6199 was released I cannot confirm that it works in a real application.

In short: spans created for sqs.sendMessage and sqs.receiveMessage are not associated with the same traceId but have different ones.

Steps to reproduce I put quite some effort to build a reproducer (based on https://github.com/quarkusio/quarkus-quickstarts/tree/main/amazon-sqs-quickstart) to make it as simple as possible to see the issue: https://github.com/syr/sqs-otel-context-propagation-repro

The necessary setup steps are documented in README.md

What did you expect to see? spans created for sqs.sendMessage and sqs.receiveMessage to have same traceId, see https://github.com/syr/sqs-otel-context-propagation-repro#expected-behaviour

What did you see instead? spans created for sqs.sendMessage and sqs.receiveMessage have different traceId, see https://github.com/syr/sqs-otel-context-propagation-repro#actual-behaviour

span of sqs.receiveMessage also seems to have no parent relation to span of sqs.sendMessage, as far as I could see in the debugger.

What version are you using? opentelemetry-javaagent - version: 1.19.2

Environment Compiler: openjdk version "17.0.2" 2022-01-18 OpenJDK Runtime Environment Temurin-17.0.2+8 (build 17.0.2+8) OpenJDK 64-Bit Server VM Temurin-17.0.2+8 (build 17.0.2+8, mixed mode, sharing)

OS: Win10

Java Lib: Quarkus 2.13.3.Final

laurit commented 1 year ago

Sqs context propagation uses message system attribute AWSTraceHeader. As far as I know localstack does not support it, there is a pro version of localstack where supposedly this works. I'd firstly verify whether localstack allows passing AWSTraceHeader. In our tests we use SQSRestServerBuilder instead of localstack, see https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/a4938a70e051d621b01f27b877e10bb9c36b8b94/instrumentation/aws-sdk/aws-sdk-2.2/testing/src/main/groovy/io/opentelemetry/instrumentation/awssdk/v2_2/AbstractAws2SqsTracingTest.groovy#L49 The receive span created is only active for the duration of receiveMessage call. In https://github.com/syr/sqs-otel-context-propagation-repro/blob/4b765d25b18ded970216197d02c50033fcd459cf/src/main/java/org/acme/sqs/QuarksShieldSyncResource.java#L45 you seem to be expecting it to be active after call to receiveMessage has completed.

syr commented 1 year ago

Sqs context propagation uses message system attribute AWSTraceHeader. As far as I know localstack does not support it, there is a pro version of localstack where supposedly this works. I'd firstly verify whether localstack allows passing AWSTraceHeader. In our tests we use SQSRestServerBuilder instead of localstack, see

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/a4938a70e051d621b01f27b877e10bb9c36b8b94/instrumentation/aws-sdk/aws-sdk-2.2/testing/src/main/groovy/io/opentelemetry/instrumentation/awssdk/v2_2/AbstractAws2SqsTracingTest.groovy#L49

Hi @laurit, thanks for the input. I tested it with real SQS and see the AWSTraceHeader in message system attributes.

The receive span created is only active for the duration of receiveMessage call. In https://github.com/syr/sqs-otel-context-propagation-repro/blob/4b765d25b18ded970216197d02c50033fcd459cf/src/main/java/org/acme/sqs/QuarksShieldSyncResource.java#L45 you seem to be expecting it to be active after call to receiveMessage has completed.

Yes it was just a desperate test as I was hoping that after receiveMessage the Span.current().getSpanContext() is updated by trace infos from the message by otel agent auto-instrumentation, but it is obviously not.

I created another reproducer which is much closer to the actual system. Essentially its two quartz jobs, sending/receiving to/from SQS. There are many question marks in my head.

  1. I wonder why traceId is not restored (by otel agent auto instrumentation) after sqs.receiveMessage which returns message containing traceheader? Logs from new reproducer https://github.com/syr/sqs-otel-quartz
10:25:22 INFO  traceId=ca85183e8a29b85686c24c3e22f65bc8 spanId=e27da030cd99ee46 [or.ac.qu.SqsSendReceive] (vert.x-worker-thread-0) message sent     ID=2a8a628d-cc37-4bca-95e7-a8bf3cdc753a
10:25:23 INFO  traceId=c9f4aecfd13c6e9a5ea6d26963854a3e spanId=30abf52f43fd055d [or.ac.qu.SqsSendReceive] (vert.x-worker-thread-0) message received ID=2a8a628d-cc37-4bca-95e7-a8bf3cdc753a
10:25:23 INFO  traceId=c9f4aecfd13c6e9a5ea6d26963854a3e spanId=30abf52f43fd055d [or.ac.qu.SqsSendReceive] (vert.x-worker-thread-0) message system attributes: AWSTraceHeader="Root=1-ca85183e-8a29b85686c24c3e22f65bc8;Parent=65bfb1100c02c86a;Sampled=1"
10:25:23 INFO  traceId=c9f4aecfd13c6e9a5ea6d26963854a3e spanId=58933b9ad0e77cd6 [or.ac.qu.SqsSendReceive] (vert.x-worker-thread-0) postReceive

Or can this only be done by manual instrumentation by SpanContext.createFromRemoteParent.. like in https://stackoverflow.com/a/72826169/753724 ? (Will add this to the reproducer soon, to test if it is a possible workaround.)

Any example where AWSTraceHeader from received SQS message leads to a correctly continuation of remote trace, thus reflecting Span.current().getSpanContext()?

  1. How to get MDCs updated in a less cumbersome/invasive way which is atm
            MDC.put("traceId", Span.current().getSpanContext().getTraceId());
            MDC.put("spanId", Span.current().getSpanContext().getSpanId());

    everytime a new span is created. Sure I could hide this in an interceptor, however, I don't understand why otel agent is not just keeping MDCs up-to-date

  2. when using async sqs client, sqsAsync.deleteMessage is not instrumented by otel agent..thus mussing in the trace
mateuszrzeszutek commented 1 year ago

How to get MDCs updated in a less cumbersome/invasive way which is atm ... Sure I could hide this in an interceptor, however, I don't understand why otel agent is not just keeping MDCs up-to-date

It is though, see https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/logger-mdc-instrumentation.md

syr commented 1 year ago

How to get MDCs updated in a less cumbersome/invasive way which is atm ... Sure I could hide this in an interceptor, however, I don't understand why otel agent is not just keeping MDCs up-to-date

It is though, see https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/logger-mdc-instrumentation.md

Hm, but then it is obviously not working with jboss logger used by quarkus (io.quarkus.logging.Log), or I don't see the point? If I remove the MDC.put, traceId and spanId are always empty

Maybe this detail mentioned in https://stackoverflow.com/a/65040601/753724 is making problems with the logger mdc instrumentation: Quarkus uses JBoss LogManager as the logger, and its MDC isn't an ordinary ThreadLocal, it's an InheritableThreadLocal

mateuszrzeszutek commented 1 year ago

Hm, but then it is obviously not working with jboss logger used by quarkus (io.quarkus.logging.Log), or I don't see the point? If I remove the MDC.put, traceId and spanId are always empty

Hmm, yeah, we do not instrument JBoss logger MDC right now. That's a missing feature.

laurit commented 1 year ago

I wonder why traceId is not restored (by otel agent auto instrumentation) after sqs.receiveMessage which returns message containing traceheader?

That wouldn't be a good idea. If we are going to set the active trace/span on the thread we need to remove it and restore the original trace at some point after receiveMessage has exited we won't know when to do this. From tracing perspective a callback that processes the message is much better. There we know exactly what message is being processed and when the processing ends. That being said you could extract the trace from the message and set it as the current trace manually as you know where in your code the message is processed.

Hm, but then it is obviously not working with jboss logger used by quarkus (io.quarkus.logging.Log), or I don't see the point? If I remove the MDC.put, traceId and spanId are always empty

We have instrumentation for jboss logging mdc https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/jboss-logmanager/jboss-logmanager-mdc-1.1/javaagent that uses org.jboss.logmanager:jboss-logmanager:1.1.0.GA but your sample app uses org.jboss.logmanager:jboss-logmanager-embedded:jar:1.0.10. Someone needs to figure out whether new instrumentation is needed for that.