open-telemetry / opentelemetry-java-instrumentation

OpenTelemetry auto-instrumentation and instrumentation libraries for Java
https://opentelemetry.io
Apache License 2.0
1.9k stars 829 forks source link

Tracing Context Breaks When onNext() Method Invokes Multiple Times for a single request #12317

Open NaveenRamu opened 1 day ago

NaveenRamu commented 1 day ago

Describe the bug

When using the Subscriber interface with OpenTelemetry (Otel) tracing in a reactive environment, the tracing continuity breaks when the onNext() method is invoked multiple times (e.g., during message processing in chunks). The trace context is lost, leading to incomplete or broken traces. This is likely because the tracing context, which is stored in thread-local variables, is not being propagated correctly across asynchronous boundaries. and generate a new tracing ID for the next invoke.

Similarly when the onComplete() method is invoked at the end of the request new trace ID will be generated.

Implementation Class example: `public abstract class AbstractSubscriber implements Subscriber {

@OverRide public void onSubscribe(Subscription subscription) { logger.debug("onSubscribe"); this.subscription = subscription; subscription.request(1); }

@OverRide public void onNext(ByteBuf byteBuf) { logger.debug("onNext"); process(byteBuf); if (!isError) { byteBuf.release(); subscription.request(1); } else { byteBuf.release(); subscription.cancel(); ctx.error(cause); } }

@OverRide public void onError(Throwable throwable) { logger.debug("onError"); ctx.error(throwable); }

@OverRide public void onComplete() { requestComplete(); }

}`

Steps to reproduce

Steps to Reproduce:

Implement a Subscriber that processes messages in chunks by invoking onNext() multiple times. Set up OpenTelemetry tracing to trace the process. Notice that after multiple onNext() calls, the trace context is lost, leading to incomplete traces in the distributed trace logs and generating a new trace ID.

Expected behavior

The trace context should propagate correctly across multiple invocations of onNext(), maintaining continuity in the tracing logs.

Actual behavior

New thread ids generating for the RatPack override methods like onNext(). OnError(), onComplete().

Javaagent or library instrumentation version

otelcol version 0.106.1

Environment

JDK: 1.8 OS: CentOs Server: Ratpack (version: Ratpack 2.0.0-rc-1)

Additional context

The issue is likely due to asynchronous execution switching threads and losing the thread-local trace context. Reactive systems typically involve multiple threads, and if the context is not propagated, traces may appear incomplete or broken.

laurit commented 1 day ago

Please provide a minimal sample application that reproduces your issue along with any necessary instructions.

NaveenRamu commented 12 hours ago

Thanks @laurit for the quick reply. Please find the sample application below to reproduce the issue. Steps to reproduce:

  1. Import project.
  2. Download the Java agent jar
  3. Add the mentioned JVM args.
  4. run application.
  5. check different trace IDs in console logs

JVM args: -Xms512m -Xmx2048m -XX:MaxDirectMemorySize=1g -XX:+HeapDumpOnOutOfMemoryError -Djdk.tls.maxCertificateChainLength=20 -XX:+UseParallelGC -javaagent:D:/opentelemetry-javaagent.jar -Dotel.service.name=receiver -Dotel.javaagent.debug=true -Dotel.traces.exporter=none -Dotel.resource.attributes=service.name=receiver,environment=test,application=receiver -Dotel.exporter.otlp.traces.protocol=grpc -Dotel.metrics.exporter=none -Dotel.logs.exporter=none -Dotel.instrumentation.methods.include=com.datastreaming.handler.AbstractHandler[handle];com.datastreaming.handler.GlobalHandler[handleRequest];com.datastreaming.subscriber.JsonSubscriber[process,sendMessagePulsar,requestComplete];com.datastreaming.handler.AbstractSubscriber[onSubscribe,onNext,onError,onComplete]

Git link for sample code: https://github.com/NaveenRamu/datastreaming.git

laurit commented 9 hours ago

@NaveenRamu I tried your application with current development version of agent. Set traces exporter to console, started your app, and made a request to http://localhost:5050/. The output is

[otel.javaagent 2024-09-24 17:30:03:167 +0300] [ratpack-compute-1-2] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'AbstractSubscriber.onSubscribe' : 7157d620876d77ff1f1e551e358250d5 09e3bd5ad1e2344c INTERNAL [tracer: io.opentelemetry.methods:2.9.0-alpha-SNAPSHOT] AttributesMap{data={code.function=onSubscribe, thread.id=18, code.namespace=com.datastreaming.handler.AbstractSubscriber, thread.name=ratpack-compute-1-2}, capacity=128, totalAddedValues=4}
[otel.javaagent 2024-09-24 17:30:03:168 +0300] [ratpack-compute-1-2] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'JsonSubscriber.requestComplete' : 7157d620876d77ff1f1e551e358250d5 a45ffb37171c565a INTERNAL [tracer: io.opentelemetry.methods:2.9.0-alpha-SNAPSHOT] AttributesMap{data={code.function=requestComplete, thread.id=18, code.namespace=com.datastreaming.subscriber.JsonSubscriber, thread.name=ratpack-compute-1-2}, capacity=128, totalAddedValues=4}
[otel.javaagent 2024-09-24 17:30:03:170 +0300] [ratpack-compute-1-2] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'AbstractSubscriber.onComplete' : 7157d620876d77ff1f1e551e358250d5 fb57540d0729e99a INTERNAL [tracer: io.opentelemetry.methods:2.9.0-alpha-SNAPSHOT] AttributesMap{data={code.function=onComplete, thread.id=18, code.namespace=com.datastreaming.handler.AbstractSubscriber, thread.name=ratpack-compute-1-2}, capacity=128, totalAddedValues=4}
[otel.javaagent 2024-09-24 17:30:03:171 +0300] [ratpack-compute-1-2] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'GlobalHandler.handleRequest' : 7157d620876d77ff1f1e551e358250d5 8dabffe62bda0ad5 INTERNAL [tracer: io.opentelemetry.methods:2.9.0-alpha-SNAPSHOT] AttributesMap{data={code.function=handleRequest, thread.id=18, code.namespace=com.datastreaming.handler.GlobalHandler, thread.name=ratpack-compute-1-2}, capacity=128, totalAddedValues=4}
[otel.javaagent 2024-09-24 17:30:03:171 +0300] [ratpack-compute-1-2] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'AbstractHandler.handle' : 7157d620876d77ff1f1e551e358250d5 feac27808e65e1eb INTERNAL [tracer: io.opentelemetry.methods:2.9.0-alpha-SNAPSHOT] AttributesMap{data={code.function=handle, thread.id=18, code.namespace=com.datastreaming.handler.AbstractHandler, thread.name=ratpack-compute-1-2}, capacity=128, totalAddedValues=4}
[otel.javaagent 2024-09-24 17:30:03:173 +0300] [ratpack-compute-1-2] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'GET' : 7157d620876d77ff1f1e551e358250d5 dbf72c24cb798b40 SERVER [tracer: io.opentelemetry.netty-4.1:2.9.0-alpha-SNAPSHOT] AttributesMap{data={http.request.method=GET, server.port=5050, server.address=localhost, client.address=0:0:0:0:0:0:0:1, thread.id=18, network.peer.address=0:0:0:0:0:0:0:1, url.path=/, http.response.status_code=200, network.protocol.version=1.1, network.peer.port=51356, user_agent.original=curl/8.7.1, url.scheme=http, thread.name=ratpack-compute-1-2}, capacity=128, totalAddedValues=13}

As far as I can tell there is only one trace (7157d620876d77ff1f1e551e358250d5).