open-telemetry / opentelemetry-java-instrumentation

OpenTelemetry auto-instrumentation and instrumentation libraries for Java
https://opentelemetry.io
Apache License 2.0
1.93k stars 845 forks source link

Context across threads: Kafka producer interceptor related span as child #6297

Closed ppatierno closed 2 years ago

ppatierno commented 2 years ago

I was comparing how OpenTracing (with Jaeger client) and OpenTelemetry work when it comes to share context across threads in the same application.

We are currently using OpenTracing in our Strimzi HTTP - Kafka bridge, where we create a span when an HTTP request comes in for sending a message to Kafka and then leaving the Kafka instrumented library to create the corresponding Kafka operation span which is set as child of the HTTP request span. NOTE: The main point to highlight is that the HTTP request span is created in a thread, but the Kafka operation one is created in a different thread (because it's done through the Kafka producer interceptor mechanism).

With OpenTracing/Jeager client everything works fine because it injects an uber-trace-id in the Kafka record headers which is used in the producer interceptor thread (using just the default TracingProducerInterceptor class) to get the span context of the HTTP request and make the new span as child of it (more details here).

When moving to OpenTelemetry, we noticed that the default TracingProducerInterceptor doesn't work out of the box. It gets the current context (by using Context.current()) but as per my understanding the context is handle by a ThreadLocal based storage so what is set as context (in the thread where the HTTP request span is created) is not available in the thread executing the producer interceptor code.

The only way we found until now was creating a custom ContextAwareTracingProducerInterceptor (inheriting from TracingProducerInterceptor) and getting the HTTP request span somehow by making it as part of the current context and then invoking the base class. Something like this ...

public static class ContextAwareTracingProducerInterceptor<K, V> extends TracingProducerInterceptor<K, V> {
        @Override
        public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record) {
            Span span = # ... get HTTP request span ...
            try (Scope ignored = span.makeCurrent()) {
                return super.onSend(record);
            }
        }
    }

I was wondering if this is the only "hacky" way to do such a thing (context across threads) with OpenTelemetry and why it isn't implemented as OpenTracing, or maybe there is something obvious we are missing.

The question could be more general: how to have context propagated across threads in the same application?

mateuszrzeszutek commented 2 years ago

Hey @ppatierno ,

NOTE: The main point to highlight is that the HTTP request span is created in a thread, but the Kafka operation one is created in a different thread

The "OpenTelemetry way" to solve this would be to attach the current context (at the time of the HTTP request processing) to the async task that you're adding to a queue/executor. If you're using an Executor or ExecutorService consider wrapping them with Context.taskWrapping() so that they automatically attach context to submitted tasks; alternatively, you can wrap the tasks (Runnable, Callable, etc) themselves using Context.wrap().

With OpenTracing/Jeager client everything works fine because it injects an uber-trace-id in the Kafka record headers which is used in the producer interceptor thread (using just the default TracingProducerInterceptor class) to get the span context of the HTTP request and make the new span as child of it (more details here).

The OpenTelemetry Context is much more than just a traceId&spanId pair (e.g. you can put baggage there, or completely custom things that you want to propagate down), and we strongly recommend propagating the entire context within a single application.

ppatierno commented 2 years ago

Thanks for the explanation @mateuszrzeszutek !

The problem I see is when you don't have access to such an executor or thread because it's inside the library you are using. Still making the point with Kafka client ...

When on a KafkaProducer you call the send method, what it does is actually putting the message into an internal buffer and then another internal thread in the library will take care of it for sending. This is the thread which executes the producer interceptor. So I have an HTTP span in the thread I have control and I call the KafkaProducer.send, but the Kafka tracing will happen in the interceptor (on a different thread) and context will be different without a way for me to "inject" the right context because a "reference" to that thread is not available somehow. It's for Kafka but could be for any other library? Is there anything I am missing?

mateuszrzeszutek commented 2 years ago

Hmm, KafkaProducer calls the interceptor as the first thing when send() is called -- and our interceptor creates the producer span onSend(). Are you sure you're not using some kind of async wrapper over it?

ppatierno commented 2 years ago

Hey @mateuszrzeszutek I actually wrote this right after my previous comment but forgot to click on "Comment" button! LOL Yeah it's in the same thread, I got confused by the underneath network thread for reading batch and sending messages. So the problem is the Vert.x wrapper usage around it. In this case we have a workaround about injecting an ExecutorService wrapped with the right context. Even if I will investigate why at least the interceptor part in Vert.x is not done in the right thread but IIRC it just wraps around the native send in an executeBlocking so using a separate worker thread :-( Anyway I can close this because I have got what I needed, thanks!

mateuszrzeszutek commented 2 years ago

Ah, I see now. Yeah, you'll need to somehow pass the context along with the record; or implement context propagation in vertx. We don't have a library instrumentation for that, but I see that vertx has a Context concept itself - perhaps it is possible pass the OTel context using it.

ppatierno commented 2 years ago

My feeling is that we should instrument our application by using the "official" vertx-tracing support https://github.com/eclipse-vertx/vertx-tracing I will investigate but not sure how much it's completed.

ppatierno commented 2 years ago

Vert.x seems to have a TP support for tracing here https://github.com/eclipse-vertx/vertx-tracing Of course refactoring everything is not feasible on our side and the feature is TP but we could try to use the VertxContextStorageProvider instead of the default one which is based on thread local.

https://github.com/eclipse-vertx/vertx-tracing/blob/master/vertx-opentelemetry/src/main/java/io/vertx/tracing/opentelemetry/VertxContextStorageProvider.java

This Vert.x storage seems to be able to correlate Vert.x context with OpenTelemetry context and that could help. Right now I am not able to set this storage yet :-(

mateuszrzeszutek commented 2 years ago

Try setting the system property io.opentelemetry.context.contextStorageProvider to the full name of the VertxContextStorageProvider class somewhere right at the start of your application, before you first use/load OpenTelemetry.

ppatierno commented 2 years ago

@mateuszrzeszutek actually just adding the dependency to the vertx-opentelemetry library seems to work with a kind of auto-discovery. Yesterday, I didn't notice that that dependency was already there but with test scope so not available at runtime. Now it works out of box. Does it sound right to you? Or we should set the system property above anyway?

alesj commented 2 years ago

Or we should set the system property above anyway?

Looking at the code, you only need to setup sys prop if you have multiple providers on the classpath -- and you need to pick an explicit one. We don't have that (for now), so I wouldn't bother -- it will show if we need it, as it will fail at startup. :-)

mateuszrzeszutek commented 2 years ago

Looking at the code, you only need to setup sys prop if you have multiple providers on the classpath -- and you need to pick an explicit one.

Yeah, that's correct, I forgot about that.

Does it sound right to you? Or we should set the system property above anyway?

That look alright, you don't have to set the property after all.

Glad that it's working for you!