quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0
13.54k stars 2.61k forks source link

OpenTelemetry QuarkusContextStorage does not work correctly outside of vert-x duplicated context #36040

Open bcluap opened 11 months ago

bcluap commented 11 months ago

Describe the bug

In code running in Quarkus but inside of vert-x but without a duplicated context (e.g. one uses the vert-x web HTTP client), Opentelemetry context calls such as attach() and then current() do not work as expected in that the context passed in attach() is not that when calling current() immediately thereafter.

Expected behavior

Opentelemetry API's should work outside of a vert-x duplicated context and should fall back to the FALLBACK_CONTEXT_STORAGE

Actual behavior

This code in QuarkusContextStorage results in a call to attach() and then a call to current() both creating a new duplicated context. The state set in attach() is thus not available in the subsequent call to current(). See this:

public static io.vertx.core.Context getVertxContext() {
        io.vertx.core.Context context = Vertx.currentContext();
        if (context != null && VertxContext.isOnDuplicatedContext()) {
            return context;
        } else if (context != null) {
            io.vertx.core.Context dc = VertxContext.createNewDuplicatedContext(context);
            setContextSafe(dc, true);
            return dc;
        }
            return null;
        }
    }

The above code always returns a brand new duplicated context when running within vert-x but outside of a quarkus extension that has set up a duplicated context. This results in the QuarkusContextStorage being unusable as every call looses the prior calls context. I would say it would be better to do this:

public static io.vertx.core.Context getVertxContext() {
        io.vertx.core.Context context = Vertx.currentContext();
        if (context != null && VertxContext.isOnDuplicatedContext()) {
            return context;
        } else {
            return null;
        }
    }

That way the fallback context is used and propagation works as expected.

How to Reproduce?

Use a vert-x plugin directly along with calls to activate a child span and then get the current span

Output of uname -a or ver

No response

Output of java -version

No response

GraalVM version (if different from Java)

No response

Quarkus version or git rev

3.3.0

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

quarkus-bot[bot] commented 11 months ago

/cc @brunobat (opentelemetry), @radcortez (opentelemetry)

brunobat commented 11 months ago

The duplicated context was created for things like the OTel propagation. Reactive code cannot rely on ThreadLocal. It's not possible to seamlessly transition from imperative to reactive paradigms without it. Why do you need this feature?

bcluap commented 11 months ago

Hi Bruno,

Agreed it's not optimal to rely on threadlocal but the current logic results in no state being kept whatsoever unless there is a duplicated context already present. Maybe I'm going about this wrong in the first place. My challenge is that I'm using the vert-x mutiny HTTP client reactively and at the point where I get the result reactively I want to now add tags onto the OTEL trace. I have the span context and want to make it active and then know that log lines will have the MDC data and any calls to set tags and such will appear on the trace. The issue is that calls to make the span current (span.makeCurrent()) sets the span on a brand new duplicated context. Then calls that get the current OTEL span return a NOOP span as they again create a brand new duplicated context.

So I guess the issue is that the Quarkus OTEL tracing implementation does not work at all if one is using vert-x modules directly. Maybe there's a way to annotate or otherwise run ones own code in the same way as Quarkus does where the context is a duplicated context - is there a standard pattern for this?

geoand commented 11 months ago

So I guess the issue is that the Quarkus OTEL tracing implementation does not work at all if one is using vert-x modules directly.

Would you be able to provide a sample showing this?

bcluap commented 11 months ago

I'll try and create a reproducer this afternoon

bcluap commented 11 months ago
git clone https://github.com/bcluap/quarkus-examples.git
cd quarkus-examples/resteasy-reactive
mvn verify

The code starts a new clean platform thread to simulate a situation where one is outside of a vertx context. It then uses the vertx web client and OTEL. The vertx web client response processing has a vertx context but its not duplicated and in this situation the Quarkus OTEL tracing API's don't manage scope at all as every OTEL context operation is done in a new duplicated context. I then use a hack I came up with last night where I run the code in a duplicated context and it all works as expected. I don't think the hack should be necessary as the Quarkus OTEL context propagation should see that if I'm in a vertx context and its not duplicated then rather use the fallback OTEL context manager as at least it would keep context between calls on the same thread. Perhaps my use case could be avoided at some higher level but in our platform we use the vertx web client extensively (we built our own rest client on top of vertx web prior to resteasy reactive client being available). In scenario 3 the vertx web client is used inside a vertx duplicated context (as resteasy reactive thread context is used) and all works fine without the hack.

So all of this makes me think I maybe need to tell Quarkus to create/set up a duplicated context early on when using our own platform threads. I have even tried @ActivateRequestContext as I thought that this would maybe get quarkus to wrap the code with a duplicated context yet it seems like CDI context does not automatically do this - I dont know the whole relationship between CDI and Vertx contexts so maybe I'm barking up the wrong tree there.

brunobat commented 11 months ago

Will later. CC @cescoffier

geoand commented 11 months ago

@bcluap thanks for the analysis.

I have on question however:

You say that with:


public static io.vertx.core.Context getVertxContext() {
        io.vertx.core.Context context = Vertx.currentContext();
        if (context != null && VertxContext.isOnDuplicatedContext()) {
            return context;
        } else {
            return null;
        }
    }

in QuarkusContextStorage the context does get propagated - why is this?

bcluap commented 11 months ago

Because the "return null" means that this code ends up using the fallback context:

public Scope attach(Context toAttach) {
        io.vertx.core.Context vertxContext = getVertxContext();
        return vertxContext != null && isDuplicatedContext(vertxContext) ? attach(vertxContext, toAttach)
                : FALLBACK_CONTEXT_STORAGE.attach(toAttach);
    }
geoand commented 11 months ago

I definitely think it's a draft PR to at least see on your fork if CI passes

bcluap commented 11 months ago

The current implementation:

    public static io.vertx.core.Context getVertxContext() {
        io.vertx.core.Context context = Vertx.currentContext();
        if (context != null && VertxContext.isOnDuplicatedContext()) {
            return context;
        } else if (context != null) {
            io.vertx.core.Context dc = VertxContext.createNewDuplicatedContext(context);
            setContextSafe(dc, true);
            return dc;
        }
        return null;
    }

Results in triggering the else-if and returns a new duplicated context for the attach as well as any calls to current() as current does this:

 public Context current() {
        io.vertx.core.Context current = getVertxContext();
        if (current != null) {
            return current.getLocal(OTEL_CONTEXT);
        } else {
            return FALLBACK_CONTEXT_STORAGE.current();
        }
    }
bcluap commented 11 months ago

I definitely think it's a draft PR to at least see on your fork if CI passes

Lets see what @cescoffier says as maybe this could all be avoided by creating a duplicated context when in a request scope. My gut feel says that if I have asked Quarkus to put me in a request scope, then a duplicated context should be set up as part of that?

geoand commented 11 months ago

I have asked Quarkus to put me in a request scope, then a duplicated context should be set up as part of that?

Yeah, but you are asking for different request scope.

Essentially the problem boils down to the fact that Vertx clients are not "managed", whereas in the extension that use these under the hood, these kinds of details are taken care of already.

bcluap commented 11 months ago

Agreed, but maybe Quarkus needs an annotation similar to @ActivateRequestContext (maybe @QuarkusContext) that sets up a "managed" context in a standardised way much like the extensions do

geoand commented 11 months ago

Yeah, that's an option for sure. We'll discuss with @cescoffier

cescoffier commented 11 months ago

This is a very tricky situation. There is no good way to support it except by initially creating a duplicated context.

I'm a bit afraid about that fallback, as it is possible to write the data in the event loop thread local, which means it can be completely wrong and corrupted.

Typically, you cannot do this: https://github.com/bcluap/quarkus-examples/blob/master/resteasy-reactive/src/main/java/reactive/scenarios/TestRootResource.java#L53-L55

You need to be on a thread having a duplicated context, inheriting from the original one. Otherwise, there are no good way to propagate data safely and correctly.

geoand commented 11 months ago

@cescoffier I am not particularly interested in that case where an unmanaged thread is used, I am interested in https://github.com/bcluap/quarkus-examples/blob/master/resteasy-reactive/src/main/java/reactive/scenarios/TestRootResource.java#L64 however

cescoffier commented 11 months ago

@geoand You are in a duplicated context in this case. So, I'm surprised the HTTP client does not go back to the duplicated context automatically.

More especially, I would be interested to see on what context this is run: https://github.com/bcluap/quarkus-examples/blob/master/resteasy-reactive/src/main/java/reactive/scenarios/TestRootResource.java#L85

bcluap commented 11 months ago

This is a very tricky situation. There is no good way to support it except by initially creating a duplicated context.

I'm a bit afraid about that fallback, as it is possible to write the data in the event loop thread local, which means it can be completely wrong and corrupted.

Typically, you cannot do this: https://github.com/bcluap/quarkus-examples/blob/master/resteasy-reactive/src/main/java/reactive/scenarios/TestRootResource.java#L53-L55

You need to be on a thread having a duplicated context, inheriting from the original one. Otherwise, there are no good way to propagate data safely and correctly.

If I do need to create my own threads, is there a supported way of bootstrapping the thread so that it behaves nicely inside Quarkus? The example in the recreator is trivial and creating a new thread like that is pointless, but in the real world we have scheduled executors and things in the background that work perfectly in Quarkus with @ActivateRequestContext except for this one use case with OTEL.

cescoffier commented 11 months ago

The request context is also stored in the duplicated context (it follows the same logic as gRPC). The only difference is that you can activate it ... but it will NOT contain the request context data from the original context.

bcluap commented 11 months ago

@geoand You are in a duplicated context in this case. So, I'm surprised the HTTP client does not go back to the duplicated context automatically.

More especially, I would be interested to see on what context this is run: https://github.com/bcluap/quarkus-examples/blob/master/resteasy-reactive/src/main/java/reactive/scenarios/TestRootResource.java#L85

That runs on the correct context. The logging is a bit wrong as I added that case last and didn't update the logs

cescoffier commented 11 months ago

I may use context propagation to do that. But you are paying a cost.

To schedule your own thread, you must write an executor that captures and restores the duplicated context. Note that you will need to be sure that everything stored in the context is thread-safe.

bcluap commented 11 months ago

When we need to pass thread context (CDI and/or transaction) we use org.eclipse.microprofile.context.ThreadContext and this works perfectly. In fact let me add that into the example and see what it does

bcluap commented 11 months ago

I tried this which is how our real-world thread pools propagate context but still no duplicated context: https://github.com/bcluap/quarkus-examples/blob/cee50e1708014a3a3eefa5367c4c6eb0ab92d9df/resteasy-reactive/src/main/java/reactive/scenarios/TestRootResource.java#L47

cescoffier commented 11 months ago

yes, it's using context propagation. Most of the time we want to use the duplicated context, as it's faster/cheaper.

bcluap commented 11 months ago

Ok, so to summarise: When using vertx extensions in a reactive context but outside of a Quarkus-created thread, one must wrap the processing in something like this in order to use OTEL:

public Runnable ensureDuplicateContext(Runnable possiblyWrapped) {
        return () -> {
            if (VertxContext.isOnDuplicatedContext()) {
                possiblyWrapped.run();
            } else {
                Context con = VertxContext.getOrCreateDuplicatedContext();
                if (con == null) {
                    // No vertx context at all - nothing we can do to pass on context
                    possiblyWrapped.run();
                } else {
                    con.runOnContext((x) -> possiblyWrapped.run());
                }
            }
        };
    }
cescoffier commented 11 months ago

con.runOnContext((x) -> possiblyWrapped.run()); is only if run() is non blocking. For blocking use context.executeBlocking.

Another more hidden way is to use what I'm using for virtual threads (https://github.com/quarkusio/quarkus/blob/main/extensions/virtual-threads/runtime/src/main/java/io/quarkus/virtual/threads/ContextPreservingExecutorService.java)

bcluap commented 11 months ago

Hi @cescoffier, @geoand. So I've been looking at workarounds such as the code you shared but the issue here is that actually there is no VertX context at all when using a vertx extension from ones own thread and trying to create one from scratch seems non-trivial - e.g. knowing if the code is blocking or not (which one may not know at the infrastructure level). I'm now back on the train of thought that the OTEL context code has a bug. If I condense the current login into one function, it looks like this:

public Scope attach(Context toAttach) {
    io.vertx.core.Context context = Vertx.currentContext();
    if (context != null && VertxContext.isOnDuplicatedContext()) {
        return attach(context, toAttach);
    } else if (context != null) {
        io.vertx.core.Context dc = VertxContext.createNewDuplicatedContext(context);
        setContextSafe(dc, true);
        return attach(dc, toAttach);
    } else {
        return FALLBACK_CONTEXT_STORAGE.attach(toAttach);
    }
}

My thinking is that if there is a vertx context and its not duplicated, then creating a duplicated context knowing that it will never actually be the context for any subsequent calls (and is effectively being immediately thrown away) is a bug. My sense is that the logic really should be:

public Scope attach(Context toAttach) {
        io.vertx.core.Context context = Vertx.currentContext();
        if (context != null && VertxContext.isOnDuplicatedContext()) {
            return attach(context, toAttach);
        } else {
            return FALLBACK_CONTEXT_STORAGE.attach(toAttach);
        }
}

I understand that this means the context is attached to a thread local, but surely that is better than attaching it to a map on a pojo that has no references when this function ends?

cescoffier commented 11 months ago

No, that is not correct, it will end up putting a local thread in the event loop. So, if you are on the event loop you cannot use the fallback.

bcluap commented 11 months ago

Maybe I'm not understanding correctly. Yes it will end up adding a thread local on the event loop. Yes I agree that that means it wont propagate past that event being processed so is not reliable as a means of passing the context onto subsequent events that form part of that same "request". BUT - it would be better in that it at least keeps the OTEL context during that event as opposed to the current implementation that does not keep the context at all.

cescoffier commented 11 months ago

Adding a thread local to the event loop has terrible consequences. Concurrent but unrelated requests can share the same event loop. So, your thread local data is going to be corrupted.