quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0

OpenTelemetry Scope not closed correctly #25102

Open fabrii opened 2 years ago

fabrii commented 2 years ago

Describe the bug

This issue originated in the following discussion: https://github.com/quarkusio/quarkus/discussions/23882. I am opening it here to keep track of it.

When using Baggage propagation, the Scope is not closed correctly.

Baggage.current()
            .toBuilder()
            .put("system", "systemX")
            .build()
            .makeCurrent();

In the log, we can see the following warning:

WARN [io.qua.ope.run.QuarkusContextStorage] (vert.x-eventloop-thread-7) Context in storage not the expected context, Scope.close was not called correctly.
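
For reference, makeCurrent() returns a Scope that this warning is about, and the OpenTelemetry API expects it to be closed on the same thread, typically with try-with-resources. A minimal sketch of that close-the-scope idiom (not the reproducer itself) would be:

import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.context.Scope;

// Sketch only: close the Scope returned by makeCurrent() on the same thread,
// otherwise the context storage is left holding an unexpected context.
try (Scope scope = Baggage.current()
        .toBuilder()
        .put("system", "systemX")
        .build()
        .makeCurrent()) {
    // code that should see the "system" baggage entry
}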

Expected behavior

The warning should not be shown. If it is harmless, a workaround could be to silence that log category via quarkus.log.category.
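
For example, assuming the abbreviated logger name expands to io.quarkus.opentelemetry.runtime.QuarkusContextStorage, the category could be raised to ERROR in application.properties along these lines (untested sketch):

# Hypothetical workaround: hide the warning by raising the category's minimum level.
quarkus.log.category."io.quarkus.opentelemetry.runtime.QuarkusContextStorage".level=ERROR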

Actual behavior

No response

How to Reproduce?

Reproducer: https://github.com/fabrii/quarkus-playground/tree/opentelemetry

Output of uname -a or ver

No response

Output of java -version

No response

GraalVM version (if different from Java)

No response

Quarkus version or git rev

2.7.1

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

quarkus-bot[bot] commented 2 years ago

/cc @radcortez

geoand commented 1 year ago

Is this still an issue or has it been addressed by the various fixes that have occurred over time?

mihaipoenaru commented 1 year ago

@geoand Google brought me to this issue because I got the same warning as above. I'm not sure if it's relevant or not.

I made a small reproducer to force the Quarkus instance to use one thread, and then I triggered a context switch for that thread to see if the traceID propagates correctly:

@Path("/hello")
class ExampleResource {

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    suspend fun hello(): String {
        println("${LocalTime.now()} ${Thread.currentThread().name}: ${Span.current().spanContext.traceId}")
        delay(3000)
        println("${LocalTime.now()} ${Thread.currentThread().name}: ${Span.current().spanContext.traceId}")
        return "hello"
    }
}

This produces the following, after I called it twice in quick succession:

17:41:21.171259400 vert.x-eventloop-thread-0: 77f541d67a9a3bda6e1b96ec97ff805b
17:41:21.698196200 vert.x-eventloop-thread-0: 330def074a9c74332334f91283c94f1a
17:41:24.184476200 vert.x-eventloop-thread-0: 77f541d67a9a3bda6e1b96ec97ff805b
2022-10-27 17:41:24,234 WARN  [io.qua.ope.run.QuarkusContextStorage] (vert.x-eventloop-thread-0) Context in storage not the expected context, Scope.close was not called correctly
17:41:24.715261 vert.x-eventloop-thread-0: 330def074a9c74332334f91283c94f1a

Fortunately, the context trace id is switched appropriately. I had some issues on my project because the old, "soon-to-be-deprecated" OpenTracing failed to do the context switch and would always keep the latest traceId.

I'm not sure if this helps with the issue at hand.

brunobat commented 1 year ago

This last issue is related to the OTel context save and restore in the context storage; it is not specific to Baggage. As mentioned here: https://github.com/quarkusio/quarkus/discussions/23882#discussioncomment-2593231, it should be harmless.

geoand commented 1 year ago

Is there anything we need to do for this one or can it be closed?

famod commented 11 months ago

So I'm seeing this during a UI regression test run with Quarkus 3.3.2, even after I managed to get rid of the OTLP export timeout issues (see #35686). I don't see any other exceptions or anything similar. Any notes on debugging this?

brunobat commented 10 months ago

What were you expecting to see, @famod? This message? -> WARN [io.qua.ope.run.QuarkusContextStorage] (vert.x-eventloop-thread-7) Context in storage not the expected context, Scope.close was not called correctly.

famod commented 10 months ago

@brunobat I am seeing this warning multiple times but I don't see any hints of exceptions. I'm referring to what @radcortez said in the discussion:

The OpenTelemetry instrumentation is provided by filters, but certain exceptions that break the execution flow may happen outside filters, which may cause the OpenTelemetry Scope not to be properly closed at the right time (hence the warning).

So I basically want to get rid of those warnings.
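
To illustrate the failure mode @radcortez describes, here is a simplified sketch in plain JAX-RS (hypothetical class and property names, not the actual Quarkus instrumentation): the scope is opened in a request filter and closed in a matching response filter, so anything that breaks the flow before the response filter runs leaves the scope open.

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Scope;
import jakarta.ws.rs.container.ContainerRequestContext;
import jakarta.ws.rs.container.ContainerRequestFilter;
import jakarta.ws.rs.container.ContainerResponseContext;
import jakarta.ws.rs.container.ContainerResponseFilter;
import jakarta.ws.rs.ext.Provider;

// Simplified illustration only -- not how Quarkus actually wires this up.
@Provider
public class TracingScopeFilters implements ContainerRequestFilter, ContainerResponseFilter {

    @Override
    public void filter(ContainerRequestContext request) {
        // Activate the current span's context when the request enters the filter chain.
        Scope scope = Span.current().makeCurrent();
        request.setProperty("otel.scope", scope);
    }

    @Override
    public void filter(ContainerRequestContext request, ContainerResponseContext response) {
        // If an exception short-circuits processing before this filter runs,
        // the scope is never closed and the context storage later warns about it.
        Scope scope = (Scope) request.getProperty("otel.scope");
        if (scope != null) {
            scope.close();
        }
    }
}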

brunobat commented 10 months ago

We should create a set of tests that replicate the scope-closing issue and go from there... There is also some other work needed, like propagating the scope along with the span and making sure it is closed with the span.

famod commented 10 months ago

In the meantime, would it make sense to add more info to that warning? Something to correlate which context wasn't closed correctly?

brunobat commented 10 months ago

yes, @famod. Will do

brunobat commented 10 months ago

Here you have it @famod ! https://github.com/quarkusio/quarkus/pull/35886

famod commented 10 months ago

Thanks @brunobat!

I started looking at the now-extended warning messages, and so far all the logged "previous" trace IDs point to perfectly valid, non-erroneous REST calls. So for now, I don't see what the problem is.

brunobat commented 10 months ago

Do you have an idea of the percentage of spans that get the warning? Can it be traced back to a particular code flow, or is it random?

kdubb commented 8 months ago

@brunobat I see these randomly in our tests using 3.5.3. I've just cleaned our build to filter all expected error and warning logs and these are left behind; unfortunately I can't say when they began.

They seem to be very random, with hundreds of @QuarkusTest classes and many thousands of tests... I don't even see them in every complete build.

Here's an example:

Context in storage not the expected context, Scope.close was not called correctly. Details: OTel context before: {spanId=8695a6782afb78ff, traceId=d22e83c1ed0f9f64abd2baf22d449508, sampled=true}. OTel context toAttach: {spanId=9e70ae76f4e97f48, traceId=d22e83c1ed0f9f64abd2baf22d449508, sampled=true, parentId=8695a6782afb78ff}

nicolasard commented 7 months ago

I have the same issue. I found out that I get these errors when the client aborts the connection after sending the request.

computerlove commented 7 months ago

A reproducer I created for another issue reproduces this issue.

brunobat commented 7 months ago

Thanks @computerlove, will take a look.

brunobat commented 5 months ago

The fix for this issue will take a while due to bandwidth. In the meantime, we will lower that log from warn to info because it's causing issues in prod systems.

geoand commented 5 months ago

I did run into this while trying to reproduce #35686

brunobat commented 5 months ago

This happens when multiple spans are open and they are not closed in the right order. It requires an overhaul of the way we handle span scopes and will be addressed when we introduce the Observation API, @geoand.
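
As a plain-OpenTelemetry sketch of that out-of-order close (a hypothetical standalone example, outside Quarkus), closing the outer scope while the inner one is still current leaves the context storage pointing at a context it does not expect:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class OutOfOrderScopeClose {
    public static void main(String[] args) {
        Tracer tracer = GlobalOpenTelemetry.getTracer("example");

        Span outer = tracer.spanBuilder("outer").startSpan();
        Scope outerScope = outer.makeCurrent();

        Span inner = tracer.spanBuilder("inner").startSpan();
        Scope innerScope = inner.makeCurrent();

        // Wrong order: when the outer scope is closed, the storage still holds the
        // inner context, which is exactly the kind of mismatch the warning reports.
        outerScope.close();
        innerScope.close();

        inner.end();
        outer.end();
    }
}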

geoand commented 5 months ago

Understood, thanks.

AlanMasciangelo commented 1 month ago

Possibly another useful data point: I consistently see this after Agroal does some work. The logs show execution jumping from executor-thread-n to agroal-n and back, and the warning appears immediately afterwards, I assume when the non-blocking query returns.
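
For background on such thread hops (a generic sketch with made-up names, not the Agroal integration itself): the OTel context has to be explicitly carried over to the other thread, typically by wrapping the task, and a missed or re-ordered attach/detach during such a hop is the kind of mismatch this warning reports.

import io.opentelemetry.context.Context;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextHopSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Capture the caller's OTel context and re-activate it on the worker thread.
        Runnable task = Context.current().wrap(
                () -> System.out.println("running with the caller's OTel context"));

        pool.submit(task).get();
        pool.shutdown();
    }
}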