quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0
13.83k stars 2.7k forks source link

Asynchronous CDI events do not propagate the opentelemetry context correctly #44585

Closed Malandril closed 2 hours ago

Malandril commented 2 days ago

Describe the bug

When using asynchronous CDI events, the opentelemtry contextis not correctly propagated to the async observer. If the request triggering the event finishes before the observer, the observer loses its trace and span.

Expected behavior

The @ObservesAsync method should have the same opentelemetry trace as the one who fired the event until the end.

Actual behavior

The @ObservesAsync method can "lose" the opentelemetry context.

How to Reproduce?

Reproducer here: https://github.com/Malandril/quark-otel-cdi

The reproducer demonstrates the loss of the traceId in the logs after the Thread.sleep(1000) . Image

Output of uname -a or ver

No response

Output of java -version

21

Quarkus version or git rev

3.16.3

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

quarkus-bot[bot] commented 2 days ago

/cc @Ladicek (arc), @brunobat (opentelemetry), @manovotn (arc), @mkouba (arc), @radcortez (opentelemetry)

mkouba commented 2 days ago

I cannot say what the correct behavior is but ArC is definitely not instrumented with OpenTelemetry tracing; see also the list of instrumented extensions: https://quarkus.io/guides/opentelemetry-tracing#quarkus-extensions-using-opentelemetry. So this is not a bug but more like a feature request.

Ladicek commented 2 days ago

Well, as the image of the log shows, the trace context is propagated correctly. It's just that it is ended before the asynchronous observer completes, which should be expected -- it's the end of the HTTP request that ends the trace, and there's no coordination facility for asynchronous observers (to be able to wait until all of them have finished). This is a classic context propagation problem -- when you fire and forget an action to which the context is propagated, how does that affect the ending of the context?

manovotn commented 1 day ago

Well, as the image of the log shows, the trace context is propagated correctly. It's just that it is ended before the asynchronous observer completes, which should be expected -- it's the end of the HTTP request that ends the trace, and there's no coordination facility for asynchronous observers (to be able to wait until all of them have finished). This is a classic context propagation problem -- when you fire and forget an action to which the context is propagated, how does that affect the ending of the context?

Ladislav is right on the money.

If you need the context state preserved for the entire duration of observer method execution, then using @Observes instead of @ObservesAsync should do the trick.

brunobat commented 1 day ago

We should not wait for independent async tasks that are spawn from the HTTP request. Not sure what work is being done, but if the task is independent, they must start their own trace and have it linked to the parent HTTP span, not a parent/child relationship. If the task is not independent and is needed for the request, @manovotn suggestion can do the trick. In any case, I suggest closing this issue.

mkouba commented 1 day ago

Well, as the image of the log shows, the trace context is propagated correctly.

No, it is not propagated by Quarkus. In the reproducer, the io.opentelemetry.context.Context is part of the event payload. So it's more like a "manual propagation". But I agree that it's expected that it could be ended before the CDI async observer completes.

BTW I'm not quite sure it is safe to use the io.opentelemetry.context.Context on a different thread.

Ladicek commented 1 day ago

No, it is not propagated by Quarkus. In the reproducer, the io.opentelemetry.context.Context is part of the event payload.

Ah, OK, I didn't really look into the code :-) And looking at this issue, that seems correct.

Also, agree with Bruno that parent/child relationship is most likely incorrect in this case.

BTW I'm not quite sure it is safe to use the io.opentelemetry.context.Context on a different thread.

It should be, based on the javadoc:

A context propagation mechanism which can carry scoped-values across API boundaries and between threads. ... Context objects are immutable ...

Malandril commented 1 day ago

Ok thanks for the replies, i think it should be documented then, that the @ObservesAsync does not correctly create Otel context as we encountered this in production a bit too late, as the simple cases where the async treatments where short we had the correct context.

mkouba commented 1 day ago

Ok thanks for the replies, i think it should be documented then, that the @ObservesAsync does not correctly create Otel context as we encountered this in production a bit too late, as the simple cases where the async treatments where short we had the correct context.

Well, as I already mentioned it's documented in the OTel docs, more precisely it's not listed in the supported extensions/functionalities. And that's IMO the right place 🤷 .

brunobat commented 1 day ago

Also, part of the openTelemetry specification, async tasks out of the main trace must specify links: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/overview.md#links-between-spans

The new Trace and the link to the parent span need to be specified manually in the user code.

geoand commented 2 hours ago

Closing this as per the excellent replies already provided