Open Orbifoldt opened 1 year ago
/cc @Sgitario (rest-client), @brunobat (opentelemetry), @cescoffier (rest-client), @geoand (rest-client), @radcortez (opentelemetry)
cc @jamezp
This instrumentation is quite old. There might be something going on in the propagation of `REST_CLIENT_OTEL_SPAN_CLIENT_CONTEXT`.
It definitely looks like something is likely not being propagated. I'm not too sure how we'd figure that out without a reproducer. Or I should say maybe I'm not sure how I would :)
Thanks for your replies @brunobat and @jamezp. I'll try working on a minimal reproducer.
In the meantime, could you maybe point me to where the `REST_CLIENT_OTEL_SPAN_CLIENT_CONTEXT` should be initialized? Then I can attach a debugger and try to find where it goes wrong.
@Orbifoldt Awesome, thanks!
It looks like in the `ClientRequestFilter` portion https://github.com/quarkusio/quarkus/blob/8f0e94e0238ba4ee75d2cbaada1c72be01edac43/extensions/opentelemetry/runtime/src/main/java/io/quarkus/opentelemetry/runtime/tracing/intrumentation/restclient/OpenTelemetryClientFilter.java#L99.
The `spanContext` is not properly initialized; it's already `null` when it's added as a property to the request. This seems to originate from persisting the span into the OpenTelemetry `Context`.
In debugging I didn't get very far in finding the root cause of this. But maybe this means something to someone, so this is what I found:
- `spanContext` is being created here: https://github.com/quarkusio/quarkus/blob/8f0e94e0238ba4ee75d2cbaada1c72be01edac43/extensions/opentelemetry/runtime/src/main/java/io/quarkus/opentelemetry/runtime/tracing/intrumentation/restclient/OpenTelemetryClientFilter.java#L91
- This calls the `Instrumenter`'s `start` method, which simply calls the `doStart` method: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/dc5d76af68299c1cf517360f47125bce4ca6467d/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/instrumenter/Instrumenter.java#L162
- In `doStart` a new `Span` is created correctly, so this works fine.
- Then the `spanSuppressor` is called to store this span into the context: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/dc5d76af68299c1cf517360f47125bce4ca6467d/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/instrumenter/Instrumenter.java#L208
- This is the `DelegateBySpanKind` span suppressor's `storeInContext` method: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/dc5d76af68299c1cf517360f47125bce4ca6467d/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/instrumenter/SpanSuppressors.java#L73
- Here, `spanKeys` contains a single `SpanKey`, namely `"opentelemetry-traces-span-key-http-client"`. Then this `storeInContext` method of the `SpanKey` is called: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/dc5d76af68299c1cf517360f47125bce4ca6467d/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/SpanKey.java#L80
- This returns a `null` Context. So this is where it goes wrong. However, here the trail goes cold :( It invokes some shaded OpenTelemetry package, and I am unable to see from my remote debugger why `null` is returned.
@jamezp I've set up a reproducer here: https://github.com/Orbifoldt/quarkus-otel-azure-npe-reproduction Seems like you don't even need a valid Application Insights resource to have this error happen.
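For readers following the debugging notes above, here is a minimal Java sketch of how a single `null` returned deep in the chain would surface as a `null` result of `start`. It is only an illustration under stated assumptions: it does not use the real OpenTelemetry classes, contexts and spans are modelled as plain strings, and `FakeSpanKey`, `storeInContext` and `start` here are hypothetical stand-ins that only mirror the shape of the calls described in the trace.

```java
import java.util.List;

public class NullContextPropagationSketch {

    // Hypothetical stand-in for a span key: stores a span into a context
    // and returns the resulting context (modelled as a String here).
    interface FakeSpanKey {
        String storeInContext(String context, String span);
    }

    // Stand-in for a suppressor that delegates to each of its span keys in turn.
    // Whatever the last key returns becomes the context handed back to the caller.
    static String storeInContext(List<FakeSpanKey> spanKeys, String context, String span) {
        for (FakeSpanKey key : spanKeys) {
            context = key.storeInContext(context, span);
        }
        return context;
    }

    // Stand-in for start(): the span itself is created fine,
    // then it is stored into the parent context via the span keys.
    static String start(List<FakeSpanKey> spanKeys, String parentContext) {
        String span = "span-1";
        return storeInContext(spanKeys, parentContext, span);
    }

    public static void main(String[] args) {
        FakeSpanKey healthy = (ctx, span) -> ctx + "+" + span;
        FakeSpanKey broken = (ctx, span) -> null; // assumption: simulates the observed null

        System.out.println(start(List.of(healthy), "root")); // prints "root+span-1"
        System.out.println(start(List.of(broken), "root"));  // prints "null"
    }
}
```

The point of the sketch is only that nothing between the span key and the caller of `start` checks for `null`, so a single misbehaving key is enough to hand a `null` context to the client filter. Why the real `SpanKey` would return `null` is not explained by this sketch.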
We have the same issue with Azure Functions + JAX-RS client + OpenTelemetry. Any idea if this bug will be fixed in the near future?
It's scheduled for next quarter but is open for grabs.
@brunobat the issue is still open. The last quarter of 2023 is over. Is there an updated timeline?
Sorry, right now there is no estimate. Contributions are welcome!
- Here, `spanKeys` contains a single `SpanKey`, namely `"opentelemetry-traces-span-key-http-client"`. Then this `storeInContext` method of the `SpanKey` is called: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/dc5d76af68299c1cf517360f47125bce4ca6467d/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/SpanKey.java#L80
I'm encountering this same issue but with Ratpack. I've traced this down to something in the `SpanKey` AOP code, but since you can't debug AOP injections, I'm stuck there. I can see my debugger getting into this method and reaching at least `Bridging.toAgentOrNull`, but I can't definitively see how it exits from there.
Some additional info for me - I'm using Ratpack 1.9 and I have both the javaagent and the instrumentation library added - it seems this is needed for Ratpack, because you have to manually wire the HttpClient (cannot find any documentation specifying not to do this or even what the appropriate configuration is for this library).
If I remove the javaagent, then this error doesn't occur. If I remove the library, the error doesn't occur, but that's because the HttpClient instrumentation is not being done at all.
Describe the bug
The `OpenTelemetryClientFilter` is causing a `NullPointerException`. In particular, the `spanContext` as retrieved from the client request context is null: https://github.com/quarkusio/quarkus/blob/8f0e94e0238ba4ee75d2cbaada1c72be01edac43/extensions/opentelemetry/runtime/src/main/java/io/quarkus/opentelemetry/runtime/tracing/intrumentation/restclient/OpenTelemetryClientFilter.java#L121
This issue seems to occur only when using the MicroProfile Rest Client (resteasy classic) to invoke some other service. The server itself does not have this issue when it is called, nor do the auto-instrumented SDKs that we use.
I am using the Azure Application Insights Java Agent and the Quarkus OpenTelemetry extension. Project is written in Kotlin. I was using a customized propagator, but also with the default ones this error occurs. This issue started occurring after we upgraded from Quarkus 2.16 to 3.2.
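To make the failure mode described above concrete, here is a minimal, self-contained Java sketch. It does not use the real Quarkus, JAX-RS or OpenTelemetry classes: `FakeRequest`, `CONTEXT_PROPERTY`, `onRequest`, `onResponseUnguarded` and `onResponseGuarded` are hypothetical names, and the context is modelled as a plain string. The guarded variant is only an illustration of where the dereference happens; it is not the project's fix and would hide the symptom rather than address why the context is `null`.

```java
import java.util.HashMap;
import java.util.Map;

public class ClientFilterNullSketch {

    // Hypothetical property name; the real constant's value is not shown in this issue.
    static final String CONTEXT_PROPERTY = "hypothetical.span.client.context";

    // Hypothetical stand-in for the per-request property bag of a client request.
    static final class FakeRequest {
        private final Map<String, Object> properties = new HashMap<>();

        void setProperty(String name, Object value) {
            properties.put(name, value);
        }

        Object getProperty(String name) {
            return properties.get(name);
        }
    }

    // Request phase: the span context is stored as a request property, even if it is null.
    static void onRequest(FakeRequest request, String spanContext) {
        request.setProperty(CONTEXT_PROPERTY, spanContext);
    }

    // Response phase without a null check: dereferencing the property throws.
    static int onResponseUnguarded(FakeRequest request) {
        String spanContext = (String) request.getProperty(CONTEXT_PROPERTY);
        return spanContext.length(); // NullPointerException when the property is null
    }

    // Response phase with a null check: reports whether a context was available.
    static boolean onResponseGuarded(FakeRequest request) {
        String spanContext = (String) request.getProperty(CONTEXT_PROPERTY);
        return spanContext != null;
    }

    public static void main(String[] args) {
        FakeRequest request = new FakeRequest();
        onRequest(request, null); // assumption: simulates the null context seen in the debugger

        try {
            onResponseUnguarded(request);
        } catch (NullPointerException e) {
            System.out.println("unguarded response filter threw NullPointerException");
        }
        System.out.println("guarded response filter saw a context: " + onResponseGuarded(request));
    }
}
```

Under these assumptions the request phase succeeds silently and the exception only appears later, in the response phase, which matches the observation that the failure is reported at the point where the property is read back.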
Full stack trace:
Expected behavior
Rest client is instrumented and does not throw exceptions
Actual behavior
NullPointerException is thrown
How to Reproduce?
I haven't been able to reproduce this locally using the Azure Application Insights agent; some network/proxy settings seem to prevent me from getting it to work. Using the jaeger-all-in-one Docker container (as described in the documentation) locally did not produce the above error.
Output of `uname -a` or `ver`
Linux 5.4.0-1111-azure #117~18.04.1-Ubuntu SMP Wed Jun 21 15:44:28 UTC 2023 x86_64 Linux
Output of `java -version`
openjdk version "17.0.8" 2023-07-18 LTS
GraalVM version (if different from Java)
n/a
Quarkus version or git rev
3.2.1.Final
Build tool (ie. output of `mvnw --version` or `gradlew --version`)
No response
Additional information
No response