Client libraries: Manual/programmatic and auto instrumentation duplication problem

lmolkova commented 3 years ago

What are you trying to achieve?

I'm working on Azure SDK instrumentation and we have a problem with double instrumentation on the HTTP layer. Here's the context:

we want to support users who auto-instrument (with bytecode agent, monkey-patching, diagnostic sources) and those who don't - it's not in our control.
we instrument public API calls like 'upload blob' (so users know what happens on the public surface of the library and can map spans to their code). Public API calls translate into a number of REST calls (retries, auth, complex operations) and underlying HTTP calls give users insights into what happens in the SDK too.
we want to instrument HTTP layer so we can
- add extra properties that are specific to Azure and provide a better experience to users (e.g. users need them to get official support). We occasionally want to strip some sensitive parameters. All those are per-http request and have no place on the parent logical span.
- support legacy context propagation protocols with Azure services
- we can't rely on the auto-instrumentation (even if it's on) because we can't augment the span that will be created with auto-instrumentation on a lower API level with all the details that are important.

So users who don't use auto-instrumentation have spans representing HTTP calls from our client libraries, users who use auto-instrumentation, would get two spans for the same HTTP call.

Generalizing this problem beyond Azure SDKs:

TL;DR: - Client libraries cannot trace common protocol-level calls if those could also be traced by auto-instrumentation. - If they do (when auto-instrumentation is on), duplicate spans will be created and context propagation will be broken. - Libraries could make their programmatic tracing configurable, but then they cannot inject service-specific data into auto-instrumented RPC-call span

Potential solutions

Backoff if the context is injected. If the context is already injected on the request (HTTP/gRPC/anything else), there is no other good option than to back off. Options are:
- re-instrument, i.e. create a new span and inject header (replace or add another value)? Then the logic that injected the header (and created a previous span) will be broken. There is no way to suppress that span.
- re-instrument, but not inject header? Then this instrumentation layer is broken - there is no reason to export this span
- don't instrument: ok, someone above already instrumented this request and perhaps created a span, nothing else to do. It seems nothing is broken and we didn't even create a span

This approach also means that the user's manual instrumentation always wins, which seems like a good default to have in terms of supportability. This is really short-term mitigation for a subset of double-instrumentation problems.

A similar contract might exist for server-side auto-instrumentation: if there is a current span already - back-off, but I can imagine it can go sideways if requests start from a thread that has span (e.g. created in start-up) by mistake.

Long-term solution: some level of coordination between layers of instrumentation. Some discussion on it #1747.
- Down-to-earth approach we discussed a while back in the OpenCensus community was Terminal Context (perhaps terminology is outdated, happy to update if there is an interest). This only works for client-side instrumentation.
- anything else?

Existing discussions

https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/903
1747
530
1653

Suppressing instrumentation implementations in OTel

Please note that context suppression is not really possible in client libraries as it requires dependency on instrumentation packages (to export suppress key). So this is not a viable workaround.

Happy to hear suggestions and drive it.

jkwatson commented 3 years ago

re-instrument, i.e. create a new span and inject header (replace or add another value)? Then the logic that injected the header (and created a previous span) will be broken. There is no way to suppress that span.

Why does this break the other span? I would assume that the newly created span will be the child of the one that first tried the injection, and so replacing the outgoing headers seems safe to me. Am I missing something?

lmolkova commented 3 years ago

Good point. It will be child only if the previous span was current, but there is no requirement to make it current and there is no reliable mechanism for context propagation that would work everywhere.

Also, it does not seem beneficial: it seems that if something has added headers, re-doing it is likely redundant.

lmolkova commented 3 years ago

Another issue is increased costs, users pay for telemetry either to vendors or through operational costs. If we agree this case is legit, we're talking about a significant increase in cost.

jkwatson commented 3 years ago

FYI, we (Splunk) have customers who are demanding the multiple layers like this all be represented, so there probably needs to be configurability around how this is handled.

lmolkova commented 3 years ago

@jkwatson Do multiple layers inject headers? If not - there could be any number of layers, just the one that injects headers wins. Also, behavior is currently language-specific (e.g. .NET instrumentation backs-off, Java agent replaces header, not sure about others), so does it work now?

jkwatson commented 3 years ago

@jkwatson Do multiple layers inject headers? If not - there could be any number of layers, just the one that injects headers wins. Also, behavior is currently language-specific (e.g. .NET instrumentation backs-off, Java agent replaces header, not sure about others), so does it work now?

There's less of an issue right now with double-injecting headers (the "Setter" spec does imply that you overwrite the headers, so that's really not an issue, I don't think). The issue is really that the javaagent has a policy of enforcing only a single "CLIENT" span, which means that multiple layers are being suppressed by the javaagent instrumentation currently. However, there is now an open discussion to deal with this and optionally allow nested CLIENT spans to proceed: https://github.com/open-telemetry/opentelemetry-java-instrumentation/discussions/3318

It would be good to have this be a part of the spec, though, and not just something pushed because one vendor has customers that want it. :)

kenfinnigan commented 3 years ago

I came across this very issue today in working on tracing in Quarkus.

I had the Vert.x Tracing SPI creating a HTTP CLIENT span after a JAX-RS ClientRequestFilter had already created an HTTP CLIENT span. It meant I had one CLIENT span with a child of the SERVER span it called, and another CLIENT span completely separate.

I ended up checking for traceparent in the outgoing request headers inside the Vert.x Tracing SPI and didn't create a new span if it was set.

Would be good to have a defined approach to how to know a duplicate span would be created if a piece of code created a new span.

lmolkova commented 3 years ago

More on this in #1811: backing off is not a great strategy because of potential reuse of http request instances https://github.com/open-telemetry/opentelemetry-specification/issues/1811#issuecomment-878692618.

Additional context. It not very common to reuse HTTP requests between retries, but e.g. golang http client allows that and there are likey other clients/cases where it happens.

[just as a tangent: I was surprised to find recently in https://github.com/[open-telemetry/opentelemetry-java-instrumentation/pull/2727](https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/2727) that reusing HTTP requests is supported by most of the 20+ Java http client libraries that we instrument]

I agree that we need a way to identify if we are in a nested CLIENT span, for a couple of reasons:

to avoid overwriting traceparent from the outer CLIENT span

to conditionally suppress nested CLIENT spans (based on some kind of user verbosity preference)

It sounds like there are at least a couple of options to accomplish this:

use the presence of traceparent to know that we are in a nested CLIENT span (this requires clearing traceparent at the end of each request if http request reuse is supported by the instrumented library)

use some kind of marker in the Context to know that we are in a nested CLIENT span (this requires setting the marker in the context and propagating it to any downstream libraries that may also be instrumented)

Maybe we can focus initially on

Getting agreement that we really do need a way to identify if we are in a nested CLIENT span

Listing options to accomplish that

lmolkova commented 3 years ago

We need simple way to prevent multiple levels of HTTP instrumentation (duplicates from client libraries/users AND auto-instrumentation). Initial discussion on Java instrumentation SIG 2021/7/15 (UTC) https://docs.google.com/document/d/1WK9h4p55p8ZjPkxO75-ApI9m0wfea6ENZmMoFRvXSCw/edit

open-telemetry / opentelemetry-specification