open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.71k stars 887 forks source link

Discuss how to handle multiple agents trying to modify the tracecontext headers #1820

Closed kamtschatka closed 3 months ago

kamtschatka commented 3 years ago

What are you trying to achieve?

Propagation of TraceContext Headers should be working properly in case 2 wrappers are available on a webpage.

Additional context.

I have posted this already in the Trace Context issue tracker (https://github.com/w3c/trace-context/issues/451) and the OTEL JS issue tracker (https://github.com/open-telemetry/opentelemetry-js/issues/2295) but I was sent here, because apparently this is a spec issue.

Here is the description once again:

2 agents are on the page(OT and NewRelic) OT calculates the trace-parent header and a CRC value NewRelic modifies the trace-parent header, causing the CRC check to fail The problem here seems to be NewRelic, but if you see it more general, the same thing can happen to OT, because OT JS is also not considering any already existing trace-parent headers and just sets a new one, leading to the same outcome. With the current state of any implementation on the markt, I am guessing this is the same.

As we are also concerned from our side about this scenario, I would like to trigger a discussion here, because every Vendor would need to implement a solution for this to make it truly interoperable. My suggestion would be that the trace-parent header should be reused in case an XHR/fetch wrapper encounters this header instead of overriding it. The vendors themselves are then responsible to implement their own way of dealing with these cases. My idea would be to just store the trace-parent header in your internal datastructure and use it as an additional information to find the server side path for it.

I have now focused on the trace-parent header as this is not designed to have multiple values, but the tracestate header is also affected to some extent.

I'd be interested in your thoughts and ideas on how to solve this problem.

Oberon00 commented 3 years ago

I have now focused on the trace-parent header as this is not designed to have multiple values, but the tracestate header is also affected to some extent.

The tracestate header is designed to be unambiguously matched to a traceparent header, so I would say it is affected just as much (for example, it might not be possible to interpret some tracestate entry without knowing the trace ID from the traceparent).

Oberon00 commented 3 years ago

The solution used previously was that each vendor uses their own header name. The problem of overriding each other came to be exactly because we standardized on a single header name. I think one solution may be to have a unique header prefix per "participant"/"tenant". E.g. you could set OTEL_CONTEXT_NAMESPACE=abc123 and it would then use traceparent-abc123 and tracestate-abc123 as header name.

I think this solution is better than trying to munch together multiple traces into a single traceparent header (for example Vendor 1 might already have a different trace ID from some incoming legacy tracing header, while OTel, having not interpreted that header, has chosen a different one; now it's impossible that they both use the same traceparent header).

kamtschatka commented 3 years ago

I mean sure that would be an option, but then the question is what is the point of having a standard if nobody is using the standard? Or if the standard is only followed until a problem is discovered. A big advantage in the future (hopefully) is, that more and more vendors will allow those trace context related headers through firewalls, reducing the very common problem of having headers removed, which will result in broken traces.

According to the spec, the headers MUST be passed on and they MAY be modified (https://www.w3.org/TR/trace-context/#mutating-the-traceparent-field) Right now what happens is the "restart trace" action, but if I understand it correctly, this is reserved for front gates of secure networks, so that doesn't really apply to OTEL.

dyladan commented 3 years ago

I agree with @kamtschatka whatever we decide has to be standard compliant. I think the only way to solve this is to have some sort of global vendor API so that agents can detect and communicate with each other. This would likely involve collaboration and code changes in otel and in the vendor agents.

Oberon00 commented 3 years ago

In the end, if you can avoid it, you should not have multiple agents trying to do the same thing in the same process (same webpage).

kamtschatka commented 3 years ago

@dyladan How did you envision that global API? Would the API just make the traceparent contexts available while inside of a wrapped XHR function? I am concerned that this would make us dependent on finding an agreement on yet another API, when the XHR object already offers the setRequestHeader function where someone can set those headers for a specific XHR.

@Oberon00 That would be the ideal scenario, but this is not going to happen ;-) We regularly see customers who have more than 1 tracing vendor on their page. It also goes against the whole idea of the TraceContext standard that is designed to make multiple vendors work together.

Oberon00 commented 3 years ago

It also goes against the whole idea of the TraceContext standard that is designed to make multiple vendors work together.

My impression is rather that tracecontext makes multiple tracing vendors working together harder, but it makes the job easier for cloud, proxy, network software vendors who have a standardized header name they can propagate from incoming to outgoing requests.

kamtschatka commented 7 months ago

Implementing the W3C trace context spec is not hard if there is only 1 vendor on the page. Setting the headers on XHR and fetch requests can be done easily.

It gets complicated when multiple vendors are on the page, wrapping the same XHR/fetch functions:

In theory it is possible to change the order of the scripts on the page to favor Vendor1 over Vendor2, but that just moves the problem to another vendor. It would in theory also be possible to not support this scenario, but from experience, it is very likely that 2 Vendors are used to monitor a web application and customers are not willing to remove one or the other.

So ideally there is a way to make all Vendors work together properly, to ensure proper correlation between client side and server side.

There are 2 options to make that happen:

Updating the internal state with a new traceid/spanid might not be possible or not desirable though, as it might have been propagated/referenced by a span already, which would break the references.

XMLHttpRequest

For monitoring the XHR object, we have so far seen 3 approaches:

One of the big limitations with the XMLHttpRequest objects is that it only has a setRequestHeader function, but no way to get the headers out again. The Vendor closest to the original XHR object has ultimately full control over which headers get set(or if there is no special implementation, setRequestHeader will be called multiple times with the same header name, leading to concatenation), the other Vendors have to blindly trust that the header gets actually set. This is not very reliable though (and more and more Vendors might be tempted to prevent setting the traceparent header twice to prevent completely broken correlation)

To solve the issue we came up with a few options:

Fetch

For fetch calls it is a lot easier, because headers are passed to the fetch function in an object.

This leaves the following options:

Potential issues:

This is in no way a complete list of issues/advantages/disadvantages, but should be a good basis for further discussions between Vendors, to find a solution where everyone can benefit in multi-vendor scenarios, instead of starting a fight over injection order and incompatibilities.

Flarna commented 7 months ago

I fear that agreeing on traceId/spanId across different vendors/tools wont work in general. One tool might group two requests on one trace wheres the other prefers separate traces. As a result traceId must be the same for the one but different for the other.

What would be needed is a multi tenant propagation format but that's not the case for W3C traceparent. And the initial post indicates that w3c doesn't seem to adapt this as they see this as a problem to solve of the users.

W3C tracestate on the other hand is multi tenant capable.

But also for tracestate the current implementation of OTel propagators is problematic because they just write the value from OTel point of view. If another tool has written tracestate already it's overwritten. The propagators inject API doesn't provide read access to the carrier therefore Propagators can't merge even if they would like to do.

austinlparker commented 3 months ago

After discussing this issue, we will be closing it as out of scope. Fundamentally, our preference would be for vendors to implement options that allow them to 'play nice' with OpenTelemetry.

kamtschatka commented 3 months ago

Wouldn't everybody like that? Shouldn't there at least be some guidance on how that could look like?

Flarna commented 3 months ago

@austinlparker Could you please provide some more details from that discussion? I find it a bit unfriendly to expect everyone to act as a friendly neighbor without OTel even trying to do the same.

austinlparker commented 3 months ago

I can elaborate a bit more on our rationale -

It would be a pretty big lift to try and define spec-level affordances for things like this given the differences from language to language in how instrumentation hooks function.

If there's sufficient demand from users then we can of course re-evaluate this issue, but for now, we believe it's out of scope. If you do believe it's germane, I would suggest creating a new issue and bringing it to the specification SIG to present a proposal.

trask commented 3 months ago

Here's the OpenTelemetry Java agent's position on Running agent alongside other java agents - DataDog, AppDynamics, NewRelic...

this is not something we plan to officially test/support, but we will accept PRs to make it run better alongside other agents.

(https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/1534#issuecomment-812939452)

kamtschatka commented 3 months ago

Truly a shame, as this is a missed opportunity to bring different people from different vendors together to work on a solution together... I am pretty sure vendors will not make sure they work with every other vendor our there and since we don't agree on a way to solve this for everyone, everyone will have their own solution that is easy to implement and sure to work for them. Our only choice will be to block all the traceparent headers someone down the line tries to set, to avoid broken contexts for us (if our script is the first one on the page) and we'll defer blame to anybody up the chain for breaking the traceparent header and setting an additional value, as there is no other way now.

trask commented 3 months ago

@kamtschatka this issue and discussions are very JavaScript specific, so it's hard for me to understand what specifically you want to change in the specification itself.

maybe it would help to work in the JavaScript SIG to explore solutions and come back to the specification once there's a better understanding of what specification changes are needed?