w3c / trace-context

Trace Context
https://w3c.github.io/trace-context/

Revisit header name -- Server-Timing vs. traceresponse #556

Open jpkrohling opened 8 months ago

jpkrohling commented 8 months ago

Related to #69, I would like to reopen the discussion around the header name for the client propagation of server tracing information.

The current state of the art among practitioners is to use the Server-Timing header, which browsers already safelist today. A new header, such as traceresponse, would take a lot of effort to get added to those safelists, and a long time to become ubiquitous among client devices.

The linked issue was closed with a note that the group had decided against using Server-Timing, but without giving a reason. As I mentioned on that issue, judging from the minutes, my guess is that the reason is related to this comment:

Yoav: This should be opt in, with the bare minimum number of resources that you need.

If that's the concern, isn't the server side already opting in by adding the response metric to this header? Like:

Server-Timing: traceresponse;desc=00-{trace-id}-{child-id}-01

If there's no other reason, I would like to propose a change to the current draft, so that the traceresponse isn't a header, but a metric of the Server-Timing header. This way, we can co-exist with other competing standards and offer a lower-friction migration path to people using Server-Timing today.
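To make the proposal concrete, here is a minimal sketch of how a client might pull the traceresponse metric out of a Server-Timing header value. The function and type names are hypothetical, the parser is deliberately naive (it does not handle commas or semicolons inside quoted strings), and it assumes all four fields are present with traceparent-style lengths.

```typescript
// Parsed form of the value shown above: 00-{trace-id}-{child-id}-01.
interface TraceResponse {
  version: string;
  traceId: string;
  childId: string;
  flags: string;
}

// Assumption: all four fields are present and use traceparent-style lengths.
const TRACE_RESPONSE_RE =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Naive Server-Timing parser: splits metrics on "," and parameters on ";".
// Hypothetical helper, not part of any specification or library.
function parseTraceResponseMetric(serverTiming: string): TraceResponse | null {
  for (const metric of serverTiming.split(",")) {
    const parts = metric.split(";").map((part) => part.trim());
    const metricName = parts[0] ?? "";
    if (metricName !== "traceresponse") continue;

    for (const param of parts.slice(1)) {
      const eq = param.indexOf("=");
      if (eq < 0) continue;
      // Parameter names are compared case-insensitively.
      if (param.slice(0, eq).trim().toLowerCase() !== "desc") continue;

      // Accept both token and quoted-string forms of the value.
      const value = param.slice(eq + 1).trim().replace(/^"(.*)"$/, "$1");
      const m = TRACE_RESPONSE_RE.exec(value);
      if (m !== null) {
        const [, version = "", traceId = "", childId = "", flags = ""] = m;
        return { version, traceId, childId, flags };
      }
    }
  }
  return null;
}
```

Because the value travels as one metric among others, a header such as `db;dur=53, traceresponse;desc=00-...-01` would still yield the trace context, while existing Server-Timing metrics stay untouched.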

References:

basti1302 commented 8 months ago

Related: https://github.com/open-telemetry/opentelemetry-specification/issues/3811

dyladan commented 7 months ago

@jpkrohling thanks for bringing this back up. I've reached out to a few people who were involved in the initial discussion to see if they can provide more context as to why server-timing was rejected before, since nobody currently involved in this group was around at that time. We also discussed this in the meeting last week. Here is a quick summary of what we discussed:

Regarding the last 2 items, we intend to discuss these with the server-timing specification editors to see if they are actually a problem or not.

dyladan commented 7 months ago

Met with Yoav from the web performance group today about this. Notes from the meeting:

kalyanaj commented 7 months ago

Thanks @dyladan for the above notes and summary. As discussed in the W3C DT working group meeting, can we discuss the evaluation criteria (ideally, we should rank them in terms of the most important to least important) and then score these options for those criteria? Please let me know if you prefer a different approach.

Here's my initial attempt at the list of ranked criteria & how these two options (Update: adding a third option for discussion) meet those criteria. Please feel free to edit the below contents directly so that we can collaboratively close on this list:

Are we missing any other major criteria for decision making? Should we add the ones about the proxies/loadbalancers handling?

Here's an attempt at scoring these options for the above criteria.

| Criteria | Traceresponse header | Using server-timing header | USE BOTH!: Traceresponse header for the most part + use server-timing only for initial page load by browsers |
|---|---|---|---|
| Must be standards based | Yes (there's a path) | Yes (there's a path) | Yes (there's a path) |
| Trace context propagation from callees to callers | Yes | Yes | Yes |
| Supported by browsers | No (complex to gain adoption) | Yes | Yes |
| Supported for server-server | Yes | Yes | Yes |
| Must be extensible in the future | Yes | Yes | Yes |
| Semantically clean | Yes | No (arguable) | Yes |
| Reasonably simple to implement | Yes | Yes | TBD |

Thoughts? Please feel free to edit the list and table above directly.

kalyanaj commented 7 months ago

Also, I am looking to understand better:

Is it the file load scenario: where a call is made to download a file and a trace id is returned as part of the response and the browser needs to continue that trace id for the remaining work?

What are the other interesting use cases? Looking to learn more to improve my understanding of the browser side DT / traceresponse use cases.

yurishkuro commented 7 months ago

I have a somewhat made-up scenario (from a demo app): the UI reads the trace ID from the response header and uses it to display a link to the trace for the previous action. https://github.com/jaegertracing/jaeger/blob/e08f576fd64a992ef0396112bc8401472cc9dd92/examples/hotrod/services/frontend/web_assets/index.html#L109
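A rough sketch of that scenario, not taken from the linked demo: read the trace ID from a response header and turn it into a link. The header name `traceresponse`, the `/trace/{id}` URL pattern, and all function and type names here are assumptions made for illustration.

```typescript
// Minimal view of a fetch Response's headers, so the sketch also runs
// outside a browser. A real `Response.headers` object satisfies it.
interface HeaderReader {
  get(name: string): string | null;
}

// Hypothetical helper: builds a link to a tracing UI from the response header.
// Assumes the header value looks like 00-{trace-id}-{child-id}-01 and that
// the UI serves traces under `${uiBaseUrl}/trace/{trace-id}`.
function traceLinkFromResponse(
  headers: HeaderReader,
  uiBaseUrl: string
): string | null {
  const raw = headers.get("traceresponse");
  if (raw === null) return null;

  // The trace ID is the second dash-separated field.
  const traceId = raw.trim().split("-")[1] ?? "";
  if (!/^[0-9a-f]{32}$/.test(traceId)) return null;

  // Drop trailing slashes so the joined URL has exactly one separator.
  return `${uiBaseUrl.replace(/\/+$/, "")}/trace/${traceId}`;
}
```

In a browser this could be called as `traceLinkFromResponse(response.headers, uiBaseUrl)` after a `fetch`, with the result rendered as an anchor next to the action that triggered the request.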

kalyanaj commented 7 months ago

Added a third option to the above table (keep Traceresponse header but use server-timing header only when returning to browsers) for discussion.

This is based on the assumptions that:

If the above assumptions are true (I could be wrong here - not a browser expert) & if there's a way to disambiguate the initial page load, then this option may be worth discussing. I'm including it to avoid narrow framing and to widen our options for discussion.

jpkrohling commented 7 months ago

@kalyanaj, about the use cases where browser support is needed, I believe that @cedricziel can elaborate on that, but here's some more information and context: https://github.com/open-telemetry/opentelemetry-specification/issues/3811#issuecomment-1891737668

It's basically the same case that @yurishkuro mentioned before, the only change being that browser-based telemetry tools (like Grafana Faro) can use this header to create span links between frontend and backend traces.

basti1302 commented 7 months ago

Also, I am looking to understand better: the use cases where browser support is needed for traceresponse header, and...

Another use case is a customer support scenario. When a page load fails, having the trace context of that failed page load available in the browser enables showing the trace context on the error page or in automatic ticket creation. Customer support folks can then use the trace id to check the observability tooling to get more information about the failure.

But I believe being able to link the initial page load to the server side trace in the data the client side instrumentation sends to the observability backend is the most relevant use case.
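As a sketch of the page-load case: if the server put the trace context into Server-Timing, client-side instrumentation could read it from the navigation timing entry and keep a reference to the server span. The type and function names below are hypothetical, and treating the lowest flag bit as "sampled" assumes traceresponse flags follow traceparent.

```typescript
// Structural subset of the browser's PerformanceServerTiming entries.
interface ServerTimingEntry {
  name: string;
  description: string;
}

// Hypothetical shape for a reference to the server-side span.
interface RemoteSpanRef {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

// Finds a `traceresponse` metric and parses 00-{trace-id}-{child-id}-01.
function spanRefFromServerTiming(
  entries: ReadonlyArray<ServerTimingEntry>
): RemoteSpanRef | null {
  const entry = entries.find((e) => e.name === "traceresponse");
  if (entry === undefined) return null;

  const m = /^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    entry.description
  );
  if (m === null) return null;

  const [, traceId = "", spanId = "", flags = "00"] = m;
  // Assumption: bit 0 of the flags field means "sampled", as in traceparent.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Browser usage (not executed here):
//   const [nav] = performance.getEntriesByType("navigation") as PerformanceNavigationTiming[];
//   const spanRef = nav ? spanRefFromServerTiming(nav.serverTiming) : null;
```

The resulting reference could then be attached as a span link, or shown on an error page for the support scenario described above.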

That plus what Ben says here: https://github.com/open-telemetry/opentelemetry-specification/issues/3811#issuecomment-1898601419 -- even for requests other than the initial page load (XHR/fetch), using a custom header creates same-origin policy issues. (Yoav pointed out that cross-origin might still be an issue in Safari with Server-Timing, but in general the situation with respect to cross-origin is already a lot better with Server-Timing compared to custom headers.)