Open jpkrohling opened 10 months ago
@jpkrohling thanks for bringing this back up. I've reached out to a few people who were involved in the initial discussion to see if they can provide more context as to why server-timing was rejected before, since nobody currently involved in this group was around at that time. We also discussed this in the meeting last week. Here is a quick summary of what we discussed:
timing-allow-origin
Regarding the last 2 items, we intend to discuss these with the server-timing specification editors to see if they are actually a problem or not.
Met with Yoav from the web performance group today about this. Notes from the meeting:
Thanks @dyladan for the above notes and summary. As discussed in the W3C DT working group meeting, can we discuss the evaluation criteria (ideally, we should rank them in terms of the most important to least important) and then score these options for those criteria? Please let me know if you prefer a different approach.
Here's my initial attempt at the list of ranked criteria & how these two options (Update: adding a third option for discussion) meet those criteria. Please feel free to edit the below contents directly so that we can collaboratively close on this list:
Are we missing any other major criteria for decision making? Should we add the ones about the proxies/loadbalancers handling?
Here's an attempt at scoring these two options for the above criteria.
Criteria | Traceresponse header | Using server-timing header | USE BOTH!: Traceresponse header for the most part + use server-timing only for initial page load by browsers |
---|---|---|---|
Must be standards based | Yes (there's a path) | Yes (there's a path) | Yes (there's a path) |
Trace context propagation from callees to callers | Yes | Yes | Yes |
Supported by browsers | No (complex to gain adoption) | Yes | Yes |
Supported for server-server | Yes | Yes | Yes |
Must be extensible in the future | Yes | Yes | Yes |
Semantically clean | Yes | No (arguable) | Yes |
Reasonably simple to implement | Yes | Yes | TBD |
Thoughts? Please feel free to edit the list and table above directly.
Also, I am looking to understand better:
Is it the file load scenario: where a call is made to download a file and a trace id is returned as part of the response and the browser needs to continue that trace id for the remaining work?
What are the other interesting use cases? Looking to learn more to improve my understanding of the browser side DT / traceresponse use cases.
I have a somewhat made-up scenario (from a demo app) - the UI reads the traceID from the response header and uses it to display a link to the trace for the previous action. https://github.com/jaegertracing/jaeger/blob/e08f576fd64a992ef0396112bc8401472cc9dd92/examples/hotrod/services/frontend/web_assets/index.html#L109
Added a third option to the above table (keep Traceresponse header but use server-timing header only when returning to browsers) for discussion.
This is based on the assumptions that:
If the above assumptions are true (I could be wrong here, I'm not a browser expert) and if there's a way to disambiguate the initial page load, then this option may be worth discussing. I'm including it to avoid narrow framing and to widen our options for discussion.
@kalyanaj, about the use cases where browser support is needed, I believe that @cedricziel can elaborate on that, but here's some more information and context: https://github.com/open-telemetry/opentelemetry-specification/issues/3811#issuecomment-1891737668
It's basically the same case that @yurishkuro mentioned before, the only change being that browser-based telemetry tools (like Grafana Faro) can use this header to create span links between frontend and backend traces.
Also, I am looking to understand better: the use cases where browser support is needed for traceresponse header, and...
Another use case is a customer support scenario. When a page load fails, having the trace context of that failed page load available in the browser enables showing the trace context on the error page or in automatic ticket creation. Customer support folks can then use the trace id to check the observability tooling to get more information about the failure.
But I believe being able to link the initial page load to the server side trace in the data the client side instrumentation sends to the observability backend is the most relevant use case.
That plus what Ben says here: https://github.com/open-telemetry/opentelemetry-specification/issues/3811#issuecomment-1898601419 -- even for requests other than the initial page load (XHR/fetch), using a custom header creates same-origin policy issues. (Yoav pointed out that cross-origin might still be an issue in Safari with Server-Timing, but in general the situation with respect to cross-origin is already a lot better with Server-Timing compared to custom headers.)
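To make the use cases above concrete (linking a page load to its server-side trace, or showing a trace ID on an error page), here is a minimal sketch of how a client could pull trace context out of a `Server-Timing` value. It rests on assumptions not settled in this thread: the metric name `traceresponse`, a traceparent-style value carried in `desc`, and the hypothetical trace UI base URL. The parser is deliberately simplified and does not handle escaped quotes or semicolons inside quoted strings.

```python
import re
from typing import Dict, List, Optional

# Assumed value format: version "00", 32-hex trace id, 16-hex span id, 2-hex flags.
_TRACE_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def split_metrics(header_value: str) -> List[str]:
    """Split a Server-Timing value on commas that are outside double quotes."""
    parts, buf, in_quotes = [], [], False
    for ch in header_value:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == "," and not in_quotes:
            parts.append("".join(buf).strip())
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf).strip())
    return [p for p in parts if p]


def find_trace_context(header_value: str, name: str = "traceresponse") -> Optional[Dict]:
    """Return the trace context from the metric called `name`, or None."""
    for metric in split_metrics(header_value):
        fields = [f.strip() for f in metric.split(";")]
        if fields[0] != name:
            continue
        for param in fields[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() != "desc":
                continue
            match = _TRACE_RE.match(value.strip().strip('"'))
            if match:
                return {
                    "trace_id": match.group(1),
                    "span_id": match.group(2),
                    # Lowest bit of the flags byte is the sampled flag.
                    "sampled": bool(int(match.group(3), 16) & 1),
                }
    return None


def trace_url(base: str, trace_id: str) -> str:
    """Build a link to a trace in a (hypothetical) tracing UI."""
    return f"{base.rstrip('/')}/{trace_id}"
```

An error page or support-ticket form could call `find_trace_context` on the header value and, if it returns a result, show `trace_url(...)` to the user or attach the trace ID to the ticket.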
Are there any updates on this topic? I see that the Level 3 draft published to the website still uses `traceresponse`. However, migration sounded pretty likely here: https://github.com/open-telemetry/opentelemetry-specification/issues/3811#issuecomment-1917893672
Related to #69, I would like to reopen the discussion around the header name for the client propagation of server tracing information.
The current state of the art among practitioners is to use the Server-Timing header, which is part of a safe-list of browsers today. A new header, such as `traceresponse`, would require a lot of effort to get included in those lists and take a long time before this is ubiquitous among client devices.
The linked issue was closed stating that it was decided against using Server-Timing, but without giving a reason for that. As I mentioned on that issue, by looking at the minutes, I could guess that the reason is related to this comment:
If that's the concern, isn't the server side already opting in by adding the response metric to this header? Like:
If there's no other reason, I would like to propose a change to the current draft, so that the traceresponse isn't a header, but a metric of the Server-Timing header. This way, we can co-exist with other competing standards and offer a lower-friction migration path to people using Server-Timing today.
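As a rough illustration of the proposal, the sketch below shows how a server might emit the trace response as a metric inside `Server-Timing` rather than as a separate header. The metric name `traceresponse`, the use of the `desc` parameter, and the traceparent-style value layout are assumptions made for illustration; none of them is defined by the current draft or by this issue.

```python
import re
from typing import Optional

_TRACE_ID_RE = re.compile(r"^[0-9a-f]{32}$")
_SPAN_ID_RE = re.compile(r"^[0-9a-f]{16}$")


def trace_metric(trace_id: str, span_id: str, sampled: bool,
                 name: str = "traceresponse") -> str:
    """Format one Server-Timing metric carrying trace context.

    Assumed layout: <name>;desc="00-<trace-id>-<span-id>-<flags>".
    """
    if not _TRACE_ID_RE.match(trace_id) or trace_id == "0" * 32:
        raise ValueError("trace_id must be 32 lowercase hex chars and not all zeros")
    if not _SPAN_ID_RE.match(span_id) or span_id == "0" * 16:
        raise ValueError("span_id must be 16 lowercase hex chars and not all zeros")
    flags = "01" if sampled else "00"
    return f'{name};desc="00-{trace_id}-{span_id}-{flags}"'


def append_metric(existing: Optional[str], metric: str) -> str:
    """Add the metric to an existing Server-Timing value without replacing it."""
    return f"{existing}, {metric}" if existing else metric
```

Because `append_metric` only appends, metrics that an application already reports through `Server-Timing` (for example `db;dur=53`) stay in place, which is the coexistence this proposal is aiming for.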
References: