Revisit header name -- Server-Timing vs. traceresponse

jpkrohling commented 10 months ago

Related to #69, I would like to reopen the discussion around the header name for the client propagation of server tracing information.

The current state of the art among practitioners is to use the Server-Timing header, which browsers already safelist today. A new header, such as traceresponse, would require a lot of effort to get onto those lists, and it would take a long time before it became ubiquitous among client devices.

The linked issue was closed with a note that the group had decided against using Server-Timing, but without giving a reason. As I mentioned on that issue, judging by the minutes, my guess is that the reason is related to this comment:

Yoav: This should be opt in, with the bare minimum number of resources that you need.

If that's the concern, isn't the server side already opting in by adding the response metric to this header? Like:

Server-Timing: traceresponse;desc=00-{trace-id}-{child-id}-01

If there's no other reason, I would like to propose a change to the current draft, so that the traceresponse isn't a header, but a metric of the Server-Timing header. This way, we can co-exist with other competing standards and offer a lower-friction migration path to people using Server-Timing today.
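
To illustrate the proposal, here is a minimal sketch that formats the trace context as a traceresponse metric inside Server-Timing and reads it back on the receiving side. The helper names are hypothetical, the IDs are sample values, and the parser is deliberately simplified: it assumes no commas or semicolons inside quoted values, which the full Server-Timing grammar allows.

```python
def format_traceresponse_metric(trace_id: str, child_id: str, flags: str = "01") -> str:
    """Build a Server-Timing entry that carries the trace context in `desc`."""
    return f"traceresponse;desc=00-{trace_id}-{child_id}-{flags}"


def parse_server_timing(value: str) -> dict[str, dict[str, str]]:
    """Parse a Server-Timing value into {metric name: {param name: param value}}.

    Simplification (assumption): splits on ',' and ';' without honoring
    quoted strings, so it is not a complete parser for the header grammar.
    """
    metrics: dict[str, dict[str, str]] = {}
    for entry in value.split(","):
        parts = [part.strip() for part in entry.split(";")]
        name = parts[0]
        if not name:
            continue
        params: dict[str, str] = {}
        for part in parts[1:]:
            if not part:
                continue
            key, _, raw = part.partition("=")
            params[key.strip().lower()] = raw.strip().strip('"')
        # Keep the first occurrence if a metric name appears twice.
        metrics.setdefault(name, params)
    return metrics


# An existing timing metric and the trace context share one header value.
header = "db;dur=53.2, " + format_traceresponse_metric(
    "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"
)
trace_context = parse_server_timing(header)["traceresponse"]["desc"]
```

Existing Server-Timing consumers would see traceresponse as one more metric name, which is where the lower-friction migration path comes from.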

References:

basti1302 commented 10 months ago

dyladan commented 9 months ago

@jpkrohling thanks for bringing this back up. I've reached out to a few people who were involved in the initial discussion to see if they can provide more context as to why server-timing was rejected before, since nobody currently involved in this group was around at that time. We also discussed this in the meeting last week. Here is a quick summary of what we discussed:

This is the solution already implemented by many modern APM vendors
Already implemented in many modern browsers https://caniuse.com/server-timing
- roughly 75% of users tracked by caniuse.com
- no iOS support
- insufficient Safari desktop support (only available in the network inspector, not through the JS API)
In 2018, when #69 was closed, support was in Chrome only and was behind an experiment flag
server-timing is subject to the same-origin policy unless the server opts in using timing-allow-origin
- even if we define our own header, it is likely we would have similar restrictions in order to appease browser vendors
server-timing is restricted to secure contexts
This use seems to go against the intended use case for server-timing: trace IDs are not a "timing" or a "metric"
server-timing is still a draft spec, and we are not sure whether it is OK for us to build another spec on top of it while it is a draft

Regarding the last 2 items, we intend to discuss these with the server-timing specification editors to see if they are actually a problem or not.
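
To make the cross-origin point concrete, here is a minimal sketch of the response headers a server might emit so that a page on another origin can read the metric. The function name and the example origin are hypothetical. Timing-Allow-Origin is the opt-in header mentioned in the summary above.

```python
def timing_response_headers(trace_context: str, allowed_origins: list[str]) -> dict[str, str]:
    """Return response headers exposing the trace context through Server-Timing.

    `allowed_origins` lists the origins that may read timing data
    cross-origin. An empty list means no opt-in header is sent, so only
    same-origin pages can read the metric.
    """
    headers = {"Server-Timing": f"traceresponse;desc={trace_context}"}
    if allowed_origins:
        # Timing-Allow-Origin takes a comma-separated list of origins (or "*").
        headers["Timing-Allow-Origin"] = ", ".join(allowed_origins)
    return headers


headers = timing_response_headers(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    ["https://app.example.com"],  # hypothetical frontend origin
)
```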

dyladan commented 9 months ago

Met with Yoav from the web performance group today about this. Notes from the meeting:

There has not been a strong demand from the community for the specification to become stable. If it is a concern, we can push forward on it. The person most responsible for driving the spec has moved on.
Sergey: stability helped with discussions with .NET for trace context
Bastian: if we want to become stable, we would want to rely on stable specifications (at least CR)
Yoav: likely there would not be strong objections
There has not been wide review for the spec yet
Sergey: would the web performance group have any specific objection to using server-timing for server use cases?
Yoav: the header is optimized for timing metrics, but has been used for other things quite a bit. There is no specific objection to that. If there is a limitation, the API can be expanded.
Yoav: server to server seems perfectly fine and unrelated to the web API other than IANA registration
Sergey: Are there any well known non-timing use cases?
Yoav: not aware of that but it is possible
Dan: is there anything we need to be aware of for browser support? iOS and Safari support is not yet at the point where it is usable
Yoav: the WebKit implementation has been behind a flag since 2018: https://github.com/w3c/server-timing/issues/89 They are concerned about server-timing across origins. They wanted to block Server-Timing on cross-origin responses even when the server uses the timing-allow-origin header or CORS opt-in. A recent version of the spec allows this exemption. They also want to block resource timing across origins. They have said they would enable it for same-origin. Browser implementations also differ slightly because Firefox supports trailers and Chrome does not.
Merging multi-headers and trailers is something we need to consider. Should not be a problem as long as the metric names are different among the different headers/trailers
Related issue: "Semantics: Server-Timing is too limited in scope, rename to Server-Metrics" (w3c/server-timing issue #77)
PLH: is it possible for us to reserve a metric name for our specific use?
Yoav: open to it, but don’t see a strong need
PLH: how can we discover which metric names are already in use?
Yoav: suggests HTTP archive as one data source to look for conflicts
Sergey: Is there a recommendation for how proxies and load balancers should handle the server-timing header?
Yoav: there are some open issues but nothing resolved
Kalyana: how difficult was it to convince browsers to implement it?
Yoav: did this when at Akamai. At that time it was specified in Chromium and WebKit, and Firefox followed suit. This was pushed for by the CDN and browser delivery ecosystem
Kalyana: from w3c perspective, how much pushback would you expect from w3c review on piggy-backing on server-timing?
PLH: if something comes up, it would be with the server-timing header itself not with our use of it
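
On the point about merging multiple headers and trailers, a small sketch of the check implied there: collect the metric names from every Server-Timing field value and flag any name that occurs more than once. The function names are hypothetical, and the splitting is simplified (it ignores quoted strings).

```python
from collections import Counter


def metric_names(field_values: list[str]) -> list[str]:
    """Return the metric names found across several Server-Timing field values."""
    names = []
    for value in field_values:
        for entry in value.split(","):
            # The metric name is everything before the first ';'.
            name = entry.split(";", 1)[0].strip()
            if name:
                names.append(name)
    return names


def conflicting_names(header_values: list[str], trailer_values: list[str]) -> set[str]:
    """Return metric names that appear more than once across headers and trailers."""
    counts = Counter(metric_names(header_values + trailer_values))
    return {name for name, count in counts.items() if count > 1}


headers = [
    "cache;desc=hit, db;dur=53.2",
    "traceresponse;desc=00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
]
trailers = ["total;dur=123.4"]

no_conflicts = conflicting_names(headers, trailers)
with_conflict = conflicting_names(headers, trailers + ["traceresponse;desc=other"])
```

As long as traceresponse appears only once, merging stays unambiguous. A duplicate name is the case a consumer would have to resolve.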

kalyanaj commented 9 months ago

Thanks @dyladan for the above notes and summary. As discussed in the W3C DT working group meeting, can we discuss the evaluation criteria (ideally, we should rank them in terms of the most important to least important) and then score these options for those criteria? Please let me know if you prefer a different approach.

Here's my initial attempt at the list of ranked criteria & how these two options (Update: adding a third option for discussion) meet those criteria. Please feel free to edit the below contents directly so that we can collaboratively close on this list:

[Must be standards based] The mechanism must be (or have a path to be) an official W3C standard.
[Trace context propagation from callees to callers] The mechanism must enable propagating traceid, callee span id, flags (sampled flag, random traceid flag, any other future flags) from callees to callers.
[Supported by browsers] The mechanism must have wide support in different browser implementations, so that the above trace context information can be used (e.g., DT for file load) or any browser related DT use cases.
[Supported for server to server use cases] The mechanism must support use for server (callee) to server (caller) trace context propagation.
[Must be extensible in the future] The mechanism must be extensible to support future needs in a backwards compatible manner (e.g., using a version field).
[Semantically clean] The mechanism must cleanly fit with the trace context semantics.
[Reasonably simple to implement] The mechanism must not be unduly complex for implementations.
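
For the propagation and extensibility criteria, here is a rough sketch of what a caller would extract from the value. It assumes the four-field layout from the example earlier in this thread (00-{trace-id}-{child-id}-01), with the field widths and flag bits of traceparent. The class and function names are hypothetical.

```python
from dataclasses import dataclass

SAMPLED_FLAG = 0x01
RANDOM_TRACE_ID_FLAG = 0x02  # flag bit as defined for traceparent in Trace Context Level 2

_HEX = set("0123456789abcdef")


@dataclass(frozen=True)
class TraceResponse:
    version: str
    trace_id: str
    child_id: str
    flags: int

    @property
    def sampled(self) -> bool:
        return bool(self.flags & SAMPLED_FLAG)

    @property
    def random_trace_id(self) -> bool:
        return bool(self.flags & RANDOM_TRACE_ID_FLAG)


def parse_traceresponse(value: str) -> TraceResponse:
    """Parse 'version-traceid-childid-flags'; raise ValueError if malformed.

    Assumption: all four fields are present and use lowercase hex.
    """
    fields = value.strip().split("-")
    if len(fields) != 4:
        raise ValueError("expected four dash-separated fields")
    expected_lengths = (2, 32, 16, 2)
    for field, length in zip(fields, expected_lengths):
        if len(field) != length or not set(field) <= _HEX:
            raise ValueError(f"malformed field: {field!r}")
    version, trace_id, child_id, flags = fields
    return TraceResponse(version, trace_id, child_id, int(flags, 16))


parsed = parse_traceresponse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

The version field is what would let a future format add fields in a backwards compatible way: a stricter or looser parser could be selected based on it.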

Are we missing any other major criteria for decision making? Should we add the ones about the proxies/loadbalancers handling?

Here's an attempt at scoring these two options for the above criteria.

Criteria	Traceresponse header	Using server-timing header	USE BOTH!: Traceresponse header for the most part + use server-timing only for initial page load by browsers
Must be standards based	Yes (there's a path)	Yes (there's a path)	Yes (there's a path)
Trace context propagation from callees to callers	Yes	Yes	Yes
Supported by browsers	No (complex to gain adoption)	Yes	Yes
Supported for server-server	Yes	Yes	Yes
Must be extensible in the future	Yes	Yes	Yes
Semantically clean	Yes	No (arguable)	Yes
Reasonably simple to implement	Yes	Yes	TBD

Thoughts? Please feel free to edit directly the list & table above.

kalyanaj commented 9 months ago

Also, I am looking to understand better:

the use cases where browser support is needed for traceresponse header, and...
if those necessitate ranking this criterion higher than other criteria (such as all non-browser scenarios & semantic cleanliness).

Is it the file load scenario, where a call is made to download a file, a trace id is returned as part of the response, and the browser needs to continue that trace id for the remaining work?

What are the other interesting use cases? Looking to learn more to improve my understanding of the browser side DT / traceresponse use cases.

yurishkuro commented 9 months ago

I have a somewhat made-up scenario (from a demo app) - the UI reads the traceID from the response header and uses it to display a link to the trace for the previous action. https://github.com/jaegertracing/jaeger/blob/e08f576fd64a992ef0396112bc8401472cc9dd92/examples/hotrod/services/frontend/web_assets/index.html#L109
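
A rough sketch of that idea, not the code from the linked demo: take the trace ID out of a traceresponse-style response header and turn it into a link to the trace. The header name, the value layout, and the URL pattern are all assumptions for illustration.

```python
from typing import Mapping, Optional


def trace_link(response_headers: Mapping[str, str], ui_base_url: str) -> Optional[str]:
    """Return a link to the trace named in the response, or None if absent.

    Assumes a `traceresponse` header shaped like 'version-traceid-childid-flags'
    and a hypothetical tracing UI that serves traces under '/trace/<id>'.
    """
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in response_headers.items()}
    value = lowered.get("traceresponse")
    if value is None:
        return None
    fields = value.strip().split("-")
    if len(fields) < 2 or not fields[1]:
        return None
    return f"{ui_base_url.rstrip('/')}/trace/{fields[1]}"


link = trace_link(
    {"TraceResponse": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"},
    "https://tracing.example.com/",  # hypothetical tracing UI
)
```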

kalyanaj commented 9 months ago

Added a third option to the above table (keep Traceresponse header but use server-timing header only when returning to browsers) for discussion.

This is based on the assumptions that:

a new header (traceresponse) may not be pragmatic for the initial page load use cases.
however, for other requests (within CORS rules), any headers (including traceresponse) can be sent/received.

If the above assumptions are true (I could be wrong here, as I am not a browser expert) and if there's a way to disambiguate the initial page load, then this option may be worth discussing. I am including it to avoid narrow framing and to widen our options for discussion.
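
One possible way to disambiguate, which has not been discussed in this thread and is only an assumption here, would be the Sec-Fetch-Mode request header, which browsers that implement Fetch Metadata set to "navigate" for page navigations. A sketch of the third option under that assumption, with a hypothetical function name:

```python
from typing import Mapping


def choose_response_header(
    request_headers: Mapping[str, str], trace_context: str
) -> tuple[str, str]:
    """Pick the response header (name, value) that carries the trace context.

    Assumption: a request with 'Sec-Fetch-Mode: navigate' is a page load.
    Clients that do not send the header (older browsers, server-to-server
    callers) fall through to the plain traceresponse header.
    """
    lowered = {name.lower(): value for name, value in request_headers.items()}
    is_navigation = lowered.get("sec-fetch-mode", "").strip().lower() == "navigate"
    if is_navigation:
        return ("Server-Timing", f"traceresponse;desc={trace_context}")
    return ("traceresponse", trace_context)


context = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
page_load = choose_response_header({"Sec-Fetch-Mode": "navigate"}, context)
api_call = choose_response_header({"Sec-Fetch-Mode": "cors"}, context)
```

This only covers navigations from browsers that send the header, so it would not be a complete answer to the disambiguation question.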

jpkrohling commented 9 months ago

@kalyanaj, about the use cases where browser support is needed: I believe that @cedricziel can elaborate on that, but here's some more information and context: https://github.com/open-telemetry/opentelemetry-specification/issues/3811#issuecomment-1891737668

It's basically the same case that @yurishkuro mentioned before, the only change being that browser-based telemetry tools (like Grafana Faro) can use this header to create span links between frontend and backend traces.

basti1302 commented 9 months ago

Also, I am looking to understand better: the use cases where browser support is needed for traceresponse header, and...

Another use case is a customer support scenario. When a page load fails, having the trace context of that failed page load available in the browser enables showing the trace context on the error page or in automatic ticket creation. Customer support folks can then use the trace id to check the observability tooling to get more information about the failure.

But I believe being able to link the initial page load to the server side trace in the data the client side instrumentation sends to the observability backend is the most relevant use case.

That plus what Ben says here: https://github.com/open-telemetry/opentelemetry-specification/issues/3811#issuecomment-1898601419 -- even for requests other than the initial page load (XHR/fetch), using a custom header creates same-origin policy issues. (Yoav pointed out that cross-origin might still be an issue in Safari with Server-Timing, but in general the situation with respect to cross-origin is already a lot better with Server-Timing compared to custom headers.)

gredler commented 1 week ago

Are there any updates on this topic? I see that the Level 3 draft published to the website still uses traceresponse. However, migration sounded pretty likely here: https://github.com/open-telemetry/opentelemetry-specification/issues/3811#issuecomment-1917893672

w3c / trace-context

Revisit header name -- Server-Timing vs. traceresponse #556