w3c / server-timing

Server Timing
http://w3c.github.io/server-timing/

Having duration with no start time makes it hard to profile servers with parallel processes. #10

Closed aickin closed 7 years ago

aickin commented 7 years ago

I was really excited to find out about this spec (via Paul Irish's recent tweet), but on diving in a little I realized the spec doesn't quite cover my use case.

I've worked intermittently on an open-source project called react-server, which aims to be a full-featured server framework for React. Recently, someone created a Chrome DevTools extension that lets you examine server timings from react-server using a proprietary way of sending timings, and I was hopeful that it could be replaced by Server Timing. Unfortunately, it can't, because react-server performs many of its server actions in parallel, and this API doesn't support that well.

To see what I mean, imagine a server written something like this:

function renderFirstPart(data) { /* snip */ }
function renderSecondPart(data) { /* snip */ }

const timings = {};

function timePromise(name, promise) {
 const start = Date.now();
 return promise.then((result) => {
  timings[name] = Date.now() - start;
  return result;
 });
}

function renderPage() {
 // make all the API requests in parallel.
 const request1 = timePromise("req1", fetch("http://api/url1"));
 const request2 = timePromise("req2", fetch("http://api/url2"));

 // when they come back, render out parts of the page:
 const firstPart = timePromise("render1", request1.then(renderFirstPart));
 // the second part depends on request2 and the first part being rendered.
 const secondPart = timePromise("render2",
  Promise.all([request2, firstPart]).then(renderSecondPart));

 // send the trailer only after all timed work has finished.
 return secondPart.then(() => sendServerTimingTrailer(timings));
}

Now, let's assume that we get the following timings back from the server:

req1: 1000ms
req2: 1200ms
render1: 500ms
render2: 500ms

Naively, we'd think that the best thing to start optimizing is req2, since it takes the most time. But even if we optimize req2 down to 0ms, the page will take exactly the same amount of time. Why? Because req2 is being done in parallel with req1 and render1, which we can only see with a waterfall display of the server events:

req1:    =========>           1000ms
req2:    ===========>         1200ms
render1:           ====>      500ms
render2:                ====> 500ms

This sort of visualization shows that req1, render1, and render2 are on the critical path, but req2 isn't. Unfortunately, it's not the kind of visualization we can create with the data in Server Timings, because we only get duration.
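To make the dependency argument concrete, here's a minimal sketch (the names and the dependency graph come from the example above; nothing here is part of the spec) that computes the page's end-to-end time from the four durations:

```javascript
// Minimal sketch: compute end-to-end time from the dependency graph above.
// req1 and req2 start at t=0; render1 waits on req1; render2 waits on both
// req2 and render1.
function pageTime({ req1, req2, render1, render2 }) {
  const req1End = req1;
  const req2End = req2;
  const render1End = req1End + render1;
  const render2End = Math.max(req2End, render1End) + render2;
  return render2End;
}

// With the durations from the waterfall:
pageTime({ req1: 1000, req2: 1200, render1: 500, render2: 500 }); // 2000
// Optimizing req2 all the way down to zero changes nothing:
pageTime({ req1: 1000, req2: 0, render1: 500, render2: 500 });    // 2000
```

Only shortening req1, render1, or render2 moves the total, which is exactly what a duration-only summary can't show.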

I know this would be a major change to the spec, but is there any chance that you might be willing to include a relative start time along with duration for servers that perform processing in parallel?

sroussey commented 7 years ago

That is out of scope of the spec. The spec is super simple in this regard, and is meant as a summary. The data in headers can balloon and kill requests, which is what happened many years ago with FirePHP and the wildfire method of sending data over headers. In general, the data can get complicated and large, so people drop an id in the header and the tool does a callback to get the full data in a normal request. Clockwork is an example of this. We forked Clockwork to be more general for our use cases, and called it Clockwork Orange.

aickin commented 7 years ago

The data in headers can balloon and kill requests,

@sroussey thanks for the feedback!

I'm a little confused about what you are saying, though. I think I'm asking for just one new number per metric (a start time to go with each duration sent down in the Server-Timing header); do you think that would balloon and kill requests?

sroussey commented 7 years ago

Maybe the header should have been called "Server-Timing-Summaries". ;) Right now, DB calls are summed together in one header, for example. Summaries don't have beginning and ending times.

If it were "Server-Timeline-Event" instead, you would likely have devs creating a header for each DB call (with start and end time), which might be thousands on a single request (well, hopefully not, but still). That many headers would kill a request (we did things this way once, and sure enough it failed). And the ordering of the headers would then matter.
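The summary model described here can be sketched roughly as follows (a hypothetical example using the draft-era name=duration;description syntax; the numbers and the "db-summary" label are invented): many DB calls are aggregated server-side into a single header value rather than emitted as one header per call.

```javascript
// Hypothetical sketch: sum individual DB call durations server-side and
// emit a single summary metric instead of one header per call.
const dbCallDurations = [12, 48, 7, 31]; // ms; made-up numbers

const total = dbCallDurations.reduce((sum, d) => sum + d, 0);
const header = `db=${total};db-summary`; // one line, regardless of call count

// header === "db=98;db-summary"
```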

aickin commented 7 years ago

If it were "Server-Timeline-Event" instead, you would likely have devs creating a header for each DB call (with start and end time), which might be thousands on a single request (well, hopefully not, but still).

I'm not clear why having a start and end time makes it more or less likely that a dev would create a header for each DB call. It seems to me equally useful to have a header for each DB call, no matter whether the data returned is duration or duration + start time.

Why do you think that adding a start time would make a dev multiply the number of events tracked?

And the ordering of the headers now matters.

I don't see why header ordering would matter if each entry has a start time. What makes you think that?

Thanks!

igrigorik commented 7 years ago

We discussed this on public-web-perf and deliberately chose to omit startTime: https://lists.w3.org/Archives/Public/public-web-perf/2014Dec/0023.html

The primary issue is that client startTime != server startTime != proxy startTime. The client's time is measured relative to its time origin, which depends on the navigation and/or worker type and has no meaningful mapping to server or proxy clocks. If we get a bunch of startTimes from the server and a proxy, there is no way to resolve which came first, or how to map them onto the client's timeline.
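A toy numeric illustration of this point (all numbers invented): naively subtracting the client's time origin from a server-reported timestamp produces garbage whenever the two clocks disagree.

```javascript
// Toy illustration: server and client wall clocks disagree by an unknown
// offset, so a server-sent startTime can't be placed on the client timeline.
const clientTimeOrigin = 1_500_000_000_000; // client navigation start (ms since epoch)
const serverClockSkew = -37_000;            // unknown to the client
const serverReportedStart = 1_500_000_001_000 + serverClockSkew;

// Naive mapping puts the server event *before* the navigation even started:
const naiveOffset = serverReportedStart - clientTimeOrigin; // -36000
```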

My proposal here would be to think if/how you can communicate some of the same information via other means.. e.g., group by descriptions:

Server-Timing: a=1000;group-1, b=1200;group-1, c=50;group-2
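One hedged sketch of consuming such grouped entries on the client (hand-parsing the draft-era name=duration;description syntax shown above; none of this is specified behavior):

```javascript
// Sketch: parse a Server-Timing header value (old name=duration;description
// syntax) and bucket metrics by their description/group.
function groupServerTimings(header) {
  const groups = {};
  for (const entry of header.split(",")) {
    const [metric, group = "ungrouped"] = entry.trim().split(";");
    const [name, dur] = metric.split("=");
    (groups[group] = groups[group] || []).push({ name, dur: Number(dur) });
  }
  return groups;
}

groupServerTimings("a=1000;group-1, b=1200;group-1, c=50;group-2");
// → { "group-1": [{name:"a",dur:1000}, {name:"b",dur:1200}],
//     "group-2": [{name:"c",dur:50}] }
```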

danielcompton commented 7 years ago

I had the exact same thoughts about adding a start time when I saw the spec. I can understand how having server and proxy times isn't really possible to get right, but I would suggest that this shouldn't let perfect get in the way of good.

On the other hand, it depends a little bit on the intended use of Server-Timing.

If Server-Timing is meant to be a full client-and-server request tracing tool, then start/end times are important. The discussion here suggests that Server-Timing isn't intended to replace or provide a full server-side request timeline; rather, it provides higher-level timing summaries and is intended to provide metrics on the overall request/response, including CDNs, load balancers, proxies, etc. Those metrics may be harder to collect in a unified way on the server, so collecting them in the client makes sense. If that is the intended purpose, then I can see the logic in deliberately keeping the API limited.

igrigorik commented 7 years ago

The discussion here suggests that Server-Timing isn't intended to replace or provide a full server-side request timeline; rather, it provides higher-level timing summaries and is intended to provide metrics on the overall request/response, including CDNs, load balancers, proxies, etc.

Yep. ST is not a general purpose tracing framework. We deliberately designed it as a minimal surface to enable summary server diagnostics -- nested records, execution traces, etc., are out of scope and not something we want to encourage via this API.

Stepping back: the original question here was about the absence of startTime timestamps for such events, which is a recurring question that we should address and explain in the spec. /cc @cvazac

igrigorik commented 7 years ago

https://github.com/w3c/server-timing/pull/15 added a note to explain current behavior. Closing, feel free to reopen if we need to revisit.