w3c / navigation-timing

Navigation Timing
https://w3c.github.io/navigation-timing/

Add `confidence` field to PerformanceNavigationTiming #202

Open mwjacksonmsft opened 3 months ago

mwjacksonmsft commented 3 months ago

Web applications may suffer from bimodal distribution in page load performance, due to factors outside of the web application’s control. For example:

In these scenarios, content the web app attempts to load will be in competition with other work happening on the system. This makes it difficult to detect if performance issues exist within web applications themselves, or because of external factors.

Teams we have worked with have been surprised at the difference between real-world dashboard metrics and what they observe in page profiling tools. Without more information, it is challenging for developers to understand if (and when) their applications may be misbehaving or are simply being loaded in a contended period.

A new ‘confidence’ field on the PerformanceNavigationTiming object will enable developers to discern whether the navigation timings are representative for their web application.
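To illustrate the intended use, here is a hedged consumer-side sketch; the confidence shape (a value of "high" or "low") follows the explainer and is not shipped API, and shouldReportTimings / sendToAnalytics are hypothetical names:

```javascript
// Hypothetical consumer-side sketch; `confidence` and its "high"/"low"
// value follow the explainer and are not shipped API.
function shouldReportTimings(navEntry) {
  // Only treat timings as representative when the randomized value is "high".
  return navEntry.confidence != null && navEntry.confidence.value === 'high';
}

// In a browser this would be driven by something like:
//   const [nav] = performance.getEntriesByType('navigation');
//   if (shouldReportTimings(nav)) sendToAnalytics(nav);
```

Because the value is randomized, naive local filtering like this is biased in aggregate; debiasing comes up later in the thread.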

Explainer: https://github.com/MicrosoftEdge/MSEdgeExplainers/blob/main/PerformanceNavigationTiming%20for%20User%20Agent%20Launch/explainer.md

Chromium Status: https://chromestatus.com/feature/5186950448283648

/cc @yoavweiss

clelland commented 3 months ago

@csharrison FYI

csharrison commented 3 months ago

Thanks for tagging me. I am excited to see this proposal progress. On the last web perf call there was some mention of making this extensible to multiple data types beyond confidence on PerformanceNavigationTiming. Is that planned / likely? The reason I ask is that we may want to consider noising mechanisms that support multi-dimensional data to future-proof us in that case.
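For context, the simplest such noising mechanism is binary randomized response; a minimal sketch (illustrative only, not part of the proposal):

```javascript
// Illustrative binary randomized response (2-ary k-RR), the kind of
// local-DP mechanism under discussion here.
function randomizedResponse(trueValue, epsilon) {
  // Report the truth with probability e^eps / (e^eps + 1); otherwise flip.
  const keepProb = Math.exp(epsilon) / (Math.exp(epsilon) + 1);
  if (Math.random() < keepProb) return trueValue;
  return trueValue === 'high' ? 'low' : 'high';
}

// With epsilon = ln(3), the truth is kept with probability 0.75.
```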

mmocny commented 3 months ago

Wanted to chime in with some soft feedback about just the API shape:

Right now the proposal is to just expose a confidence field with a literal primitive value ("high" or "low").

Any existing user / observer of the navigation timing API (there are lots) just looking at the raw output would need to know ahead of time that this is a "fuzzed" value with a specific epsilon. It feels to me like most folks in most situations would not have this extra context, and it would be worth being more self-documenting as it is a very new use case...

Strawman: What about some value-wrapper to make it very explicit that the value is fuzzed?:

interface Fuzzy<T, U extends number> {
  fuzzyValue: T;
  epsilon: U;
}

Then confidence would be of type Fuzzy<string, 1.1>. This would be much more self-documenting for any readers, and would probably also extrapolate better for adding more such values.


I do see a reference in the alternatives-considered section to separating the ancillary data and exposing the triggerRate, which somewhat overlaps with this -- but that alternative evaluated grouping all the data values together and highlighted the complexities there.

I am not sure: is coupling all fuzzed values a necessary requirement in order to maintain plausible deniability? It seems to me like maybe not, unless the values being reported are inherently correlated?


I guess another alternative would just be to rely on naming convention: fuzzyMaybeConfidenceValue: "high" but I like that less.

csharrison commented 3 months ago

+1 to including the epsilon value (or some rate of flipping) in the API itself. This provides a few benefits:

In either of these cases, providing the value upfront ensures consumers can interpret the data properly (which is important for the debiasing step).
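For instance, assuming the binary mechanism replaces a report with a uniformly random value with probability p (the rate the API would expose upfront), the server-side debiasing step might look like this sketch:

```javascript
// Sketch of aggregate debiasing under binary randomized response, where
// each report is replaced by a uniformly random value with probability p.
function estimateTrueHighCount(observedHighCount, totalReports, p) {
  // E[observed] = trueHigh * (1 - p) + totalReports * p / 2, so invert:
  return (observedHighCount - totalReports * p / 2) / (1 - p);
}

// e.g. 625 observed "high" out of 1000 reports at p = 0.5 suggests
// roughly 750 genuinely high-confidence navigations.
```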

mwjacksonmsft commented 3 months ago

@mmocny - Thanks for that feedback. I can update the proposal to ensure we capture the triggerRate as part of the API shape. I don't see anything in the webidl spec that describes exactly what you suggested. Do you know if that's possible?

@csharrison - Thanks! I think the next request would be "what conditions triggered this to be low confidence?".

We discussed something like this offline:

    cpuPressureState: { "nominal", "fair", "serious", "critical" }
    thermalsPressureState: { "nominal", "fair", "serious", "critical" }
    isColdStart: { true, false }
    userAgentPressureState: { "nominal", "fair", "serious", "critical" }
    gpuPressureState: { "nominal", "fair", "serious", "critical" }
    epsilon: <float>

This isn't an exhaustive list of conditions (https://github.com/w3c/web-performance/wiki/Nice-things-we-can%27t-have), and it already has a fairly high flip probability with RR. I'm concerned there may not even be enough data in a particular bucket to successfully debias the data, given the flip rate. I'm not familiar with the ins and outs of the more complex local differential privacy algorithms though, so open to ideas :)

One other idea I had, that might be a simpler(?) approach to answer why the confidence rating is low, is something like:

enum ConfidenceReason  {
    coldStart,
    cpuPressure,
    thermalsPressure,
    userAgentPressure,
    gpuPressure,
}

and then that could be exposed via

    sequence<ConfidenceReason> confidenceReasons;

But we'd probably want to cap the number of reasons returned, and maybe limit it being non-empty to only low confidence cases to help reduce the flip probability.

csharrison commented 3 months ago

@mwjacksonmsft thanks. One other question: let's say we expand this interface to support more data types - would we expect some users to still prefer a higher accuracy single confidence number rather than the more granular data you describe?

If so, we may want to consider a "query" type API, where rather than having some static fields on PerformanceNavigationTiming, we have a new method which allows the caller to e.g. query either the confidence bit, or the full conditions list you outlined, or both. This may allow some use-cases to get higher accuracy in return for coarser data.

For the two extensions you mentioned:

I would also suggest considering an even simpler version of sequence<ConfidenceReason> that just emits a single confidence reason (if any), and picks one at random if multiple exist. This would only have 6 output states.
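A sketch of that single-reason variant as k-ary randomized response over the 6 output states (illustrative only; names follow the strawman enum above):

```javascript
// k-ary randomized response over the 5 strawman reasons plus "none".
const REASON_STATES = [
  'none', 'coldStart', 'cpuPressure',
  'thermalsPressure', 'userAgentPressure', 'gpuPressure',
];

function noisyConfidenceReason(trueState, epsilon) {
  const k = REASON_STATES.length;
  // Keep the truth with probability e^eps / (e^eps + k - 1); otherwise
  // report one of the other k - 1 states uniformly at random.
  const keepProb = Math.exp(epsilon) / (Math.exp(epsilon) + k - 1);
  if (Math.random() < keepProb) return trueState;
  const others = REASON_STATES.filter((s) => s !== trueState);
  return others[Math.floor(Math.random() * others.length)];
}
```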

In any case, I have a colab showing how variance will change under both randomized response (k-RR) and the more advanced RAPPOR algorithm as the number of dimensions increases (with epsilon = ln 3, and the number of dimensions/outputs on the x axis). [chart]

It would not surprise me if in this high privacy regime, the utility will be bad as we increase dimensions from binary (which has variance < N).

mwjacksonmsft commented 3 months ago

This issue came up for discussion today in the WG. To clarify my statement: I don't have an immediate ask from anyone for this, but I can see it being the next request. I'm reaching out to our customers to get their input.

What factors would need consideration if we were to extend this at a later point in time?

For example, you mentioned in the colab that "The other downside of RAPPOR is that in the epsilon=ln(3) range, RAPPOR underperforms k-RR until k>5." If we picked RR now, would it be reasonable to switch the algorithm out when k > 5, if that were expressed in some form in the API?

csharrison commented 3 months ago

@mwjacksonmsft I think the primary factor is just dealing with a breaking change. E.g. if we move from randomized response to RAPPOR, everyone will need to update their code to deal with a new format / debiasing strategy. To mitigate this, we could try to make the API forward-compatible with algorithm changes, but that increases complexity. LMK if it makes sense.

mwjacksonmsft commented 3 months ago

@csharrison Could you elaborate on how the API shape might need to change to be forward compatible? I'd imagine that the debiasing strategy would be the more problematic aspect. Or am I missing something?

csharrison commented 3 months ago

There is a spectrum of breakage:

  1. The flip probability of confidence changes. This is a small issue but resolvable by downstream systems with maybe a single line of code update.
  2. The debiasing strategy of confidence changes. This might be the case if, for example, we change confidence to sometimes flip more from high to low than we do from low to high. To resolve this we may need to introduce more information about how confidence is being noised.
  3. We could just always set confidence to high (or low) but officially deprecate it in favor of a new mechanism.
  4. We could remove confidence in favor of a different mechanism entirely, which could break JS

I think (3) and (4) are probably the worst, so let me give you an example of how we could get there. Imagine we do some research and it turns out there are some use-cases that want to capture any ConfidenceReason, but some use-cases that really just care about coldStart. Now, we will always have pressure to reduce the noise, so advocates of just querying coldStart ask to use all the privacy budget for the other signals (confidence, ConfidenceReason) to just query coldStart by itself, and get minimal noise.

This is a reasonable request! However, because we supply confidence directly on PerformanceNavigationTiming, it isn't an opt-in API; everyone just gets it automatically. This makes it difficult for a caller to explicitly say they don't want it because they want to use the scarce privacy budget on a more tightly scoped query.

A possible alternative could be a dynamic method on PerformanceNavigationTiming like querySensitiveAttribute('confidence',...) which would do the privacy mechanism on-demand, and allow for more flexibility if we offered more data, more algorithms, etc. Maybe overkill, but worth thinking about if we're excited about future extensibility.
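A sketch of how such an on-demand query could budget privacy across calls; every name here is hypothetical:

```javascript
// Strawman only: each query spends from a per-document privacy budget,
// so a caller can put all of it into one tightly scoped attribute.
class SensitiveAttributeBudget {
  constructor(totalEpsilon) {
    this.remaining = totalEpsilon;
  }
  // noisyValueFn stands in for the UA's actual noising mechanism.
  query(attribute, epsilon, noisyValueFn) {
    if (epsilon > this.remaining) return null; // budget exhausted
    this.remaining -= epsilon;
    return noisyValueFn(attribute, epsilon);
  }
}
```

Under this shape, a caller who only cares about coldStart could spend the whole budget on it at minimal noise, rather than receiving a pre-noised confidence field it never reads.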

nicjansma commented 3 months ago

Note: this was discussed on the June 6 2024 W3C WebPerf Working Group call, minutes here.

There was some discussion on what "low confidence" means and a request to discuss some of the use-cases further.

mwjacksonmsft commented 3 months ago

@csharrison Thanks. I've heard back from our customers. As long as the contributing factors remain outside of their control, they couldn't immediately think of a need to know why the confidence value was low.

Regarding the example you highlighted, we need to think through if/how those values would be exposed via toJSON.

mwjacksonmsft commented 3 months ago

I've pushed an updated version of the explainer that includes the randomizedTriggerRate field.

csharrison commented 3 months ago

@mwjacksonmsft sounds good. If we're reasonably confident we can stick with just a confidence field for the time being, I am happy with the proposal as is (with the addition of randomizedTriggerRate). I do think that, for the privacy level we are considering, this will be best for callers vs. trying to get more data with more noise.

yoavweiss commented 3 months ago

Following the discussion here and looking at the explainer, I think we have two alternative API shapes.

NavigationTiming entry attribute

Adding a PerformanceNavigationTimingConfidence attribute on the performance timeline. To address @mmocny's concerns, we could either name the attribute something like randomizedConfidence or make sure that the internal value is named something like randomizedValue.

The pro of this approach is that we'd be attaching the value to the NavigationTiming timeline itself, making it clearly about navigation timing. It's also simple and discoverable.

The cons:

performance.getConfidence(["navigation"], {})

A direct API that provides the confidence signal.

The pro of that approach is that the API would enable developers to provide parameters, and would enable us to evolve it over time (e.g. change the algorithm, provide confidence for other metrics, etc).

The con is that it'd be less discoverable.

I think it all boils down to how likely we think it is that we'd expand this signal over time and beyond navigation loading times. I'd love opinions on that front.

mwjacksonmsft commented 3 months ago

The second proposal is very similar to Add new Type value for performance.getEntriesByType

If we were to pursue that approach, and we expanded this signal beyond navigation load times, there needs to be some way of correlating the confidence value returned from the new API with the corresponding performance object. Two examples that come to mind:
1) Both could contain a unique identifier (e.g. the document name)
2) We could pass the object into the API itself (performance.getConfidence(performance.getEntriesByType('navigation')[0]))

As randomized confidence data is only useful in aggregate, I expect developers will need to bundle it up with the original object and send both for backend processing. Something like:

let navObj = performance.getEntriesByType('navigation')[0];
let confObj = performance.getConfidence(navObj);
navObj['confidence'] = confObj;
// Send to server

ear-dev commented 2 months ago

We are also frequently seeing bimodal distributions when analyzing website performance, and have started tracking some of these headers, which has given us more detailed information about “server think time”...

Has anyone else explored these to see if they are part of the reason for this behavior, and thought about how to include them in any 'confidence' signal?

This site is a good example: https://www.webpagetest.org/result/240624_AiDcCE_AR0/ Across three runs you can see these values in the root response headers:

And subsequently divergent FCP values, which seem to be correlated. [screenshot]

mwjacksonmsft commented 2 months ago

Hi @ear-dev - The proposal has been mostly focused on factors that impact the user agent, so I hadn't considered server-side timings in this proposal.

In local testing, I do see these reflected in the `serverTiming` payload in performance.getEntriesByType("navigation")[0]:

{
    "name": "cfRequestDuration",
    "duration": 490.999937,
    "description": ""
}

Does this information meet your needs to determine if the page is slow due to "server think time"?

mwjacksonmsft commented 2 months ago

@yoavweiss I ended up building a couple of different prototypes to test out these options.

The first one attaches the confidence value to the PerformanceNavigationTiming object, this is what the explainer describes.

The second one uses a getConfidence method to return the confidence values for a given entry type.

The third one builds upon the first and allows a dictionary object to be passed via getEntriesByType. If the dictionary is not passed, then the confidence field returns null, otherwise it returns the expected value.

I think there are two main concerns I have with the second option. Firstly, developer ergonomic concerns - the data isn't useful locally, and the only thing you can do is bundle it up for backend processing. Secondly, providing too many configuration options could potentially introduce privacy concerns if called multiple times with different parameters. Admittedly it's hard to quantify that without a more concrete proposal. This second concern is equally applicable to the third option.

yoavweiss commented 1 month ago

Apologies for my slowness.

I think that the main difference between 1 and 2/3 is related to future extensibility and ergonomics.

From my perspective option 2 is more flexible; e.g. we could extend it in the future to have different confidence levels per entry (if the page started loading under stress but then calmed down later, we'd be able to express that if we so wished). The main question is whether that flexibility comes at a cost of ergonomics and/or discoverability.

I think there are two main concerns I have with the second option. Firstly, developer ergonomic concerns - the data isn't useful locally, and the only thing you can do is bundle it up for backend processing.

Sure, but that's also true for NavigationTiming. The data is only useful in aggregate, but with (1) if we want to know the confidence value of a specific entry, we'd need to inspect its relative NavigationTiming entry, which feels less ergonomic somehow.

Secondly, providing too many configuration options could potentially introduce privacy concerns if called multiple times with different parameters.

I wouldn't expect the confidence level to change when inspected multiple times. Would it?

noamr commented 1 month ago

@mmocny - Thanks for that feedback. I can update the proposal to ensure we capture the triggerRate as part of the API shape. I don't see anything in the webidl spec that describes exactly what you suggested. Do you know if that's possible?

@csharrison - Thanks! I think the next request would be "what conditions triggered this to be low confidence?".

We discussed something like this offline:

    cpuPressureState: { "nominal", "fair", "serious", "critical" }
    thermalsPressureState: { "nominal", "fair", "serious", "critical" }
    isColdStart: { true, false }
    userAgentPressureState: { "nominal", "fair", "serious", "critical" }
    gpuPressureState: { "nominal", "fair", "serious", "critical" }
    epsilon: <float>

This isn't an exhaustive list of conditions (https://github.com/w3c/web-performance/wiki/Nice-things-we-can%27t-have), and it already has a fairly high flip probability with RR. I'm concerned there may not even be enough data in a particular bucket to successfully debias the data, given the flip rate. I'm not familiar with the ins and outs of the more complex local differential privacy algorithms though, so open to ideas :)

One other idea I had, that might be a simpler(?) approach to answer why the confidence rating is low, is something like:

enum ConfidenceReason  {
    coldStart,
    cpuPressure,
    thermalsPressure,
    userAgentPressure,
    gpuPressure,
}

and then that could be exposed via

    sequence<ConfidenceReason> confidenceReasons;

What would be the outcome of exposing these? What's the action a web author can take when "the confidence in the navigation entry is low because of thermal pressure"?

mwjacksonmsft commented 1 month ago

What would be the outcome of exposing these? What's the action a web author can take when "the confidence in the navigation entry is low because of thermal pressure"?

@noamr I connected with our customers about this. Their feedback was that as long as the contributing factors remain outside of their control, they couldn't think of a need to know why the confidence value was low. Consequently, I've dropped this from the proposal.

mwjacksonmsft commented 1 month ago

@yoavweiss -

I think that the main difference between 1 and 2/3 is related to future extensibility and ergonomics.

From my perspective option 2 is more flexible; e.g. we could extend it in the future to have different confidence levels per entry (if the page started loading under stress but then calmed down later, we'd be able to express that if we so wished). The main question is whether that flexibility comes at a cost of ergonomics and/or discoverability.

Is the suggestion that this API might return more than one entry per type or that we'd update the existing entries as new information became available? Or something else?

I think there are two main concerns I have with the second option. Firstly, developer ergonomic concerns - the data isn't useful locally, and the only thing you can do is bundle it up for backend processing.

Sure, but that's also true for NavigationTiming. The data is only useful in aggregate, but with (1) if we want to know the confidence value of a specific entry, we'd need to inspect its relative NavigationTiming entry, which feels less ergonomic somehow.

In the prototype I built, I ended up with a preference for (1) for two reasons:
1) The value returned by (2) needs to be used with performance entries to be meaningful.
2) If this were to be extended to other performance entry types, the existing observer patterns continue to work.

However, I could see (2) being updated to take an entry instead of an entry type, which addresses those concerns. Perhaps something like:

let entries = performance.getEntriesByType("navigation");
let [confidence] = performance.getConfidenceForEntries(entries);

WDYT?

Secondly, providing too many configuration options could potentially introduce privacy concerns if called multiple times with different parameters.

I wouldn't expect the confidence level to change when inspected multiple times. Would it?

@csharrison mentioned this:

A possible alternative could be a dynamic method on PerformanceNavigationTiming like querySensitiveAttribute('confidence',...) which would do the privacy mechanism on-demand, and allow for more flexibility if we offered more data, more algorithms, etc.

I was expressing a concern that if we allowed this, it might result in less privacy if called multiple times requesting different sensitive attributes.

noamr commented 1 month ago

I actually think the idea to hang this on the observer is the most consistent. Also, for navigation timing, reading this value before the load event might mean that it can still change (and perhaps the confidence value can change as well?), which makes the observer a better candidate than performance.get*.

Something like: observer.observe({type: "navigation", metadata: ["confidence"], buffered: true}); or some such

mwjacksonmsft commented 1 month ago

@noamr Were you thinking that the PerformanceObserverEntryList would have a getConfidenceEntries (or similarly named) method?

noamr commented 1 month ago

@noamr Were you thinking that the PerformanceObserverEntryList would have a getConfidenceEntries (or similarly named) method?

No, I think that once you explicitly opted in to this in the observer, we can simply add confidence or some such on the regular timing entry.

mwjacksonmsft commented 1 month ago

@noamr Thanks for clarifying. I imagine in that case, if the developer makes the call let [entry] = window.performance.getEntriesByType('navigation'); before using an observer, then entry.confidence would return null.

However, if they held onto entry, and then called the observer, then entry.confidence would return the confidence value.

e.g.

let [entry] = window.performance.getEntriesByType('navigation');

// entry.confidence returns null here

const observer = new PerformanceObserver((list, obj) => {
  list.getEntries().forEach((entry) => {
    console.log(entry.confidence);
  });
});

observer.observe({type: "navigation", metadata: ["confidence"], buffered: true});

// entry.confidence returns a PerformanceTimingConfidence object.

Does that align with how you were thinking about it?

Here is a prototype of the API changes: 5766476: Prototype implementation confidence from observer | https://chromium-review.googlesource.com/c/chromium/src/+/5766476

noamr commented 1 month ago

Does that align with how you were thinking about it?

Need to think about exact API names but this is the direction I was thinking about, yes.

yoavweiss commented 1 month ago

The downside of having this only be available to performance observers is that it'd be impossible to collect this data for navigations that never make it to their load event.

noamr commented 1 month ago

The downside of having this only be available to performance observers is that it'd be impossible to collect this data for navigations that never make it to their load event.

Can the confidence level change between receiving the response headers and the load event?

Also, is this planned to be exposed in iframes?

mwjacksonmsft commented 1 month ago

Can the confidence level change between receiving the response headers and the load event?

It seems unlikely, but I'm not sure. The use case for our customers is for navigation that occurs during a user agent cold launch. We've discussed future potential factors such as extensions impact, or other system resource considerations (e.g. high cpu usage). https://github.com/w3c/web-performance/wiki/Nice-things-we-can%27t-have

Also, is this planned to be exposed in iframes?

I don't have a preference if this is exposed within iframes or not. For SystemEntropy, we did decide it shouldn't be exposed within iframes, but that was without any privacy protections.

mwjacksonmsft commented 4 weeks ago

@noamr Upon re-reviewing the data we collected, the data suggests that most of the randomness that occurred was between navigationStart and responseEnd. There didn't appear to be much variation after domLoading. However, the caveat to that is this data was narrowly collected for user agent launch scenarios.

csharrison commented 1 week ago

I have one more small suggestion for this proposal regarding how developers deal with noise. There is a bit of a "foot gun" in how developers split and aggregate data based on the noisy confidence bit, given that the noise mechanism is biased. In the slides I presented to the group I have a formula to debias an aggregate, but it requires keeping the epsilon parameter alongside records, and is an additional, somewhat non-trivial server-side step.

One simplification for developers would be for the platform to debias each report individually. This could look like exposing a noisy confidence field along with an unbiased_high_confidence_count with each navigation. Here is how it would work:

Imagine we have an epsilon = ln(3) as our privacy parameter.

These numbers come from the formula $f(x) = \frac{x - p/2}{1-p}$, where $x$ is 1 for high confidence and 0 otherwise, and $p$ is the probability that the reported bit was randomized.

If you have a slice of records and you want to count how many are high confidence, you can just sum each record's unbiased count, without any other math, and you will get an unbiased estimate of the total. For a histogram breakdown, let the mass of each record in the histogram be its unbiased count, and so on.
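Putting the per-record scheme into code (a sketch; the function name is made up, and p = 0.5 corresponds to epsilon = ln(3)):

```javascript
// Per-record debiasing: p is the probability the confidence bit was
// replaced by a fair coin flip. At epsilon = ln(3), p = 0.5, giving
// 1.5 for a "high" report and -0.5 otherwise.
function unbiasedHighConfidenceCount(noisyConfidence, p) {
  const x = noisyConfidence === 'high' ? 1 : 0;
  return (x - p / 2) / (1 - p);
}

// Summing these over a slice of records gives an unbiased estimate of
// how many records in the slice are truly high confidence.
```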

noamr commented 1 week ago

@noamr Upon re-reviewing the data we collected, the data suggests that most of the randomness that occurred was between navigationStart and responseEnd. There didn't appear to be much variation after domLoading. However, the caveat to that is this data was narrowly collected for user agent launch scenarios.

OK, I guess the information that feeds into confidence (e.g. cold start) is known when the document is created. Still, I think we should figure out whether this feature should be available in iframes, to avoid a situation where multiple iframes are created to track changes in confidence continuously and reduce the noise.

mwjacksonmsft commented 1 week ago

OK, I guess what we know that feeds into confidence (e.g. cold start) is known when the document is created. Still, I think we should figure out if this feature should be available in iframes, to avoid a situation where multiple iframes are created to try to track changes in confidence continuously to reduce the noise.

I'm comfortable with returning null for the confidence attribute within iframes. I don't believe that, in its current form, this could be used to track changes in confidence in real time. I'd prefer to start with a scoped change and expand once we've had a chance to assess any additional privacy risk. WDYT?