Generic model for causal attribution (non-sampling based)

patrickhulce commented 3 years ago

Several issues and previous design documents have demonstrated a need for developers to be able to identify the work behind a long task in order to fix it. Proposals thus far have mostly focused on what script is currently being executed, as opposed to what script is ultimately responsible for the long task occurring in the first place. I'd like to propose a generic model for attribution based on causality instead of sampling.

I discussed this proposal and the difference between these approaches at a WebPerfWG 2020 TPAC session (recording and slides available).

Brief Summary of Benefits:

- Provides unique insight not already available via the JS Sampling API.
- More intuitive starting point for developers to investigate on pages with varied authorship (easily identifies third-party sources).
- Does not require heavy intra-task bookkeeping.
- Proven track record for matching developer intuition in the Lighthouse project.

Very Rough Implementation Description:

Terminology:

initiating invocation: a specific invocation of a web API that schedules a new task
- Examples: setTimeout, fetch, addEventListener
causal task: the task that was ultimately responsible for another task's existence
The causal task of any given task is the result of traversing the tree of initiating invocations until a task is reached that was the initial evaluation of a script resource with a URL.
- Generate a numeric identifier for each main-thread task and initiating invocation
- Maintain a map of initiating invocation ID to the causal task ID
- When a task scheduled by an initiating invocation later runs, associate any new initiating invocations it makes with the same causal task ID as the current task.
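
The bookkeeping described above could be sketched roughly as follows. This is only an illustration of the ID propagation, not a browser implementation: the class `CausalAttributionTracker`, its method names, and the example URL are all hypothetical, and real instrumentation would live inside the browser's scheduling code rather than in page script.

```typescript
type TaskId = number;
type InvocationId = number;

// Hypothetical tracker: propagates a causal task ID through chains of
// initiating invocations (setTimeout, fetch, addEventListener, ...).
class CausalAttributionTracker {
  private nextId = 1;
  // initiating invocation ID -> causal task ID
  private causalTaskByInvocation = new Map<InvocationId, TaskId>();
  // task ID -> causal task ID
  private causalTaskByTask = new Map<TaskId, TaskId>();
  // causal task ID -> URL of the script whose initial evaluation it was
  private scriptUrlByCausalTask = new Map<TaskId, string>();

  // The initial evaluation of a script resource is its own causal task.
  startScriptEvaluationTask(scriptUrl: string): TaskId {
    const taskId = this.nextId++;
    this.causalTaskByTask.set(taskId, taskId);
    this.scriptUrlByCausalTask.set(taskId, scriptUrl);
    return taskId;
  }

  // Called when the currently running task makes an initiating invocation.
  recordInitiatingInvocation(currentTask: TaskId): InvocationId {
    const causal = this.causalTaskByTask.get(currentTask);
    if (causal === undefined) throw new Error(`unknown task ${currentTask}`);
    const invocationId = this.nextId++;
    this.causalTaskByInvocation.set(invocationId, causal);
    return invocationId;
  }

  // Called when a task scheduled by an initiating invocation starts running.
  startScheduledTask(invocationId: InvocationId): TaskId {
    const causal = this.causalTaskByInvocation.get(invocationId);
    if (causal === undefined) throw new Error(`unknown invocation ${invocationId}`);
    const taskId = this.nextId++;
    this.causalTaskByTask.set(taskId, causal);
    return taskId;
  }

  // Resolve any task to the script URL of its causal task, if still known.
  attributedScriptUrl(taskId: TaskId): string | undefined {
    const causal = this.causalTaskByTask.get(taskId);
    return causal === undefined ? undefined : this.scriptUrlByCausalTask.get(causal);
  }
}

// Usage: script evaluation -> setTimeout -> fetch callback.
const tracker = new CausalAttributionTracker();
const evalTask = tracker.startScriptEvaluationTask("https://third-party.example/widget.js");
const timeoutCall = tracker.recordInitiatingInvocation(evalTask);
const timeoutTask = tracker.startScheduledTask(timeoutCall);
const fetchCall = tracker.recordInitiatingInvocation(timeoutTask);
const fetchTask = tracker.startScheduledTask(fetchCall);
const attributed = tracker.attributedScriptUrl(fetchTask);
```

Because each new invocation copies the causal task ID from the task that made it, resolving a long task to its script is a single lookup rather than a walk back up the invocation tree.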

Questions

How can the Lighthouse project, or I personally, help support this effort? :)

npm1 commented 3 years ago

Several questions come to mind:

If we wanted to expose this to RUM, what would be a sensible security model to do so?
Is it possible to compute this efficiently? (I see you mention no heavy intra-task bookkeeping, not sure I follow).
Do we have customers lined up to try this? It sounds like this feature is complex enough that it may benefit from an Origin Trial in Chrome to ensure we get the right API shape and implementation.

Thanks for the great talk, by the way! It's exciting to see some novel ideas on the long-standing problem of longtasks attribution.

patrickhulce commented 3 years ago

> If we wanted to expose this to RUM, what would be a sensible security model to do so?

The brief justification for why this doesn't expose new information is that one could feasibly create pages that optionally include/exclude scripts from other origins and observe the delta in long tasks to identify which script caused which long task. The fact that this is burdensome is the problem developers have today (one must run expensive A/B testing in order to learn this information).

> Is it possible to compute this efficiently? (I see you mention no heavy intra-task bookkeeping, not sure I follow).

The intra-task comment is meant to highlight the low overhead of this approach in contrast to a sampling approach, where the overhead of implementing a sampling profiler in a JS engine is non-trivial. With this approach, only a select few JS APIs that already kick out to browser scheduling require lightweight instrumentation. A single unsigned int (maybe a long?) needs to be kept per top-level task until a chain is resolved, though I imagine allowances should be made to evict tasks in the far past to keep the memory impact low. The listener component is probably the heaviest part.
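
One way the eviction idea might look is a size-capped map that drops its oldest entries first. This is a sketch under assumptions: the class `BoundedCausalMap` and the cap of 3 are hypothetical, and it relies on JavaScript's `Map` iterating keys in insertion order. An evicted invocation simply becomes unattributable.

```typescript
// Hypothetical size-capped store of initiating invocation ID -> causal task ID.
class BoundedCausalMap {
  private entries = new Map<number, number>();
  private maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  set(invocationId: number, causalTaskId: number): void {
    // Delete first so a re-inserted key moves to the newest position.
    this.entries.delete(invocationId);
    this.entries.set(invocationId, causalTaskId);
    // Map iterates in insertion order, so the first key is the oldest.
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest === undefined) break;
      this.entries.delete(oldest);
    }
  }

  // Returns undefined for unknown or already-evicted invocations.
  get(invocationId: number): number | undefined {
    return this.entries.get(invocationId);
  }

  get size(): number {
    return this.entries.size;
  }
}

// Usage: with a cap of 3, recording a fourth invocation evicts the first.
const bounded = new BoundedCausalMap(3);
bounded.set(101, 1);
bounded.set(102, 1);
bounded.set(103, 2);
bounded.set(104, 2);
```

A long-lived listener would need different handling, since its invocation can stay relevant indefinitely, which may be part of why the listener component is the heaviest piece.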

> Do we have customers lined up to try this? It sounds like this feature is complex enough that it may benefit from an Origin Trial in Chrome to ensure we get the right API shape and implementation.

I agree an origin trial makes the most sense. I don't know of any immediate customers but I can ask around.

npm1 commented 3 years ago

Just wanted to ping this to ask: are you aware of any potential customers for this data? Also adding @spanicker as this seems related to their problems of finding the 'FID culprit'.

patrickhulce commented 3 years ago

I am not, though based on casual conversation I imagine RUM perf monitoring solutions might be interested. @spanicker might have more leads on where to go first with the FID attribution overlap :) 🤞

omriariav commented 3 years ago

@npm1 as a 3rd party vendor, we will find this useful to easily isolate our TBT (long tasks) impact and optimize it; this goes for both ad hoc fixes and constant monitoring over field data that we collect. I hope it helps.

npm1 commented 2 years ago

Would it be possible to implement this with your task tracking idea, @yoavweiss? How standardizable is that?

yoavweiss commented 2 years ago

> Would it be possible to implement this with your task tracking idea @yoavweiss ?

I believe so, but haven't prototyped that specifically.

> How standardizable is that

I'll need to think about it a bit, but in theory it seems like we could integrate with the event loop's task posting and keep track of ancestry there.

noamr commented 1 year ago

I'm currently prototyping this or something similar.

noamr commented 6 months ago

Closing this in favor of w3c/long-animation-frames

w3c / longtasks

Generic model for causal attribution (non-sampling based) #89