erikrose opened 5 years ago
I've been talking with a number of Firefox/performance engineers (Gijs, Emilio, Rob, Greg) about this and have some useful information to share.
Most of the conversations centered around improving Fathom as-is (blocking the main thread) in Price Tracker.
TL;DR: `isVisible`, which will likely be a common rule in many Fathom applications, accounts for the majority (67%) of the 460 ms of Fathom-related jank. It may be possible to reduce this overall Fathom jank by as much as 374 ms (81%) by reducing style and DOM property accesses in `isVisible`, but this solution requires a privileged context.
This case study helped me to develop some general performance strategies shared below.
- Schedule work with `requestIdleCallback` (see the note below).
- Avoid triggering style and layout flushes (e.g. via `getComputedStyle` or `getBoundingClientRect`) by using the Intersection Observer API and/or `promiseDocumentFlushed`.
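As a sketch of the Intersection Observer strategy, a one-shot visibility check might look like the following (assuming "visible" means "currently intersects the viewport"; `isVisibleAsync` is a hypothetical name, not part of Fathom):

```javascript
// Hypothetical sketch: check visibility without forcing a synchronous layout
// flush. IntersectionObserver delivers geometry asynchronously, after layout
// has already been computed for the frame.
function isVisibleAsync(element) {
  return new Promise((resolve) => {
    const observer = new IntersectionObserver((entries) => {
      observer.disconnect(); // one-shot: we only want the initial report
      resolve(entries[0].isIntersecting);
    });
    observer.observe(element);
  });
}
```

Note the trade-off discussed throughout this thread: the result arrives asynchronously, so a ruleset consuming it must itself be async.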
- Reader mode actually serializes the DOM and ships it off to a worker where it gets reparsed, but that loses style information and is also slow.
- You could use a separate process, but then you'd have to do a second load of the page in that other process, which would probably make things even slower, and there's no guarantee the result would be the same.
- Accessing DOM and layout information off-main-thread is not trivial. In the style system, we access DOM information off-main-thread, but only while the main thread is paused.
Can we optimize `isVisible` for big performance wins in Firefox applications?

I'd note that it should also be possible to push the CSS Working Group to give something like `elementFromPoint{s}` a set of options to ignore the viewport clip or similar. That should make it work everywhere. Apparently there are a few old requests for that: https://lists.w3.org/Archives/Public/www-style/2012Oct/0683.html
cssom-view is unmaintained atm, but I'd be happy to help out there.
I filed https://github.com/w3c/csswg-drafts/issues/4122 to try to standardize something that would've helped here.
Per Emilio:
In practice, a good rule of thumb is something like "if there's something that depends on up-to date layout information, that flushes layout". Same for style and paint.
It is also pretty easy to test. Create a big page (or open one, like https://html.spec.whatwg.org/), and write a loop like:

```js
var start = performance.now();
for (var i = 0; i < 1000 /* insert/remove zeros as needed */; i++) {
  document.documentElement.style.display = i % 2 == 0 ? "none" : "";
  theApiYouWantToTest();
}
console.log(performance.now() - start);
```

If `theApiYouWantToTest()` flushes layout, you'll see it takes massively longer than if it doesn't. Compare putting something simple like `document.documentElement.style.color` (which doesn't need to update the style of the page) with `document.documentElement.getBoundingClientRect()` (which updates layout).

Note that the `document.documentElement.style.display = i % 2 == 0 ? "none" : "";` line is there to ensure that a layout style changes in each iteration. This makes the flush as expensive as possible and makes the time difference between something that does and doesn't trigger a flush very apparent.
In reality, it's possible to call something like `getBoundingClientRect` in a way that causes a flush without any layout styles (like the `display` value) having changed, so the cost of the flush is much reduced by an early return. That may be the case in `isVisible`, and may be why the jank was dominated by XRays rather than layout work (see the last section).
As noted here, the original, sync implementation of `isVisible` in Price Tracker caused the majority of Fathom-related jank on a sample page. What I discovered yesterday is that, if I only change when `isVisible` is executed (i.e. run it asynchronously immediately after a paint, so as not to trigger unnecessary layout flushes), there was a 42% reduction in Fathom-related jank! This improvement would be on top of any performance improvements to `isVisible` itself.
One less-than-ideal option is to add a new, async pre-processing step to Fathom that runs before the ruleset is executed. This step would only run if Fathom's `isVisible` function is being used in the ruleset.

The best option, however, is for Fathom itself to be made async. Something like:

```js
const results = await rules.against(document);
```

...and inside the ruleset, where it uses `isVisible`, its execution would pause until it could be run right after a paint.
Making Fathom async will enable further concurrency (see item 3) as well.
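That pause-until-right-after-a-paint behavior can be sketched with the unprivileged `requestAnimationFrame` + `setTimeout` trick (note that `runRuleset` and the `isVisible` parameter here are hypothetical stand-ins, not Fathom's real API):

```javascript
// requestAnimationFrame fires just before the next paint; a zero-delay
// setTimeout queued from inside it runs just after the paint, when layout
// is fresh and reading it won't force an extra flush.
function afterNextPaint() {
  return new Promise((resolve) =>
    requestAnimationFrame(() => setTimeout(resolve, 0))
  );
}

// Hypothetical async runner: defer visibility checks until after a paint.
async function runRuleset(nodes, isVisible) {
  await afterNextPaint();
  return nodes.filter(isVisible);
}
```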
@erikrose , Should we break out "Make Fathom async" into a separate issue for discussion?
Yes, please. It seems to me it should be possible to take a middle-of-the-road approach as well: call the existing synchronous Fathom routines in a requestAnimationFrame() callback, thus calling geometry-using routines like isVisible() at the optimal time without requiring a rewrite of the Fathom execution machinery. Correct?
On the same subject, I do notice that requestAnimationFrame() itself probably ceases to call its callbacks on background or otherwise invisible tabs. Whether this is a problem depends on the application, but it's something to keep in mind.
I filed a Performance Review Request[1] outlining the high level details for the Fathom/Smoot project, which is expected to be the first Firefox application of Fathom, and Erik and I met with dothayer from the performance team last week.
`promiseDocumentFlushed` (or the unprivileged `requestAnimationFrame`/`setTimeout` workaround) monitors style changes for the current frame, and it is possible to do all of Fathom's synchronous work in a single `promiseDocumentFlushed` callback.
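For privileged code, that single-callback idea might look like this sketch (`collectLayoutFacts` is a hypothetical helper; `win` is a chrome window exposing `promiseDocumentFlushed`; note that such a callback must not itself dirty style or layout):

```javascript
// Batch all of the layout reads into one promiseDocumentFlushed callback.
// The callback fires when no style/layout flush is pending, so the reads are
// cheap, and because it runs synchronously on the main thread, nothing can
// dirty layout between reads.
function collectLayoutFacts(win, nodes) {
  return win.promiseDocumentFlushed(() =>
    nodes.map((node) => ({
      node,
      rect: node.getBoundingClientRect(), // safe: layout is already clean
    }))
  );
}
```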
- The post-paint scheduling from the `isVisible` experiments (second section) is unnecessary here. This is because when the callback is executing the synchronous code, it is running on the main thread, and it cannot be interrupted (say, by something else that triggers another flush) until it completes.
- The `promiseDocumentFlushed` optimization only eliminates flushes, which is probably not the main performance problem. The main problems are more likely to be XRay work and the fact that Fathom will likely take too long to execute in one synchronous chunk. Therefore, a separate and more challenging optimization would be to amortize the work; for this, `requestIdleCallback` seems best suited.
- The difference between `promiseDocumentFlushed` and `requestIdleCallback` is that the former would only eliminate unnecessary layout flushes from the Fathom work; the latter actually executes the code at idle times in the browser.
- The code inside a `requestIdleCallback` callback needs to be sync, as this ensures the code is executed in the allotted time. A `setTimeout` (or other async call) inside of it would execute outside of the allotted idle time.
- `promiseDocumentFlushed` is with respect to the current frame, so if it is being called from a subframe, it doesn't look all the way up to the parent frame. This means that, if it were in a subframe, it wouldn't be able to tell us when the parent frame has just flushed.
- Another option is `windowUtils.needsFlush`; however, this suffers from the same limitations with respect to frame boundaries as `promiseDocumentFlushed`. Additionally, it will always return `true` if, say, the page has a continuous animation.
- A timeout can be passed to `requestIdleCallback` to ensure that, even if the page is very busy, the code will run after a certain period of time has passed.
- In short: use `requestIdleCallback`.
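The amortization idea can be sketched as follows (the node list and `scoreNode` are hypothetical stand-ins for real ruleset work):

```javascript
// Process as many nodes as fit in the current idle period. Kept separate
// from the scheduling so the chunking logic is easy to test.
function drainWhileIdle(nodes, start, deadline, scoreNode) {
  let i = start;
  // didTimeout means the timeout forced the callback to fire; finish the
  // remaining work rather than waiting for idle time that may never come.
  while (i < nodes.length && (deadline.timeRemaining() > 0 || deadline.didTimeout)) {
    scoreNode(nodes[i++]);
  }
  return i;
}

// Driver: reschedule until every node is scored. The timeout option
// guarantees progress even on a page that is never idle.
function scoreInIdleChunks(nodes, scoreNode, done) {
  let i = 0;
  function work(deadline) {
    i = drainWhileIdle(nodes, i, deadline, scoreNode);
    if (i < nodes.length) requestIdleCallback(work, { timeout: 500 });
    else done();
  }
  requestIdleCallback(work, { timeout: 500 });
}
```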
In following the Recommended Plan[2], here are the next steps:
- Add a `promiseDocumentFlushed` callback (or `requestAnimationFrame` with a `setTimeout` workaround for non-chrome code) for every http(s):// page, which runs a dummy, minimal Fathom ruleset against the DOM.
The Performance discussion was based largely on these restricted access documents on Mozilla's Google Drive:
Use `performance.now()` to measure how long Fathom is running (and/or `./mach browsertime`); then we can do a more thorough analysis.

A few other bits and pieces:

- There's also `mach browsertime`, but `performance.now()` should be good enough for starters. Speed Index and browsertime are better metrics than jank: you could hit "jank" while the browser is waiting for network IO, meaning that jank doesn't matter.

Here is a record of my latest notes on next steps for performance, since I was moved off the Fathom team.
References:
How do we know when Fathom is “fast enough”?
The performance team would need to run pageload tests with Fathom running, and get a set of numbers for how it regresses pageload on various pages under various conditions. At that point we would need some stakeholder (Eric Smythe?) to make a call as to whether that regression is acceptable or not.
What to do next:
- Profile ReaderMode/`isProbablyReaderable` in Firefox as a baseline (this is considered "acceptable"). Use a site with a path in its URL, per below: we have domain blocklists and never run readability on homepages.
- Make a branch off `erikrose/gecko-dev`'s `smoot-demo` branch, called `fathom-perf`, to try out an "empty ruleset" (using `promiseDocumentFlushed`, or `setTimeout` inside a `requestAnimationFrame` for non-chrome docs, instead of `DOMContentLoaded`, for running the ruleset).

Profile ReaderMode/`isProbablyReaderable` in Firefox as a baseline
Context
(Reference: Abbreviated Fathom / Smoot Perf Recommendations) Per Gijs, much of Reader Mode's work is done off the main thread, except for `isProbablyReaderable`. Mconley suggested we use this work as a baseline. Note: Reader Mode doesn't even run on home pages, so we need to choose a URL with a path.
Plan
- Use a local server (`fathom-serve`) to host a sample page locally. This removes network and other non-reproducible effects.
- Use `erikrose/gecko-dev`'s `master` branch to profile Reader Mode's `isProbablyReaderable` method.

80% of Fathom's time is spent calling DOM routines. Zibi was telling me that Fluent had the same problem and solved it by turning to DOM bindings, lowering its DOM accesses to direct C++ calls rather than going through the JS layer, which requires the runtime generation of reflection objects (different from XRays, which are for insulating content scripts from the page's monkeypatching). We could have Fathom compile rulesets to Rust. Or we could at least compile the parts which do DOM access, run them all at once up front, and ship their results back to JS. Zibi says the communication is fairly expensive. Lots of design space to explore here, obviously.
Victor got a 10x speed improvement by stubbing out getComputedStyle C++-side. He had suspicions that the time was largely going into flushes, but we lack evidence of it. Flushes don't show up in the flame graph. There is a fair amount of XRay overhead. 6% goes to `mozilla::ServoStyleSet::ResolveStyleLazily`, which could be cured by Emilio's `display:none` patch. Another 6% goes to `xpc::XrayWrapper::getOwnPropertyDescriptor`.
dthayer ported the entirety of isVisible() to C++ and got a 10x speedup based on running it over every node of an Amazon page. This is on top of his using versions of routines that avoid flushes. (The lack of node pruning in this experiment might offset the fact that Pricewise rulesets were only 67% isVisible() and new-password ones only 17%.)
See also this performance work: https://bugzilla.mozilla.org/show_bug.cgi?id=1709171.
Because it needs access to the DOM, Fathom currently wants to run on the main thread. Unless run in response to a user action, it can create a little jank, taking upwards of 40ms to run a ruleset, so we dare not run it, for example, on every page load. Can we speed it up or find a way to run it offthread?
One approach is to make Fathom run faster. About 80% of its runtime on the Pricewise ruleset is spent in DOM routines. Those do a lot of flushing of layout and other pipeline stages, redoing calculations unnecessarily. Is this a major source of wasted time? Measure. Are there lower-level hooks we can use? (`window.windowUtils.getBoundsWithoutFlushing()` might be a faster way of getting element size, for example; mattn suggested it. Also see https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Performance_best_practices_for_Firefox_fe_engineers, which has, for instance, routines to get the window size and scroll position without flushing things.) Other ideas?

Can we run Fathom offthread without losing access to too much signal? Reader Mode currently serializes the markup (only) and ships it offthread to parse. Could we do something like that but also apply CSS ourselves offthread? Would that preserve enough signal for most rulesets? Would it be too slow or battery-hungry on 2-core mobile devices?
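The `getBoundsWithoutFlushing` idea could be wrapped like this (a sketch for privileged code; `getBoundsCheap` is a hypothetical name, and the fallback path is for contexts without `windowUtils`):

```javascript
// In chrome (privileged) code, element.ownerGlobal.windowUtils exposes
// getBoundsWithoutFlushing, which returns the last-computed bounds instead
// of forcing a layout flush. Fall back to getBoundingClientRect elsewhere.
function getBoundsCheap(element) {
  const utils = element.ownerGlobal && element.ownerGlobal.windowUtils;
  return utils
    ? utils.getBoundsWithoutFlushing(element)
    : element.getBoundingClientRect();
}
```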
This bug is done when we can blithely run a Fathom ruleset on every Firefox page load without concern for dragging down the UX.