erikrose opened 5 years ago
I've been talking with a number of Firefox/performance engineers (Gijs, Emilio, Rob, Greg) about this and have some useful information to share.
Most of the conversations centered around improving Fathom as-is (blocking the main thread) in Price Tracker.
TL;DR: `isVisible`, which will likely be a common rule in many Fathom applications, accounts for the majority (67%) of the 460 ms of Fathom-related jank. It may be possible to reduce this overall Fathom jank by as much as 374 ms (81%) by reducing style and DOM property accesses in `isVisible`, but this solution requires a privileged context.
This case study helped me to develop some general performance strategies shared below.
- Schedule work with `requestIdleCallback` (see the note below).
- Avoid triggering style and layout flushes (e.g. via `getComputedStyle` or `getBoundingClientRect`) by using the Intersection Observer API and/or `promiseDocumentFlushed`.
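As a sketch of the Intersection Observer strategy, a one-shot visibility check might look like the following (assuming "visible" means "currently intersects the viewport"; `isVisibleAsync` is a hypothetical name, not part of Fathom):

```javascript
// Hypothetical sketch: check visibility without forcing a synchronous layout
// flush. IntersectionObserver delivers geometry asynchronously, after layout
// has already been computed for the frame.
function isVisibleAsync(element) {
  return new Promise((resolve) => {
    const observer = new IntersectionObserver((entries) => {
      observer.disconnect(); // one-shot: we only want the initial report
      resolve(entries[0].isIntersecting);
    });
    observer.observe(element);
  });
}
```

Note the trade-off discussed throughout this thread: the result arrives asynchronously, so a ruleset consuming it must itself be async.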
- Reader mode actually serializes the DOM and ships it off to a worker where it gets reparsed, but that loses style information and is also slow.
- You could use a separate process, but then you'd have to do a second load of the page in that other process, which would probably make things even slower, and there's no guarantee the result would be the same.
- Accessing DOM and layout information off-main-thread is not trivial. In the style system, we access DOM information off-main-thread, but only while the main thread is paused.
Can we optimize `isVisible` for big performance wins in Firefox applications?

I'd note that it should also be possible to push the CSS Working Group to give something like `elementFromPoint{s}` a set of options to ignore the viewport clip or similar. That should make it work everywhere. Apparently there are a few old requests for that: https://lists.w3.org/Archives/Public/www-style/2012Oct/0683.html
cssom-view is unmaintained atm, but I'd be happy to help out there.
I filed https://github.com/w3c/csswg-drafts/issues/4122 to try to standardize something that would've helped here.
Per Emilio:
In practice, a good rule of thumb is something like "if there's something that depends on up-to date layout information, that flushes layout". Same for style and paint.
It is also pretty easy to test. Create a big page (or open one, like https://html.spec.whatwg.org/), and write a loop like:

```js
var start = performance.now();
for (var i = 0; i < 1000 /* insert/remove zeros as needed */; i++) {
  document.documentElement.style.display = i % 2 == 0 ? "none" : "";
  theApiYouWantToTest();
}
console.log(performance.now() - start);
```

If `theApiYouWantToTest()` flushes layout, you'll see it takes massively longer than if it doesn't. Compare putting something simple like `document.documentElement.style.color` (which doesn't need to update the style of the page) with `document.documentElement.getBoundingClientRect()` (which updates layout).

Note that the `document.documentElement.style.display = i % 2 == 0 ? "none" : "";` line is there to ensure that a layout style changes in each iteration. This makes the flush as expensive as possible and makes the time difference between something that does and doesn't trigger a flush very apparent.
In reality, it's possible to call something like `getBoundingClientRect` in a way that causes a flush without any layout styles (like the `display` value) having changed, so the cost of the flush is much reduced by an early return. That may be the case in `isVisible`, and may be why the jank was dominated by XRays rather than layout work (see the last section).
As noted here, the original, sync implementation of `isVisible` in Price Tracker caused the majority of Fathom-related jank on a sample page. What I discovered yesterday is that, if I only change when `isVisible` is executed (i.e. run it asynchronously immediately after a paint, so as not to trigger unnecessary layout flushes), there was a 42% reduction in Fathom-related jank! This improvement would be on top of any performance improvements to `isVisible` itself.
One less-than-ideal option is to add a new, async pre-processing step to Fathom that runs before the ruleset is executed. This step would only run if Fathom's `isVisible` function is being used in the ruleset.

The best option, however, is for Fathom itself to be made async. Something like:

```js
const results = await rules.against(document);
```

...and inside the ruleset, where it uses `isVisible`, its execution would pause until it could be run right after a paint.
Making Fathom async will enable further concurrency (see item 3) as well.
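That pause-until-right-after-a-paint behavior can be sketched with the unprivileged `requestAnimationFrame` + `setTimeout` trick (note that `runRuleset` and the `isVisible` parameter here are hypothetical stand-ins, not Fathom's real API):

```javascript
// requestAnimationFrame fires just before the next paint; a zero-delay
// setTimeout queued from inside it runs just after the paint, when layout
// is fresh and reading it won't force an extra flush.
function afterNextPaint() {
  return new Promise((resolve) =>
    requestAnimationFrame(() => setTimeout(resolve, 0))
  );
}

// Hypothetical async runner: defer visibility checks until after a paint.
async function runRuleset(nodes, isVisible) {
  await afterNextPaint();
  return nodes.filter(isVisible);
}
```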
@erikrose , Should we break out "Make Fathom async" into a separate issue for discussion?
Yes, please. It seems to me it should be possible to take a middle-of-the-road approach as well: call the existing synchronous Fathom routines in a requestAnimationFrame() callback, thus calling geometry-using routines like isVisible() at the optimal time without requiring a rewrite of the Fathom execution machinery. Correct?
On the same subject, I do notice that requestAnimationFrame() itself probably ceases to call its callbacks on background or otherwise invisible tabs. Whether this is a problem depends on the application, but it's something to keep in mind.
I filed a Performance Review Request[1] outlining the high level details for the Fathom/Smoot project, which is expected to be the first Firefox application of Fathom, and Erik and I met with dothayer from the performance team last week.
`promiseDocumentFlushed` (or the unprivileged `requestAnimationFrame`/`setTimeout` workaround) monitors style changes for the current frame, and it is possible to do all of Fathom's synchronous work in a single `promiseDocumentFlushed` callback.
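For privileged code, that single-callback idea might look like this sketch (`collectLayoutFacts` is a hypothetical helper; `win` is a chrome window exposing `promiseDocumentFlushed`; note that such a callback must not itself dirty style or layout):

```javascript
// Batch all of the layout reads into one promiseDocumentFlushed callback.
// The callback fires when no style/layout flush is pending, so the reads are
// cheap, and because it runs synchronously on the main thread, nothing can
// dirty layout between reads.
function collectLayoutFacts(win, nodes) {
  return win.promiseDocumentFlushed(() =>
    nodes.map((node) => ({
      node,
      rect: node.getBoundingClientRect(), // safe: layout is already clean
    }))
  );
}
```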
- The post-paint scheduling from the `isVisible` experiments (second section) is unnecessary here. This is because when the callback is executing the synchronous code, it is running on the main thread, and it cannot be interrupted (say, by something else that triggers another flush) until it completes.
- The `promiseDocumentFlushed` optimization only eliminates flushes, which is probably not the main performance problem. The main problems are more likely to be XRay work and the fact that Fathom will likely take too long to execute in one synchronous chunk. Therefore, a separate and more challenging optimization would be to amortize the work; for this, `requestIdleCallback` seems best suited.
- The difference between `promiseDocumentFlushed` and `requestIdleCallback` is that the former would only eliminate unnecessary layout flushes from the Fathom work; the latter actually executes the code at idle times in the browser.
- The code inside a `requestIdleCallback` callback needs to be sync, as this ensures the code is executed in the allotted time. A `setTimeout` (or other async call) inside of it would execute outside of the allotted idle time.
- `promiseDocumentFlushed` is with respect to the current frame, so if it is being called from a subframe, it doesn't look all the way up to the parent frame. This means that, if it were in a subframe, it wouldn't be able to tell us when the parent frame has just flushed.
- Another option is `windowUtils.needsFlush`; however, this suffers from the same limitations with respect to frame boundaries as `promiseDocumentFlushed`. Additionally, it will always return `true` if, say, the page has a continuous animation.
- A timeout can be passed to `requestIdleCallback` to ensure that, even if the page is very busy, the code will run after a certain period of time has passed.
- In short: use `requestIdleCallback`.
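The amortization idea can be sketched as follows (the node list and `scoreNode` are hypothetical stand-ins for real ruleset work):

```javascript
// Process as many nodes as fit in the current idle period. Kept separate
// from the scheduling so the chunking logic is easy to test.
function drainWhileIdle(nodes, start, deadline, scoreNode) {
  let i = start;
  // didTimeout means the timeout forced the callback to fire; finish the
  // remaining work rather than waiting for idle time that may never come.
  while (i < nodes.length && (deadline.timeRemaining() > 0 || deadline.didTimeout)) {
    scoreNode(nodes[i++]);
  }
  return i;
}

// Driver: reschedule until every node is scored. The timeout option
// guarantees progress even on a page that is never idle.
function scoreInIdleChunks(nodes, scoreNode, done) {
  let i = 0;
  function work(deadline) {
    i = drainWhileIdle(nodes, i, deadline, scoreNode);
    if (i < nodes.length) requestIdleCallback(work, { timeout: 500 });
    else done();
  }
  requestIdleCallback(work, { timeout: 500 });
}
```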
In following the Recommended Plan[2], here are the next steps:
- Add a `promiseDocumentFlushed` callback (or `requestAnimationFrame` with a `setTimeout` workaround for non-chrome code) for every http(s):// page, which runs a dummy, minimal Fathom ruleset against the DOM.
The Performance discussion was based largely on these restricted access documents on Mozilla's Google Drive:
Use `performance.now()` to measure how long Fathom is running (and/or `./mach browsertime`); then we can do a more thorough analysis.

A few other bits and pieces:

- There's also `mach browsertime`, but `performance.now()` should be good enough for starters. Speed Index and browsertime are better metrics than jank: you could hit "jank" while the browser is waiting for network IO, meaning that jank doesn't matter.

Here is a record of my latest notes on next steps for performance, since I was moved off the Fathom team.
References:
How do we know when Fathom is “fast enough”?
The performance team would need to run pageload tests with Fathom running, and get a set of numbers for how it regresses pageload on various pages under various conditions. At that point we would need some stakeholder (Eric Smythe?) to make a call as to whether that regression is acceptable or not.
What to do next:
- Profile ReaderMode/`isProbablyReaderable` in Firefox as a baseline (this is considered "acceptable"). Use a site with a path in its URL, per below: we have domain blocklists and never run readability on homepages.
- Make a branch off `erikrose/gecko-dev`'s `smoot-demo` branch, called `fathom-perf`, to try out an "empty ruleset" (using `promiseDocumentFlushed`, or `setTimeout` inside a `requestAnimationFrame` for non-chrome docs, instead of `DOMContentLoaded`, for running the ruleset).

Profile ReaderMode/`isProbablyReaderable` in Firefox as a baseline
Context
(Reference: Abbreviated Fathom / Smoot Perf Recommendations) Per Gijs, much of Reader Mode's work is done off the main thread, except for `isProbablyReaderable`. Mconley suggested we use this work as a baseline. Note: Reader Mode doesn't even run on home pages, so we need to choose a URL with a path.
Plan
- Use a local server (`fathom-serve`) to host a sample page locally. This removes network and other non-reproducible effects.
- Use `erikrose/gecko-dev`'s `master` branch to profile Reader Mode's `isProbablyReaderable` method.

80% of Fathom's time is spent calling DOM routines. Zibi was telling me that Fluent had the same problem and solved it by turning to DOM bindings, lowering its DOM accesses to direct C++ calls rather than going through the JS layer, which requires the runtime generation of reflection objects (different from XRays, which are for insulating content scripts from the page's monkeypatching). We could have Fathom compile rulesets to Rust. Or we could at least compile the parts which do DOM access, run them all at once up front, and ship their results back to JS. Zibi says the communication is fairly expensive. Lots of design space to explore here, obviously.
Victor got a 10x speed improvement by stubbing out getComputedStyle C++-side. He had suspicions that the time was largely going into flushes, but we lack evidence of it. Flushes don't show up in the flame graph. There is a fair amount of XRay overhead. 6% goes to `mozilla::ServoStyleSet::ResolveStyleLazily`, which could be cured by Emilio's `display:none` patch. Another 6% goes to `xpc::XrayWrapper::getOwnPropertyDescriptor`.
dthayer ported the entirety of isVisible() to C++ and got a 10x speedup based on running it over every node of an Amazon page. This is on top of his using versions of routines that avoid flushes. (The lack of node pruning in this experiment might offset the fact that Pricewise rulesets were only 67% isVisible() and new-password ones only 17%.)
See also this performance work: https://bugzilla.mozilla.org/show_bug.cgi?id=1709171.
Because it needs access to the DOM, Fathom currently wants to run on the main thread. Unless run in response to a user action, it can create a little jank, taking upwards of 40ms to run a ruleset, so we dare not run it, for example, on every page load. Can we speed it up or find a way to run it offthread?
One approach is to make Fathom run faster. About 80% of its runtime on the Pricewise ruleset is spent in DOM routines. Those do a lot of flushing of layout and other pipeline stages, redoing calculations unnecessarily. Is this a major source of wasted time? Measure. Are there lower-level hooks we can use? (`window.windowUtils.getBoundsWithoutFlushing()` might be a faster way of getting element size, for example; mattn suggested it. Also see https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Performance_best_practices_for_Firefox_fe_engineers, which has, for instance, routines to get the window size and scroll position without flushing things.) Other ideas?

Can we run Fathom offthread without losing access to too much signal? Reader Mode currently serializes the markup (only) and ships it offthread to parse. Could we do something like that but also apply CSS ourselves offthread? Would that preserve enough signal for most rulesets? Would it be too slow or battery-hungry on 2-core mobile devices?
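The `getBoundsWithoutFlushing` idea could be wrapped like this (a sketch for privileged code; `getBoundsCheap` is a hypothetical name, and the fallback path is for contexts without `windowUtils`):

```javascript
// In chrome (privileged) code, element.ownerGlobal.windowUtils exposes
// getBoundsWithoutFlushing, which returns the last-computed bounds instead
// of forcing a layout flush. Fall back to getBoundingClientRect elsewhere.
function getBoundsCheap(element) {
  const utils = element.ownerGlobal && element.ownerGlobal.windowUtils;
  return utils
    ? utils.getBoundsWithoutFlushing(element)
    : element.getBoundingClientRect();
}
```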
This bug is done when we can blithely run a Fathom ruleset on every Firefox page load without concern for dragging down the UX.