nical commented 6 years ago

Sometimes as I go through the code to do something I stumble on unrelated stuff and take notes, and then lose the notes after a while, this time I am pasting them here before I lose them.

Every time I have profiled webrender, allocating and growing vectors has always showed up towards the top of the profiles on the render backend. Also we are quite bottlenecked on memory bandwidth on the render backend (and pretty much anywhere else in Gecko really). Fun fact: on a laptop with a 4k screen, when I reduce the resolution by half (so 4 times less pixels), render backend times are halved even though it's showing the same things just at a different resolution.

Big vectors

we have a lot of big vectors which we could store as segmented vectors instead (basically a vector of vectors, see equivalent gecko type), to avoid reallocating data while preserving good cache behavior during iteration. If a lot of our data structures use a segmented vector with the same chunk size (in bytes), it should make the allocator's job a lot easier too (and if jemalloc fails to take advantage of that, we can easily explicitly pool the chunks).

This should be helpful in the various big vectors in prim_store and for the clip scroll tree.

Small vectors

We also have a lot of small vectors (clip scroll tree children, segments, etc.) which are allocated at the same time and discarded at the same time. Those are a great fit for being stored in a single big (segmented) vector and have primitives stores offsets to that vector. I think we don't even need to store vectors in the case of the clip scroll tree children, we can have the first child be stored right after its parent and store the offset of the next sibling.

AOS/SOA

We could use the structure of arrays pattern in a few places to slim down hot data and avoid loading lots of stuff when we only access a single member. For example moving the screen rect out of the primitive metadata into its own array would avoid loading the entire metadata struct in cache during the batching loop every time when half of the primitives are culled out on average (and a lot more if we ever decide to send larger display ports).

It's hard to tell if these will do huge differences (probably just a few percents) but the fruits are hanging so low, and they all are in stuff that I see a lot in profiles, so just dumping these notes for later if we want to speed up the render backend a bit.

A few (unrelated) numbers

In the process of looking at something I logged a few numbers to get a very rough sense of some of the quantities involved. It's not related to the above but I'm pasting it here anyway otherwise I'll lose it.

page	visible prims	culled prims	clip scroll nodes	clip chains
github.com browsing code	221	446	40	94
theverge.com (front)	188	168	169	146
theverge.com (article)	142	104	81	192
reddit.com (front)	438	380	306	722
reddit.com (a thread)	325	631	174	420
google (search coffee)	170	121	81	191
baidu (search coffee)	470	450	141	371
wikipedia (coffee page)	192	499	79	195
facebook (wall)	411	262	370	920
amazon (search page)	245	325	151	407
youtube (video)	270	315	265	642

Culled prims are primitives with PrimitiveMetadata::screen_rect == None, visible prims are the rest, clip scroll nodes and clip chains are the lengths of two vectors in ClipScrollTree. The amount of stuff in ClipScrollTree::clip_chains can be a bit surprising (not that I have a good intuition of what the numbers should be, but on facebook for example it ends up being a lot more than the total amount of primitives.

kvark commented 6 years ago

Thank you for writing this down!

we have a lot of big vectors which we could store as segmented vectors instead

Or we can re-use the existing allocations more efficiently. I believe too many times we just do Vec::new() or SomeIterator.collect() hoping to get it sorted properly at some point in the future.

We also have a lot of small vectors (clip scroll tree children, segments, etc.) which are allocated at the same time and discarded at the same time. Those are a great fit for being stored in a single big (segmented) vector and have primitives stores offsets to that vector.

Or we can use SmallVec.

For example moving the screen rect out of the primitive metadata into its own array would avoid loading the entire metadata struct in cache during the batching loop every time when half of the primitives are culled out on average (and a lot more if we ever decide to send larger display ports).

This is something we've been thinking about for a year (since last SF All Hands), albeit not for cache performance reasons.

nical commented 6 years ago

Smallvec in the primitive metadata would make it quite a bit bigger while we would gain from making it as small as possible IMO. I'm curious how many primitives get segmented and what the distribution of number of segments looks like on average.

mrobinson commented 6 years ago

The fact that ClipScrollTree::clip_chains is out of proportion with the number of primitives seems important and worth investigating.

gw3583 commented 6 years ago

@mrobinson Yea, that's a very interesting observation. Have you got cycles to do some investigation of that?

mrobinson commented 6 years ago

@gw3583 Sure. I can take a look.

mrobinson commented 6 years ago

@nical Can you comment on how you collected your stats? I would like to use the same procedure myself in order to ensure I'm looking at the same data.

nical commented 6 years ago

for the clip stuff I just printed the size of ClipScrollTree::clip_chains and ClipScrollTree::nodes at the end of update_tree.

nical commented 6 years ago

On the Gecko side, clip chains appear to be defined here: https://searchfox.org/mozilla-central/rev/40577076a6e7a14d143725d0288353c250ea2b16/gfx/layers/wr/ClipManager.cpp#299 which is pretty much invoked for each display item, so if Gecko is creating too many clip chains it probably comes from Gecko (as in non-webrender) clip chains. These seem to be created for scroll frames, the root (pres shell), nsDisplayItem::CreateClipChainIntersection and nsDisplayItem::FuseClipChainUpTo. Perhaps the latter two are gecko-specific optimizations that favor creating more smaller clips and play against us here? (I totally don't know this code).

servo / webrender

CPU optimizaitions notes #2765

Big vectors

Small vectors

AOS/SOA

A few (unrelated) numbers