Open nical opened 6 years ago
Thank you for writing this down!
we have a lot of big vectors which we could store as segmented vectors instead
Or we can re-use the existing allocations more efficiently. I believe too many times we just do Vec::new()
or SomeIterator.collect()
hoping to get it sorted properly at some point in the future.
We also have a lot of small vectors (clip scroll tree children, segments, etc.) which are allocated at the same time and discarded at the same time. Those are a great fit for being stored in a single big (segmented) vector and have primitives stores offsets to that vector.
Or we can use SmallVec.
For example moving the screen rect out of the primitive metadata into its own array would avoid loading the entire metadata struct in cache during the batching loop every time when half of the primitives are culled out on average (and a lot more if we ever decide to send larger display ports).
This is something we've been thinking about for a year (since last SF All Hands), albeit not for cache performance reasons.
Smallvec in the primitive metadata would make it quite a bit bigger while we would gain from making it as small as possible IMO. I'm curious how many primitives get segmented and what the distribution of number of segments looks like on average.
The fact that ClipScrollTree::clip_chains
is out of proportion with the number of primitives seems important and worth investigating.
@mrobinson Yea, that's a very interesting observation. Have you got cycles to do some investigation of that?
@gw3583 Sure. I can take a look.
@nical Can you comment on how you collected your stats? I would like to use the same procedure myself in order to ensure I'm looking at the same data.
for the clip stuff I just printed the size of ClipScrollTree::clip_chains
and ClipScrollTree::nodes
at the end of update_tree.
On the Gecko side, clip chains appear to be defined here: https://searchfox.org/mozilla-central/rev/40577076a6e7a14d143725d0288353c250ea2b16/gfx/layers/wr/ClipManager.cpp#299 which is pretty much invoked for each display item, so if Gecko is creating too many clip chains it probably comes from Gecko (as in non-webrender) clip chains. These seem to be created for scroll frames, the root (pres shell), nsDisplayItem::CreateClipChainIntersection and nsDisplayItem::FuseClipChainUpTo. Perhaps the latter two are gecko-specific optimizations that favor creating more smaller clips and play against us here? (I totally don't know this code).
Sometimes as I go through the code to do something I stumble on unrelated stuff and take notes, and then lose the notes after a while, this time I am pasting them here before I lose them.
Every time I have profiled webrender, allocating and growing vectors has always showed up towards the top of the profiles on the render backend. Also we are quite bottlenecked on memory bandwidth on the render backend (and pretty much anywhere else in Gecko really). Fun fact: on a laptop with a 4k screen, when I reduce the resolution by half (so 4 times less pixels), render backend times are halved even though it's showing the same things just at a different resolution.
Big vectors
we have a lot of big vectors which we could store as segmented vectors instead (basically a vector of vectors, see equivalent gecko type), to avoid reallocating data while preserving good cache behavior during iteration. If a lot of our data structures use a segmented vector with the same chunk size (in bytes), it should make the allocator's job a lot easier too (and if jemalloc fails to take advantage of that, we can easily explicitly pool the chunks).
This should be helpful in the various big vectors in prim_store and for the clip scroll tree.
Small vectors
We also have a lot of small vectors (clip scroll tree children, segments, etc.) which are allocated at the same time and discarded at the same time. Those are a great fit for being stored in a single big (segmented) vector and have primitives stores offsets to that vector. I think we don't even need to store vectors in the case of the clip scroll tree children, we can have the first child be stored right after its parent and store the offset of the next sibling.
AOS/SOA
We could use the structure of arrays pattern in a few places to slim down hot data and avoid loading lots of stuff when we only access a single member. For example moving the screen rect out of the primitive metadata into its own array would avoid loading the entire metadata struct in cache during the batching loop every time when half of the primitives are culled out on average (and a lot more if we ever decide to send larger display ports).
It's hard to tell if these will do huge differences (probably just a few percents) but the fruits are hanging so low, and they all are in stuff that I see a lot in profiles, so just dumping these notes for later if we want to speed up the render backend a bit.
A few (unrelated) numbers
In the process of looking at something I logged a few numbers to get a very rough sense of some of the quantities involved. It's not related to the above but I'm pasting it here anyway otherwise I'll lose it.
Culled prims are primitives with
PrimitiveMetadata::screen_rect == None
, visible prims are the rest, clip scroll nodes and clip chains are the lengths of two vectors inClipScrollTree
. The amount of stuff inClipScrollTree::clip_chains
can be a bit surprising (not that I have a good intuition of what the numbers should be, but on facebook for example it ends up being a lot more than the total amount of primitives.