Optimizing the size of WidgetState

PoignardAzur commented 1 month ago

Right now WidgetState is a pretty chunky struct. VSCode with Rust Analyzer currently tells us:

// size = 296 (0x128), align = 0x8

Yeah. Three and a half widgets are enough to fill a kilobyte. Your usual L1 cache is about 64KB (from a quick Wikipedia check), which is 216 WidgetStates.

That's probably pretty bad. A lot of realistic applications have more than two hundred widgets, and we don't want them to overflow the L1 cache. (For reference, this github issue after two answers has 2467 html elements.)

What can we do to reduce the size? Some ideas:

Bitflags

There are 26 boolean flags.

They take 26 bytes, which we could get down to 4 bytes with bitflags.

However, bitflags are more annoying to use that booleans, and switching to bitflags would only be enough for a <10% size reduction.

Replace f64 values with f32s

We use a lot of f64 types:

size: 2.
origin: 2
window_origin: 2
paint_insets: 4
local_paint_rect: 4
baseline_offset: 1
ime_area: 4
clip_path: 4
translation: 2

That's a total of 25 f64 values, which is 200 bytes, two thirds of our size.

Realistically, virtually all of them could be replaced with f32s with almost no code change, saving us about 100 bytes.

Most of them deal with local coordinates where we expect that most of the range of f64s will be wasted. A widget is never gonna have billions of pixels of paint insets, for instance.

If we ever decide to switch to fractional coordinates, we could store a lot of them as u16, which would be yet more size savings.

Optional fields

The fields ime_area and clip_path are optional. Moreover, fields like paint_insets, baseline_offset and translation aren't optional, but their value is almost always zero.

(cursor too, but the type itself is very lightweight.)

We could store these fields in an Option<Box<...>>, and save memory for 90% of widgets.

Struct-of-array

The most radical change would be to store WidgetStates in a struct-of-arrays pattern.

The problem with WidgetState right now is that a single instance occupies multiple cache lines. That means that accessing two widget states is always going to require touching multiple cache lines.

If we could instead store WidgetStates by field, the cruft would be a lot less of a problem: code iterating over flags would keep the flag array in L1 and wouldn't care that some massive amount of layout bytes is being kept in main memory somewhere.

Switching to struct-of-arrays would be a very ambitious change, probably requiring some radical improvements to the architecture of TreeArena. It would probably impact every single line of code currently using WidgetState.

Benchmark?

Before we do any of that, we probably want to check that Masonry's memory usage is actually bad, is actually bottlenecked on WidgetState, and that L1 misses have an actual impact on our performance. I'm not sure how we'd check that, suggestions are welcome.

I don't expect us to address these problems overnight. We should expect this to be a long-term issue.

jaredoconnell commented 1 month ago

Do you know how values are aligned in Rust? I looked it up and it's platform specific, but alignment could reduce any benefit from switching from f64 to f32 or f16.

PoignardAzur commented 1 month ago

In general aggregates are aligned with their most-aligned field.

So a (f64, f64, f32) is going to be 8-aligned, while a (f32, f32) is going to be 4-aligned. If you switch all the fields of a struct from f64 to f32, alignment doesn't cost you anything.

(Also, just because a struct is N-aligned doesn't mean its fields are N-aligned. So for instance WidgetState is 8-aligned, but individual bool flags of WidgetState are 1-aligned.)

Philipp-M commented 1 month ago

Also Rust is able to do some optimizations, so for Option<Box<T>> it can use the property that Box<T> is never a null pointer as it's wrapping a NonNull<T>, so the Option doesn't take extra space, generally it can pack enums with two variants when using NonNull or NonZero to the same size as T (see also here), also there's #[repr(packed)] but I don't think we want to use that... I likely have missed something else.

That said, I'd say benching and profiling first, before attempting potentially premature-optimizations (while I'm currently doing pretty much that in a different context^^). A struct-of-array approach may be interesting, though I think we would have to check how it's accessed, this can likely not really result in an improvement of performance, this is still a comparatively small struct (e.g. when size, origin and translate is often used together, having them in separate arrays likely won't help).

My guess is, that we have greater potential for optimization elsewhere for a long time.

test3211234 commented 3 weeks ago

I think (?) the struct-of-lists pattern would make it so that WidgetIds no longer have a clear owner/corresponding Widget (not too sure what value a list of ids has). This would make it so Widget implementations themselves would have to store their WidgetId and have a function like .id() that returns it. Do you think that's worth it? Of course, I could be mistaken.

PoignardAzur commented 3 weeks ago

The current pattern is that WidgetPod stores the WidgetId, which points to both the widget implementation and the WidgetState. So the indirection already happens and already needs to be accommodated in the code, but widget impls don't need to store a WidgetId.

test3211234 commented 2 weeks ago

The update_widget_tree function only would have like 30 ‘.get’s for each widget. If I remember correctly each call is around 500 ps. Is that acceptable? Not saying it’s bad just wondering.

PoignardAzur commented 2 weeks ago

Not sure what you're asking.

test3211234 commented 2 weeks ago

I mean, you would have to do a lot of array indexing, once for each state you want to access. The performance cost might add up? In reality you'll probably use SecondaryMap or similar for each state field I'm guessing.

PoignardAzur commented 2 weeks ago

The short answer is we'd have to test and measure it to know for sure.

The long answer is that accessing N fields in M items requires as many load instructions in struct-of-array and array-of-struct, and that most of the function calls involved would likely get inlined in optimized builds. The difference is in how the load patterns deal with the CPU caches.

test3211234 commented 2 weeks ago

Can you give me an example case of where you would use this pattern to improve performance theoretically?

When I look at passes, the way I think of it is we could store bitflags in a SecondaryMap and then iterate through widget keys to update those flags. A potential problem is, Slot/SecondaryMap iterate in arbitrary order. I still need to find the best arena crate 🤔

linebender / xilem