linebender / xilem

An experimental Rust native UI framework
Apache License 2.0

Optimizing the size of WidgetState #706

Open PoignardAzur opened 2 weeks ago

PoignardAzur commented 2 weeks ago

Right now WidgetState is a pretty chunky struct. VSCode with Rust Analyzer currently tells us:

// size = 296 (0x128), align = 0x8

Yeah. Three and a half widgets are enough to fill a kilobyte. Your usual L1 cache is about 64KB (from a quick Wikipedia check), which is 216 WidgetStates.

That's probably pretty bad. A lot of realistic applications have more than two hundred widgets, and we don't want them to overflow the L1 cache. (For reference, this github issue after two answers has 2467 html elements.)
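As a rough check of the arithmetic above, here is a minimal sketch. `FakeWidgetState` is a hypothetical stand-in that only matches the reported size of 296 bytes; it is not the real struct. The count is 216 if "64KB" means 64,000 bytes and 221 if it means 64 KiB.

```rust
use std::mem::size_of;

// Hypothetical stand-in: same size and alignment as the reported WidgetState.
#[allow(dead_code)]
struct FakeWidgetState {
    bytes: [u64; 37], // 37 * 8 = 296 bytes, 8-aligned
}

/// How many values of `T` fit in a cache of `cache_bytes` bytes.
fn states_per_cache<T>(cache_bytes: usize) -> usize {
    cache_bytes / size_of::<T>()
}

fn main() {
    println!("size = {}", size_of::<FakeWidgetState>());
    println!("fit in 64,000 bytes: {}", states_per_cache::<FakeWidgetState>(64_000));
    println!("fit in 64 KiB: {}", states_per_cache::<FakeWidgetState>(64 * 1024));
}
```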

What can we do to reduce the size? Some ideas:

Bitflags

There are 26 boolean flags.

They take 26 bytes, which we could get down to 4 bytes with bitflags.

However, bitflags are more annoying to use than booleans, and switching to bitflags would only be enough for a <10% size reduction.
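To illustrate the trade-off, here is a hand-rolled sketch of packing flags into a `u32`, without the `bitflags` crate. The flag names are hypothetical and not taken from the actual `WidgetState`.

```rust
use std::mem::size_of;

// Hypothetical flag set packed into 32 bits (enough room for 26 flags).
#[derive(Clone, Copy, Default, PartialEq, Eq, Debug)]
struct WidgetFlags(u32);

#[allow(dead_code)]
impl WidgetFlags {
    const IS_HOVERED: u32 = 1 << 0;
    const HAS_FOCUS: u32 = 1 << 1;
    const NEEDS_LAYOUT: u32 = 1 << 2;
    // ... and so on, up to 1 << 25 for the 26th flag.

    fn contains(self, flag: u32) -> bool {
        self.0 & flag != 0
    }

    fn set(&mut self, flag: u32, value: bool) {
        if value {
            self.0 |= flag;
        } else {
            self.0 &= !flag;
        }
    }
}

// The current approach for comparison: one byte per flag.
#[allow(dead_code)]
struct BoolFlags {
    flags: [bool; 26],
}

fn main() {
    let mut flags = WidgetFlags::default();
    flags.set(WidgetFlags::NEEDS_LAYOUT, true);
    println!("needs layout: {}", flags.contains(WidgetFlags::NEEDS_LAYOUT));
    println!("packed: {} bytes", size_of::<WidgetFlags>());
    println!("bools: {} bytes", size_of::<BoolFlags>());
}
```

The "more annoying" part shows up at every call site: `state.has_focus = true` becomes `state.flags.set(WidgetFlags::HAS_FOCUS, true)`.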

Replace f64 values with f32s

We use a lot of f64 types:

That's a total of 25 f64 values, which is 200 bytes, two thirds of our size.

Realistically, virtually all of them could be replaced with f32s with almost no code change, saving us about 100 bytes.

Most of them deal with local coordinates where we expect that most of the range of f64s will be wasted. A widget is never gonna have billions of pixels of paint insets, for instance.

If we ever decide to switch to fractional coordinates, we could store a lot of them as u16, which would be yet more size savings.
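A quick sketch of the f64-to-f32 numbers above. The two structs are hypothetical placeholders holding 25 values each, not the real layout fields. The helper shows that f32 represents typical pixel coordinates exactly and only starts losing whole units past 2^24.

```rust
use std::mem::size_of;

// Hypothetical placeholders for "25 floating-point fields".
#[allow(dead_code)]
struct LayoutF64 {
    values: [f64; 25],
}

#[allow(dead_code)]
struct LayoutF32 {
    values: [f32; 25],
}

/// Round-trips a coordinate through f32 and returns the absolute error.
fn f32_roundtrip_error(x: f64) -> f64 {
    ((x as f32) as f64 - x).abs()
}

fn main() {
    println!("f64 fields: {} bytes", size_of::<LayoutF64>());
    println!("f32 fields: {} bytes", size_of::<LayoutF32>());
    // A half-pixel coordinate in the thousands survives unchanged.
    println!("error at 1234.5: {}", f32_roundtrip_error(1234.5));
    // Just past 2^24, f32 can no longer represent every integer.
    println!("error at 16777217: {}", f32_roundtrip_error(16_777_217.0));
}
```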

Optional fields

The fields ime_area and clip_path are optional. Moreover, fields like paint_insets, baseline_offset and translation aren't optional, but their value is almost always zero.

(cursor too, but the type itself is very lightweight.)

We could store these fields in an Option<Box<...>>, and save memory for 90% of widgets.
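One possible shape for this, as a sketch only: `Rect`, `RareState` and `CompactState` are hypothetical names, and the field types are guesses rather than the real ones. Widgets that never set a rare field pay one pointer-sized slot. The others pay a heap allocation on first write.

```rust
use std::mem::size_of;

// Hypothetical geometry type standing in for whatever the real fields use.
#[allow(dead_code)]
#[derive(Default, Debug, Clone, Copy, PartialEq)]
struct Rect {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

// Rarely-set fields, grouped so they can live behind one pointer.
#[allow(dead_code)]
#[derive(Default, Debug)]
struct RareState {
    ime_area: Option<Rect>,
    clip_path: Option<Rect>,
    paint_insets: [f64; 4],
    baseline_offset: f64,
    translation: [f64; 2],
}

#[derive(Default)]
struct CompactState {
    // `None` means "every rare field has its default value".
    rare: Option<Box<RareState>>,
}

impl CompactState {
    fn baseline_offset(&self) -> f64 {
        self.rare.as_ref().map_or(0.0, |r| r.baseline_offset)
    }

    fn set_baseline_offset(&mut self, value: f64) {
        // Allocates the rare block only the first time a rare field is written.
        self.rare.get_or_insert_with(Box::default).baseline_offset = value;
    }
}

fn main() {
    let mut state = CompactState::default();
    println!("default baseline: {}", state.baseline_offset());
    state.set_baseline_offset(3.5);
    println!("after set: {}", state.baseline_offset());
    println!("inline cost: {} bytes", size_of::<CompactState>());
    println!("boxed block: {} bytes", size_of::<RareState>());
}
```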

Struct-of-array

The most radical change would be to store WidgetStates in a struct-of-arrays pattern.

The problem with WidgetState right now is that a single instance occupies multiple cache lines. That means that accessing two widget states is always going to require touching multiple cache lines.

If we could instead store WidgetStates by field, the cruft would be a lot less of a problem: code iterating over flags would keep the flag array in L1 and wouldn't care that some massive amount of layout bytes is being kept in main memory somewhere.

Switching to struct-of-arrays would be a very ambitious change, probably requiring some radical improvements to the architecture of TreeArena. It would probably impact every single line of code currently using WidgetState.
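For a sense of what that would look like, here is a minimal struct-of-arrays sketch. `WidgetStates` and its fields are hypothetical, and it ignores everything TreeArena does about parent/child structure. Widgets are identified by an index into parallel vectors.

```rust
// Hypothetical struct-of-arrays container: one Vec per field.
#[derive(Default)]
struct WidgetStates {
    flags: Vec<u32>,
    sizes: Vec<[f32; 2]>,
    origins: Vec<[f32; 2]>,
}

impl WidgetStates {
    /// Adds a widget and returns its index into every field array.
    fn push(&mut self, flags: u32, size: [f32; 2], origin: [f32; 2]) -> usize {
        let index = self.flags.len();
        self.flags.push(flags);
        self.sizes.push(size);
        self.origins.push(origin);
        index
    }

    /// Scans only the densely packed flag array; layout data is never touched.
    fn count_with_flag(&self, flag: u32) -> usize {
        self.flags.iter().filter(|f| **f & flag != 0).count()
    }

    fn size(&self, index: usize) -> [f32; 2] {
        self.sizes[index]
    }

    fn origin(&self, index: usize) -> [f32; 2] {
        self.origins[index]
    }
}

fn main() {
    const NEEDS_LAYOUT: u32 = 1 << 0; // hypothetical flag bit
    let mut states = WidgetStates::default();
    let a = states.push(NEEDS_LAYOUT, [100.0, 20.0], [0.0, 0.0]);
    states.push(0, [50.0, 20.0], [100.0, 0.0]);
    println!("widgets needing layout: {}", states.count_with_flag(NEEDS_LAYOUT));
    println!("widget {a}: size {:?} at {:?}", states.size(a), states.origin(a));
}
```

With 4-byte flags, sixteen widgets share one 64-byte cache line during a flag scan, instead of each widget spanning several lines.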

Benchmark?

Before we do any of that, we probably want to check that Masonry's memory usage is actually bad, is actually bottlenecked on WidgetState, and that L1 misses have an actual impact on our performance. I'm not sure how we'd check that, suggestions are welcome.
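One crude starting point, as a sketch under heavy assumptions: time the same flag scan over fat structs and over a dense flag array. `FatState` is a hypothetical 296-byte stand-in, not the real WidgetState. Wall-clock timing is only a proxy, so confirming actual L1 misses would still need hardware counters or a profiler run on Masonry itself.

```rust
use std::hint::black_box;
use std::time::Instant;

// Hypothetical stand-in: one flag plus padding, 296 bytes in total.
struct FatState {
    is_hot: bool,
    _rest: [u8; 295],
}

fn make_states(n: usize) -> Vec<FatState> {
    (0..n)
        .map(|i| FatState { is_hot: i % 3 == 0, _rest: [0; 295] })
        .collect()
}

/// Scans the flag through the fat structs (one struct touched per widget).
fn count_hot_aos(states: &[FatState]) -> usize {
    states.iter().filter(|s| s.is_hot).count()
}

/// Scans the same flag stored densely in its own array.
fn count_hot_soa(flags: &[bool]) -> usize {
    flags.iter().filter(|f| **f).count()
}

fn main() {
    let states = make_states(100_000);
    let flags: Vec<bool> = states.iter().map(|s| s.is_hot).collect();

    let start = Instant::now();
    let aos = count_hot_aos(black_box(states.as_slice()));
    let aos_time = start.elapsed();

    let start = Instant::now();
    let soa = count_hot_soa(black_box(flags.as_slice()));
    let soa_time = start.elapsed();

    assert_eq!(aos, soa);
    println!("fat structs: {aos} hot in {aos_time:?}");
    println!("dense flags: {soa} hot in {soa_time:?}");
}
```

A single pass like this is noisy. Build in release mode and repeat the measurement before reading anything into the numbers.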

I don't expect us to address these problems overnight. We should expect this to be a long-term issue.

jaredoconnell commented 2 weeks ago

Do you know how values are aligned in Rust? I looked it up and it's platform specific, but alignment could reduce any benefit from switching from f64 to f32 or f16.

PoignardAzur commented 2 weeks ago

In general aggregates are aligned with their most-aligned field.

So a (f64, f64, f32) is going to be 8-aligned, while a (f32, f32) is going to be 4-aligned. If you switch all the fields of a struct from f64 to f32, alignment doesn't cost you anything.

(Also, just because a struct is N-aligned doesn't mean its fields are N-aligned. So for instance WidgetState is 8-aligned, but individual bool flags of WidgetState are 1-aligned.)
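These alignment claims can be checked directly. The sketch below assumes a typical 64-bit target where f64 is 8-aligned (some 32-bit targets align f64 to 4 bytes instead); `Mixed` is a hypothetical example struct.

```rust
use std::mem::{align_of, size_of};

// Hypothetical struct mixing an 8-aligned field with a 1-aligned one.
#[allow(dead_code)]
struct Mixed {
    value: f64,
    flag: bool,
}

fn main() {
    // The aggregate takes the alignment of its most-aligned field.
    println!("(f64, f64, f32): align {}", align_of::<(f64, f64, f32)>());
    println!("(f32, f32): align {}", align_of::<(f32, f32)>());
    // Size is rounded up to a multiple of the alignment: 20 -> 24.
    println!("(f64, f64, f32): size {}", size_of::<(f64, f64, f32)>());
    println!("(f32, f32): size {}", size_of::<(f32, f32)>());
    // The struct is 8-aligned, but a bool on its own is still 1-aligned.
    println!("Mixed: align {}, size {}", align_of::<Mixed>(), size_of::<Mixed>());
    println!("bool: align {}", align_of::<bool>());
}
```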

Philipp-M commented 2 weeks ago

Rust can also do some layout optimizations on its own. For Option<Box<T>>, it uses the fact that Box<T> is never a null pointer (it wraps a NonNull<T>), so the Option doesn't take extra space. More generally, it can pack enums with two variants to the same size as T when the payload is a NonNull or NonZero type (see also here). There's also #[repr(packed)], but I don't think we want to use that... I've likely missed something else.
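The niche optimization is easy to observe with `size_of`. This is a small sketch using only standard library types; the `Option<u32>` line is there for contrast, since `u32` has no spare bit pattern to encode `None`.

```rust
use std::mem::size_of;
use std::num::NonZeroU32;
use std::ptr::NonNull;

fn main() {
    // Null is never a valid Box, so `None` reuses that bit pattern.
    println!("Box<u64>: {}", size_of::<Box<u64>>());
    println!("Option<Box<u64>>: {}", size_of::<Option<Box<u64>>>());
    // The same trick applies to NonNull and the NonZero integers.
    println!("Option<NonNull<u8>>: {}", size_of::<Option<NonNull<u8>>>());
    println!("Option<NonZeroU32>: {}", size_of::<Option<NonZeroU32>>());
    // No niche available here, so a separate discriminant is needed.
    println!("Option<u32>: {}", size_of::<Option<u32>>());
}
```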

That said, I'd say bench and profile first, before attempting potentially premature optimizations (while I'm currently doing pretty much that in a different context^^). A struct-of-arrays approach may be interesting, though I think we would have to check how the state is accessed. It may well not improve performance, since this is still a comparatively small struct (e.g. when size, origin and translate are often used together, having them in separate arrays likely won't help).

My guess is that we'll have greater potential for optimization elsewhere for a long time.