rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Optimize enums by niche-filling even when the smaller variants also have data. #46213

Closed eddyb closed 1 year ago

eddyb commented 6 years ago

The current implementation only handles enums of roughly the form:

enum E {
    A(ALeft..., Niche, ARight...),
    B,
    C
}

This can be seen as a special-case, where B and C occupy 0 bytes, of:

enum E {
    A(ALeft..., Niche, ARight...),
    B(B...),
    C(C...)
}

As long as B and C can fit before or after A's Niche, we can still apply the optimization. Also see https://github.com/rust-lang/rfcs/issues/1230#issuecomment-345456535 for the initial description.
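
For a concrete (hypothetical) instantiation of this shape, using NonZeroU8 as the niche-carrying field, here is a minimal sketch that just prints whatever layout the current compiler picks:

use std::num::NonZeroU8;

enum E {
    A(u8, NonZeroU8, u8),
    B(u8),
    C(u8),
}

fn main() {
    // Today the compiler adds a separate tag because B and C carry data; with
    // the proposed optimization, NonZeroU8 == 0 plus the bytes around it could
    // encode B and C, so E could be as small as variant A itself.
    println!("size_of::<E>() = {}", std::mem::size_of::<E>());
}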

Gankra commented 6 years ago

cc me

est31 commented 6 years ago

what is "niche" in this case?

Gankra commented 6 years ago

A niche is a location in the type where some bit patterns aren't valid.

For instance struct Foo(u32, bool, u32) has a niche where the bool is, because only 0 and 1 are valid bools. Option<Foo> can therefore use Foo.1 == 2 to represent None.
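
A quick way to see the niche in action (a minimal sketch; exact sizes are not guaranteed by the language, but current compilers do use the bool's niche here):

struct Foo(u32, bool, u32);

fn main() {
    // Only 0 and 1 are valid for the bool, so Option reuses another bit
    // pattern in that byte for None instead of adding a separate tag.
    println!("{}", std::mem::size_of::<Foo>());         // typically 12
    println!("{}", std::mem::size_of::<Option<Foo>>()); // typically 12 as well
}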

est31 commented 6 years ago

@Gankro thanks, that makes sense.

eddyb commented 6 years ago

We probably need better terminology: "invalid" values and an "invalid reuse" optimization?

Gankra commented 6 years ago

Here is a test case we currently fail to optimize, but this issue would fix:

Result<Result<u32, u32>, u32>

The compiler sees:

enum Result<Result<u32, u32>, u32>  {
  Ok({ tag: 0u32 | 1u32, payload: u32 }),
  Err(u32),
}

which should lower to:

{ tag: 0u32 | 1u32 | 2u32, payload: u32 }

because the Err payload fits after the niche (Ok.tag).
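
As a sketch of the gap being described (the first size is whatever the current compiler produces; the second is the size of the proposed tag-plus-payload layout):

fn main() {
    // Today the outer Result adds its own tag in front of the 8-byte inner
    // Result, because the Err payload does not yet reuse the space after the
    // inner tag.
    println!("{}", std::mem::size_of::<Result<Result<u32, u32>, u32>>());
    // The proposed layout is just a widened tag plus one u32 of payload:
    println!("{}", std::mem::size_of::<(u32, u32)>()); // 8
}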

leonardo-m commented 6 years ago

Before merging this PR I think we should see benchmarks of both memory saved and run-time saved for some real program, like the Rust compiler and Servo.

leonardo-m commented 6 years ago

I have yet to see similar benchmarks for #45225.

Gankra commented 6 years ago

Some slightly more complex cases from webrender that would also benefit from this:

// Use the Opacity.0.tag niche
pub enum FilterOp {
    Blur(f32),
    Brightness(f32),
    Contrast(f32),
    Grayscale(f32),
    HueRotate(f32),
    Invert(f32),
    Opacity(PropertyBinding<f32>, f32),
    Saturate(f32),
    Sepia(f32),
}

pub enum PropertyBinding<T> {
    Value(T),
    Binding(PropertyBindingKey<T>),
}

pub struct PropertyBindingKey<T> {
    pub id: PropertyBindingId,
    _phantom: PhantomData<T>,
}

pub struct PropertyBindingId {
    namespace: IdNamespace,
    uid: u32,
}

pub struct IdNamespace(pub u32);

// use the RoundedRect.1.mode.tag niche
pub enum LocalClip {
    Rect(LayoutRect),
    RoundedRect(LayoutRect, ComplexClipRegion),
}

#[repr(C)]
pub struct ComplexClipRegion {
    pub rect: LayoutRect,
    pub radii: BorderRadius,
    pub mode: ClipMode,
}

#[repr(C)]
pub struct BorderRadius {
    pub top_left: LayoutSize,
    pub top_right: LayoutSize,
    pub bottom_left: LayoutSize,
    pub bottom_right: LayoutSize,
}

#[repr(C)]
pub enum ClipMode {
    Clip,
    ClipOut,
}

#[repr(C)]
struct LayoutRect(f32, f32, f32, f32);

#[repr(C)]
struct LayoutSize(f32, f32);

abonander commented 6 years ago

A simpler case that I think fits this optimization (or a specialization of it as a starting point):

// my use-case, a single value is most common
enum OccupiedSmallVec {
     Single(Foo),
     Multiple(Vec<Foo>),
}

If Foo is two usizes or smaller, then OccupiedSmallVec should be the same size as Vec<Foo>, as if it were implemented via this union:

#![feature(untagged_unions)]
union OccupiedSmallVec {
    // where `single.0 == 0`
    single: (usize, Foo),
    multiple: Vec<Foo>,
}

However, it currently requires a separate tag and padding to align the pointers, wasting basically another usize.
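
A size-check sketch with a concrete two-word Foo (the [usize; 2] payload is just an assumption for illustration):

struct Foo([usize; 2]); // hypothetical two-word payload

enum OccupiedSmallVec {
    Single(Foo),
    Multiple(Vec<Foo>),
}

fn main() {
    // Vec<Foo> is three words; with the optimization, OccupiedSmallVec could
    // match it by reusing the null niche of Vec's data pointer as the tag.
    println!("{}", std::mem::size_of::<Vec<Foo>>());         // 3 words
    println!("{}", std::mem::size_of::<OccupiedSmallVec>()); // currently larger:
                                                             // tag plus padding
}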

eddyb commented 6 years ago

@abonander Not really, that's the entire optimization, assuming Vec<Foo> starts with a non-null pointer, and you're packing Foo in the leftover space after that pointer.

eddyb commented 6 years ago

From IRC (cc @nox): if we generally want to avoid overlapping variant data where possible, we could extend this to always compute the size of the tagged layout (which would likely be larger than A itself) and allow using more space around A's fields, up to that size, to fit the other variants.

Then Result<&T, E> and similar would have the layout of (Option<&T>, E).
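
For a concrete instance, a sketch comparing a Result against the tuple layout described above (only sizes are printed; the point of the proposal is that the Ok pointer keeps its branchless Option<&T>-style niche encoding while E sits alongside it):

fn main() {
    // Under the proposal, Result<&u32, u64> would be laid out like this tuple:
    // the Ok pointer's null niche serves as the tag, the Err payload follows.
    println!("{}", std::mem::size_of::<Result<&u32, u64>>());
    println!("{}", std::mem::size_of::<(Option<&u32>, u64)>());
}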

nox commented 6 years ago

Logs: https://botbot.me/mozilla/rustc/2018-03-20/?msg=98016356&page=1

killercup commented 6 years ago

IIUC, this optimization could also work to make Cow<str> and Cow<[T]> 3 words, right? (cf. discussion on reddit)

nox commented 6 years ago

Yes.
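
A sketch of the current situation on a 64-bit target (the exact size is compiler-dependent, but today the tag is stored outside the three-word String payload):

use std::borrow::Cow;

fn main() {
    // Borrowed(&str) is 2 words, Owned(String) is 3 words; today a separate
    // tag pushes the whole enum to 4 words. With the proposed optimization the
    // tag could live in String's non-null pointer niche, giving 3 words.
    println!("{}", std::mem::size_of::<Cow<'static, str>>());
}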

newpavlov commented 5 years ago

Another example which can be optimized is:

pub enum Bar {
    A = 1,
    B = 2,
    C = 3,
}

pub enum Baz {
    D = 4,
    E = 5,
    F = 6,
}

pub enum Foo {
    Bar(Bar),
    Baz(Baz),
}

Currently Foo takes 2 bytes, while it can be trivially represented as 1.
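
A sketch of the size check, reusing the definitions above (today the outer enum stores its own tag byte in front of the one-byte payload):

fn main() {
    println!("{}", std::mem::size_of::<Bar>()); // 1
    println!("{}", std::mem::size_of::<Baz>()); // 1
    // Currently 2; the value ranges 1..=3 and 4..=6 never overlap,
    // so a single byte could in principle encode both variant and payload.
    println!("{}", std::mem::size_of::<Foo>());
}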

nox commented 5 years ago

@newpavlov That’s way harder to optimise because it becomes more complicated to compute the discriminant of a Foo value.

nagisa commented 5 years ago

Another example which can be optimized is:

This cannot happen unless we make Bar and Baz types that are smaller than a byte, which is not something we can do (for a variety of reasons). Consider that it is possible to take an internal reference into data stored within some variant of Foo (i.e. either &Bar or &Baz), and this reference, when dereferenced, must yield the original data. The only way to achieve that here would be to use some of the "unused" bits to store the active variant of the enum, but that would then require bit twiddling when dereferencing any &Bar or &Baz, even if those weren't stored within the enum.
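
To illustrate the constraint, a sketch reusing the Foo/Bar/Baz definitions above: an interior reference must point at a self-contained, valid Bar, so reading through it cannot depend on bits stored elsewhere in Foo.

fn bar_of(foo: &Foo) -> Option<&Bar> {
    // This &Bar may be passed around on its own, so dereferencing it must not
    // require knowing (or re-deriving) which variant of Foo it came from.
    match foo {
        Foo::Bar(b) => Some(b),
        Foo::Baz(_) => None,
    }
}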

RalfJung commented 5 years ago

@nagisa Bar and Baz have disjoint sets of valid values, though. So it could be done.

It doesn't fit the "niche" scheme, though.

newpavlov commented 5 years ago

@nox I think it can be handled as a special case: if all of an enum's variants contain only field-less enums, check that their discriminant values do not overlap and fit into a minimal repr. Yes, making it more generic will be significantly more involved, but even with the field-less constraint this optimization would help a lot for FFI and parsing cases (currently you often have to keep a huge number of consts, use unwieldy enums, or sacrifice performance).

@nagisa Ah, I should've mentioned explicitly that I want this foremost for field-less enums (see the issue linked before my first message). More generic handling would be nice, but is less important.

nox commented 5 years ago

It makes mem::discriminant way more complex. Currently, computing the discriminant of any enum value is at most a comparison and a subtraction. Such optimisations would make that computation more complex; for example, what would if let Foo::Bar(_) = foo { ... } compile to?

alercah commented 5 years ago

Optimizing for discriminant extraction seems, to me, far less valuable on average than optimizing for space. Optimizing for the ease of writing an intrinsic for it seems even less valuable. In the provided example, if let Foo::Bar(_) = foo { .. } is if *std::mem::transmute::<_, &u8>(&foo) <= 3 { ... }, which doesn't even have a subtraction.

alercah commented 5 years ago

I just now realized that std::mem::discriminant is, in general, not required to correspond to the specified discriminant. The Discriminant type does not directly expose the actual value of the discriminant, and the underlying intrinsic is, like all intrinsics, unstable. Only the Debug and Hash impls expose it indirectly, and I don't think those should be considered guaranteed to be stable.

In fact, I believe that the only way you can observe the discriminant of a variant of a #[repr(Rust)] enum is via numeric cast, and you can't do that for an enum with fields. So I would propose that, at least for now, explicit discriminants only be permitted on enums with other representations.

oli-obk commented 5 years ago

You need to distinguish between the tag and the discriminant. The tag is an arbitrary bit pattern that stores a value that can be used to extract the discriminant. The discriminant exists even for types like Option<&u32>, which have no tag.

tag extraction really is just a field access. Getting the correct discriminant from that can technically be arbitrarily complex. In practice it is desirable to keep the complexity low though.
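
A small illustration of the distinction (a sketch: Option<&u32> stores no separate tag because the null niche encodes None, yet mem::discriminant still reports a logical discriminant):

use std::mem;

fn main() {
    let n = 42u32;
    let some: Option<&u32> = Some(&n);
    let none: Option<&u32> = None;
    // No tag field: the enum is pointer-sized, the niche does the work...
    assert_eq!(mem::size_of::<Option<&u32>>(), mem::size_of::<&u32>());
    // ...but the discriminant is still a well-defined, observable value.
    assert_ne!(mem::discriminant(&some), mem::discriminant(&none));
}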

alercah commented 5 years ago

Oh, I put that message in the wrong thread, oops.

SimonSapin commented 5 years ago

As long as B and C can fit before or after A's Niche, we can still apply the optimization.

Should struct field reordering be tweaked to try to push a niche to the start or end of a struct layout (if possible without increasing the struct size) in order to make this optimization more effective when that struct is used in an enum?

eddyb commented 5 years ago

Optimizing for discriminant extraction seems, to me, far less valuable on average than optimizing for space.

Note that complicated discriminant extraction can have second-order effects on LLVM optimizations and can slow down code more than the speedups you get from slightly smaller type sizes.

rhendric commented 5 years ago

I think frunk's Coproducts would also benefit from this optimization; currently, a Coprod!(i32, bool, i64) is bulkier than the equivalent enum (24 bytes vs. 16). If Coproducts can have performance parity with enums, that would make it easier to sell them as a way to experiment with ad-hoc sum types before any of the relevant RFCs land.

mjbshaw commented 3 years ago

One optimization along these lines that might be a simple starting point is when the non-niche variants are ZSTs. For example:

pub enum E1 {
  A(u8, std::num::NonZeroU8),
  B,
}

pub enum E2 {
  A(u8, std::num::NonZeroU8),
  B,
  C,
}

Currently, E1 is already optimized so its size is just 2 bytes, but E2 has a size of 3. E2 could also be compressed to 2 bytes by reusing A.0 to store the discriminant whenever A.1 is zero: A(x, nz) keeps its payload unchanged, while B and C are both encoded with A.1 == 0 and two distinct values in A.0.

The exact values of A.0 in those last two cases are up to the compiler.

I don't think this would complicate mem::discriminant too much, either.
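
A sketch of the sizes as they stand (reusing E1 and E2 above; NonZeroU8 leaves exactly one spare value, 0, so only E1's single dataless variant fits in the niche today):

fn main() {
    println!("{}", std::mem::size_of::<E1>()); // 2: B is encoded as A.1 == 0
    println!("{}", std::mem::size_of::<E2>()); // currently 3: one niche value
                                               // cannot distinguish B from C
}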

stephanemagnenat commented 3 years ago

I just stumbled upon this issue while exploring a port of libnabo, a fast k-nearest-neighbour library with a focus on performance and memory efficiency. Having this feature would allow the same tight memory layout as in C++, but in fully safe Rust, with the following enum as the key data structure:

enum Node<T> {
    SplitNode {
        cut_dim_and_right_child_index: core::num::NonZeroU32,
        cut_value: T
    },
    LeafNode(u32)
}

Here T = f32, and LeafNode(u32) holds an index into an array of buckets, which hold the points.

The optimal size for Node<f32> is 8 bytes; currently it is 12 bytes.
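
A sketch checking the sizes for the Node<f32> case (the SplitNode payload alone is already the optimal 8 bytes; today the enum adds a separate tag because both variants carry data):

fn main() {
    println!("{}", std::mem::size_of::<Node<f32>>()); // currently 12
    // The SplitNode payload on its own:
    println!("{}", std::mem::size_of::<(core::num::NonZeroU32, f32)>()); // 8
}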

scottmcm commented 3 years ago

@stephanemagnenat I'm not sure I follow the suggestion here. How could Rust figure out the active variant between the two? f32 supports all bit patterns, so LeafNode(1) can't be distinguished from SplitNode { cut_dim_and_right_child_index: 1, .. } AFAICT.

rhendric commented 3 years ago

NonZeroU32 has a niche; if that field is zero, you have a LeafNode, and otherwise a SplitNode, with the remaining 4 bytes used for either the value of the LeafNode or the cut_value field.

upsuper commented 1 year ago

IIUC, this optimization could also work to make Cow<str> and Cow<[T]> 3 words, right? (cf. discussion on reddit)

So this could potentially regress access to the data of a Cow<str> or Cow<[T]>.

Currently, for code like

pub fn read_len(x: &std::borrow::Cow<str>) -> usize {
    x.len()
}

the compiler generates optimized code:

example::read_len:
        mov     rax, qword ptr [rdi]
        mov     rax, qword ptr [rdi + 8*rax + 16]
        ret

i.e. without checking the tag.

But if we optimize Cow to take advantage of the non-nullness of the pointer, the ptr and len fields would no longer align across the two variants, and accessing the data would always require an extra tag check.

SimonSapin commented 1 year ago

The code above does check the tag since the length field is at different offsets in the two variants, though the 8*rax multiplication makes the offset computation branchless.

With a more compact Cow<str> representation, the length will be at the same offset in both variants, so the code accessing it will be even simpler. However, accessing the pointer field will regress and require a branch:

https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=1b23594473567cc429a98932b9f8592d

playground::read_len:
    mov rax, qword ptr [rdi + 16]
    ret

playground::read_ptr:
    mov rax, qword ptr [rdi]
    test    rax, rax
    jne .LBB1_2
    mov rax, qword ptr [rdi + 8]

.LBB1_2:
    ret

pub union Cow {
    borrowed: (usize, &'static str),
    owned: std::mem::ManuallyDrop<String>,
    owned_repr: (*mut u8, usize, usize), // Hack for demo outside of std
}

impl std::ops::Deref for Cow {
    type Target = str;

    #[inline]
    fn deref(&self) -> &str {
        unsafe {
            if self.owned_repr.0.is_null() {
                self.borrowed.1
            } else {
                &**self.owned
            }
        }
    }
}

pub fn read_len(x: &Cow) -> usize {
    x.len()
}

pub fn read_ptr(x: &Cow) -> *const u8 {
    x.as_bytes().as_ptr()
}

oxalica commented 1 year ago

However accessing the pointer field will regress and require a branch:

playground::read_ptr:
  mov rax, qword ptr [rdi]
  test    rax, rax
  jne .LBB1_2
  mov rax, qword ptr [rdi + 8]

.LBB1_2:
  ret

It's possible to eliminate that branch via CMOVE. Not sure why LLVM refuses to do that currently. But it's definitely not blocked by niche-filling.

EDIT: The LLVM IR does optimize the code into a select. Thus I'd assume LLVM thinks CMOV here would perform no better than JNE + MOV, considering it always reads the memory.