Closed by eddyb 1 year ago
cc me
what is "niche" in this case?
A niche is a location in the type where some bit patterns aren't valid. For instance, `struct Foo(u32, bool, u32)` has a niche where the `bool` is, because only 0 and 1 are valid `bool`s. `Option<Foo>` can therefore use `Foo.1 == 2` to represent `None`.
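On a recent rustc this particular niche is already exploited for `Option`, which is easy to check (a minimal sketch, using the `Foo` defined above):

```rust
use std::mem::size_of;

#[allow(dead_code)]
struct Foo(u32, bool, u32);

fn main() {
    // Foo is 4 + 1 + 4 bytes plus padding to 4-byte alignment = 12.
    assert_eq!(size_of::<Foo>(), 12);
    // The bool's niche (only 0 and 1 are valid) lets Option<Foo>
    // encode None without any extra tag byte.
    assert_eq!(size_of::<Option<Foo>>(), size_of::<Foo>());
    println!("ok");
}
```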
@Gankro thanks, that makes sense.
We probably need better terminology: "invalid" values and an "invalid reuse" optimization?
Here is a test case we currently fail to optimize, but which this issue would fix:
Result<Result<u32, u32>, u32>
The compiler sees:
enum Result<Result<u32, u32>, u32> {
Ok({ tag: 0u32 | 1u32, payload: u32 }),
Err(u32),
}
which should lower to:
{ tag: 0u32 | 1u32 | 2u32, payload: u32 }
because the `Err` payload fits after the niche (`Ok.tag`).
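One way to check whether a given compiler performs this optimization is to compare the nested size against the inner one (a sketch; the outer size is printed rather than asserted, since it is exactly what this issue would change):

```rust
use std::mem::size_of;

fn main() {
    // The inner Result is a u32 tag (0 or 1) plus a u32 payload.
    assert_eq!(size_of::<Result<u32, u32>>(), 8);
    // Without this optimization the outer Result adds its own tag
    // (12 bytes total); with it, the Err payload fits after the
    // inner tag's niche and the size stays at 8.
    println!("{}", size_of::<Result<Result<u32, u32>, u32>>());
}
```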
Before merging this PR I think we should see benchmarks of both memory saved and run-time saved for some real program, like the rust compiler and Servo.
I have yet to see similar benchmarks for #45225 .
Some slightly more complex cases from webrender that would also benefit from this:
// Use the Opacity.0.tag niche
pub enum FilterOp {
Blur(f32),
Brightness(f32),
Contrast(f32),
Grayscale(f32),
HueRotate(f32),
Invert(f32),
Opacity(PropertyBinding<f32>, f32),
Saturate(f32),
Sepia(f32),
}
pub enum PropertyBinding<T> {
Value(T),
Binding(PropertyBindingKey<T>),
}
pub struct PropertyBindingKey<T> {
pub id: PropertyBindingId,
_phantom: PhantomData<T>,
}
pub struct PropertyBindingId {
namespace: IdNamespace,
uid: u32,
}
pub struct IdNamespace(pub u32);
// use the RoundedRect.1.mode.tag niche
pub enum LocalClip {
Rect(LayoutRect),
RoundedRect(LayoutRect, ComplexClipRegion),
}
#[repr(C)]
pub struct ComplexClipRegion {
pub rect: LayoutRect,
pub radii: BorderRadius,
pub mode: ClipMode,
}
#[repr(C)]
pub struct BorderRadius {
pub top_left: LayoutSize,
pub top_right: LayoutSize,
pub bottom_left: LayoutSize,
pub bottom_right: LayoutSize,
}
#[repr(C)]
pub enum ClipMode {
Clip,
ClipOut,
}
#[repr(C)]
struct LayoutRect(f32, f32, f32, f32);
#[repr(C)]
struct LayoutSize(f32, f32);
A simpler case that I think fits this optimization (or a specialization of it as a starting point):
// my use-case, a single value is most common
enum OccupiedSmallVec {
Single(Foo),
Multiple(Vec<Foo>),
}
If `Foo` is two `usize`s or smaller, then `OccupiedSmallVec` should be the same size as `Vec`, as if it were implemented via this union:
#![feature(untagged_unions)]
union OccupiedSmallVec {
// where `single.0 == 0`
single: (usize, Foo),
multiple: Vec<Foo>,
}
However, it currently requires a separate tag plus padding to align the pointers, wasting basically another `usize`.
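For comparison, the union layout's size can be checked on stable Rust (a sketch using `ManuallyDrop` instead of the `untagged_unions` feature, with `Foo` assumed to be two `usize`s):

```rust
use std::mem::{size_of, ManuallyDrop};

type Foo = [usize; 2];

// Stable stand-in for the untagged_unions sketch above.
#[allow(dead_code)]
union OccupiedSmallVecUnion {
    // where `single.0 == 0`
    single: (usize, Foo),
    multiple: ManuallyDrop<Vec<Foo>>,
}

#[allow(dead_code)]
enum OccupiedSmallVec {
    Single(Foo),
    Multiple(Vec<Foo>),
}

fn main() {
    // On a 64-bit target, Vec is (ptr, cap, len): 24 bytes.
    assert_eq!(size_of::<Vec<Foo>>(), 24);
    // The union needs no tag, so it matches Vec's size.
    assert_eq!(size_of::<OccupiedSmallVecUnion>(), 24);
    // The enum pays for a separate tag plus pointer-alignment
    // padding; print rather than assert, since this is what the
    // proposed optimization would change.
    println!("{}", size_of::<OccupiedSmallVec>());
}
```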
@abonander Not really, that's the entire optimization, assuming `Vec<Foo>` starts with a non-null pointer and you're packing `Foo` in the leftover space after that pointer.
From IRC (cc @nox): we could extend this to always compute the size of the tagged layout (which would likely be larger than `A` itself) and allow using more space around `A`'s fields, up to that size, to fit other variants, if we generally want to avoid overlapping variant data where possible. Then `Result<&T, E>` and similar would have the layout of `(Option<&T>, E)`.
IIUC, this optimization could also work to make `Cow<str>` and `Cow<[T]>` 3 words, right? (cf. discussion on reddit)
Yes.
Another example which can be optimized is:
pub enum Bar {
A = 1,
B = 2,
C = 3,
}
pub enum Baz {
D = 4,
E = 5,
F = 6,
}
pub enum Foo {
Bar(Bar),
Baz(Baz),
}
Currently `Foo` takes 2 bytes, while it could trivially be represented in 1.
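The current 2-byte layout is easy to verify (a sketch; sizes as of current rustc):

```rust
use std::mem::size_of;

#[allow(dead_code)]
pub enum Bar { A = 1, B = 2, C = 3 }
#[allow(dead_code)]
pub enum Baz { D = 4, E = 5, F = 6 }
#[allow(dead_code)]
pub enum Foo { Bar(Bar), Baz(Baz) }

fn main() {
    // Each inner enum fits in one byte.
    assert_eq!(size_of::<Bar>(), 1);
    assert_eq!(size_of::<Baz>(), 1);
    // Foo stores its own tag next to the payload byte, even though
    // the payloads' valid values (1..=3 and 4..=6) never overlap.
    assert_eq!(size_of::<Foo>(), 2);
    println!("ok");
}
```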
@newpavlov That’s way harder to optimise because it becomes more complicated to compute the discriminant of a Foo value.
Another example which can be optimized is:
This cannot happen unless we make `Bar` and `Baz` types with a size of less than a byte, which is not something we can do (for a variety of reasons). Consider that it is possible to take an internal reference into data stored within some variant of `Foo` (i.e. either `&Bar` or `&Baz`), and this reference, when dereferenced, should return the original data. The only way to achieve that would be to use some of the "unused" bits to store the active variant of the enum, but that would then require bit twiddling when dereferencing any `&Bar` or `&Baz`, even if those weren't stored within the enum.
@nagisa `Bar` and `Baz` have disjoint sets of valid values, so it could be done. It doesn't fit the "niche" scheme, though.
@nox I think it can be handled as a special case: if all enum elements are field-less enums, check that the tags do not overlap and fit into a minimal repr. Yes, making it more generic will be significantly more involved, but even with the field-less constraint this optimization would help a lot for FFI and parsing cases. (Currently you often have to keep a huge number of `const`s, use unwieldy enums, or sacrifice performance.)
@nagisa Ah, I should've explicitly mentioned that I want it foremost for field-less enums (see the linked issue before my first message). More generic handling would be nice, but is less important.
It makes `mem::discriminant` way more complex. Currently, computing the discriminant of any enum value is at most a comparison and a subtraction. Such optimisations would make the computation more complex; for example, what would `if let Foo::Bar(_) = foo { ... }` compile to?
Optimizing for discriminant extraction seems, to me, far less valuable on average than optimizing for space. Optimizing for the ease of writing an intrinsic for it seems even less valuable. In the provided example, `if let Foo::Bar(_) = foo { .. }` is `if *std::mem::transmute::<&Foo, &u8>(&foo) <= 3 { ... }`, which doesn't even have a subtraction.
I just now realized that `std::mem::discriminant` is, in general, not required to correspond to the specified discriminant. The `Discriminant` type does not directly expose the actual value of the discriminant, and like all intrinsics, the underlying intrinsic is unstable. Only the `Debug` and `Hash` instances expose it indirectly, and I don't think those should be considered guaranteed to be stable. In fact, I believe the only way you can observe the discriminant of a variant of a `#[repr(Rust)]` enum is via a numeric cast, and you can't do that for an enum with fields. So I would propose that, at least for now, explicit discriminants only be permitted on enums with other representations.
You need to distinguish between the tag and the discriminant. The tag is an arbitrary bit pattern that stores a value from which the discriminant can be extracted. The discriminant exists even for types like `Option<&u32>`, which have no tag. Tag extraction really is just a field access; getting the correct discriminant from it can technically be arbitrarily complex, though in practice it is desirable to keep the complexity low.
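The `Option<&u32>` case can be illustrated directly: no tag field exists anywhere in the layout, yet `mem::discriminant` still works (a minimal sketch):

```rust
use std::mem::{discriminant, size_of};

fn main() {
    // The null-pointer niche means Option<&u32> is pointer-sized:
    // there is no tag field in the layout.
    assert_eq!(size_of::<Option<&u32>>(), size_of::<&u32>());

    let x = 5u32;
    let some: Option<&u32> = Some(&x);
    let none: Option<&u32> = None;
    // The discriminant is still well-defined; here it is computed
    // from the pointer bits rather than read from a tag.
    assert_ne!(discriminant(&some), discriminant(&none));
    println!("ok");
}
```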
Oh, I put that message in the wrong thread, oops.
As long as `B` and `C` can fit before or after `A`'s niche, we can still apply the optimization.
Should struct field reordering be tweaked to try to push a niche to the start or end of a struct layout (if possible without increasing the struct size) in order to make this optimization more effective when that struct is used in an enum?
Optimizing for discriminant extraction seems, to me, far less valuable on average than optimizing for space.
Note that complicated discriminant extraction can have second-order effects on LLVM optimizations and can slow down code more than the speedups you get from slightly smaller type sizes.
I think frunk's Coproducts would also benefit from this optimization; currently, a `Coprod!(i32, bool, i64)` is bulkier than the equivalent enum (24 bytes vs. 16). If Coproducts could have performance parity with enums, that would make it easier to sell them as a way to experiment with ad-hoc sum types before any of the relevant RFCs land.
One optimization along these lines that might be a simple starting point is when the non-niche variants are ZSTs. For example:
pub enum E1 {
A(u8, std::num::NonZeroU8),
B,
}
pub enum E2 {
A(u8, std::num::NonZeroU8),
B,
C,
}
Currently, `E1` is already optimized so its size is just 2, but `E2` has a size of 3. It's possible for `E2` to be compressed to just 2 bytes by reusing `A.0` to store the discriminant when `A.1` is zero. In other words:
- `A.1 != 0` ⇒ discriminant is `A`.
- `A.0 == 0 && A.1 == 0` ⇒ discriminant is `B`.
- `A.0 == 1 && A.1 == 0` ⇒ discriminant is `C`.

The exact value of `A.0` in those last two cases is up to the compiler.
I don't think this would complicate `mem::discriminant` too much, either.
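The sizes claimed above check out on current rustc (a quick sketch):

```rust
use std::mem::size_of;
use std::num::NonZeroU8;

#[allow(dead_code)]
pub enum E1 {
    A(u8, NonZeroU8),
    B,
}

#[allow(dead_code)]
pub enum E2 {
    A(u8, NonZeroU8),
    B,
    C,
}

fn main() {
    // NonZeroU8's single niche value (0) is enough to encode B...
    assert_eq!(size_of::<E1>(), 2);
    // ...but not both B and C, so E2 falls back to a separate tag.
    assert_eq!(size_of::<E2>(), 3);
    println!("ok");
}
```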
I just stumbled upon that question while exploring porting libnabo, a fast k-nearest-neighbour library with a focus on performance and memory efficiency. Having that feature would allow the same tight memory layout as in C++ but in fully safe Rust, by having the following enum as the key data structure:
enum Node<T> {
SplitNode {
cut_dim_and_right_child_index: core::num::NonZeroU32,
cut_value: T
},
LeafNode(u32)
}
for `T = f32`, with `LeafNode(u32)` holding an index into an array of buckets, which hold the points. The optimal size for `Node<f32>` is 8 bytes; currently it is 12 bytes.
@stephanemagnenat I'm not sure I follow the suggestion here. How could rust figure out the active variant between the two? `f32` supports all bit patterns, so `LeafNode(1)` can't be distinguished from `SplitNode { cut_dim_and_right_child_index: 1, .. }` AFAICT.
`NonZeroU32` has a niche: if that field is zero, you have a `LeafNode`, and otherwise a `SplitNode`, with the remaining 4 bytes used for either the value of the `LeafNode` or the `cut_value` field.
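A quick check of the sizes discussed (the `SplitNode` payload itself is 8 bytes; whether `Node<f32>` reaches that depends on the optimization proposed here, so its size is printed rather than asserted):

```rust
use std::mem::size_of;
use std::num::NonZeroU32;

#[allow(dead_code)]
enum Node<T> {
    SplitNode {
        cut_dim_and_right_child_index: NonZeroU32,
        cut_value: T,
    },
    LeafNode(u32),
}

fn main() {
    // The SplitNode payload alone is exactly 8 bytes.
    assert_eq!(size_of::<(NonZeroU32, f32)>(), 8);
    // 8 if the LeafNode index can live after the NonZeroU32 niche,
    // 12 with a separate 4-byte-aligned tag.
    println!("{}", size_of::<Node<f32>>());
}
```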
IIUC, this optimization could also work to make `Cow<str>` and `Cow<[T]>` 3 words, right? (cf. discussion on reddit)
So this could potentially regress accessing the data of a `Cow` with a slice / str payload.
Currently, for code like
pub fn read_len(x: &std::borrow::Cow<str>) -> usize {
x.len()
}
the compiler generates optimized code:
example::read_len:
mov rax, qword ptr [rdi]
mov rax, qword ptr [rdi + 8*rax + 16]
ret
i.e. without check of the tag.
But if we optimize `Cow` to take advantage of the non-nullness of the pointer, the ptr and len fields would no longer align across the two variants, and it would need to always do an extra tag check when accessing data.
The code above does check the tag, since the length field is at a different offset in the two variants, though the `8*rax` multiplication makes the offset computation branchless. With a more compact `Cow<str>` representation the length will be at the same offset, so the code accessing it will be even simpler. However, accessing the pointer field will regress and require a branch:
playground::read_len:
mov rax, qword ptr [rdi + 16]
ret
playground::read_ptr:
mov rax, qword ptr [rdi]
test rax, rax
jne .LBB1_2
mov rax, qword ptr [rdi + 8]
.LBB1_2:
ret
pub union Cow {
borrowed: (usize, &'static str),
owned: std::mem::ManuallyDrop<String>,
owned_repr: (*mut u8, usize, usize), // Hack for demo outside of std
}
impl std::ops::Deref for Cow {
type Target = str;
#[inline]
fn deref(&self) -> &str {
unsafe {
if self.owned_repr.0.is_null() {
self.borrowed.1
} else {
&**self.owned
}
}
}
}
pub fn read_len(x: &Cow) -> usize {
x.len()
}
pub fn read_ptr(x: &Cow) -> *const u8 {
x.as_bytes().as_ptr()
}
However accessing the pointer field will regress and require a branch:
playground::read_ptr:
mov rax, qword ptr [rdi]
test rax, rax
jne .LBB1_2
mov rax, qword ptr [rdi + 8]
.LBB1_2:
ret
It's possible to eliminate that branch via `CMOVE`. I'm not sure why LLVM refuses to do that currently, but it's definitely not blocked by niche-filling.
EDIT: The LLVM IR does optimize the code into a `select`. Thus I'd assume LLVM thinks a `CMOV` here would perform no better than `JNE` + `MOV`, considering it always reads the memory.
The current implementation only handles enums of roughly the form:
This can be seen as a special case, where `B` and `C` occupy 0 bytes, of:
As long as `B` and `C` can fit before or after `A`'s niche, we can still apply the optimization.
Also see https://github.com/rust-lang/rfcs/issues/1230#issuecomment-345456535 for the initial description.