rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Idea: guaranteed-zero padding for niches #70230

Open RalfJung opened 4 years ago

RalfJung commented 4 years ago

As suggested here, one option to gain more layout optimizations would be (either for all repr(Rust) types or with per-type opt-in) to use a different kind of padding.

Right now, padding bytes may contain arbitrary data, and can be omitted when copying as they are "unstable" in a typed copy. Instead, we could also imagine a kind of padding ("zero-padding") that is guaranteed to be zero. This has the disadvantage that padding bytes need to be initialized on struct creation and copied when the struct is copied, but it has the advantage that padding bytes can become "niches" for layout optimizations.
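To make the motivation concrete, here is a small sketch (sizes are what current rustc produces for `repr(Rust)`; they are not guaranteed by the language): because padding bytes may hold arbitrary data today, they cannot serve as a niche, so `Option` around a padded struct needs extra bytes for its discriminant.

```rust
use std::mem::size_of;

// Three data bytes, alignment 2, so one byte of padding somewhere.
struct S {
    a: u8,
    b: u16,
}

fn main() {
    // Today the padding byte holds arbitrary data, so it cannot encode
    // `None`; the discriminant needs its own storage (rounded up to align 2).
    assert_eq!(size_of::<S>(), 4);
    assert_eq!(size_of::<Option<S>>(), 6);
    // With guaranteed-zero padding, `None` could be a nonzero value in the
    // padding byte, and `Option<S>` could shrink back to 4 bytes.
}
```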

LLVM's padding is of the "unstable" kind, but we could just add explicit fields ourselves (fields that are guaranteed to be zero) when translating zero-padded types to LLVM.

Cc @eddyb

eddyb commented 4 years ago

LLVM's padding is of the "unstable" kind, but we could just add explicit fields ourselves (fields that are guaranteed to be zero) when translating zero-padded types to LLVM.

LLVM has no concept of "padding" outside of "first-class aggregates", which we only use for multiple (well, 2) return values (via ScalarPair), and nothing else (no values in functions, no global initializers). And really, it's basically like loading/storing the fields individually.

As long as a type is Abi::Aggregate, and it's not copied by explicit field-by-field accesses, LLVM only sees a memcpy (so size and alignment), nothing else.

EDIT: oh, and there's passing values around in calls - that's also not LLVM, it's us; we'd have to make sure we don't apply architecture-specific call ABI rules (e.g. for extern "C" fns) that introspect the field structure of the type, and instead treat it as an opaque blob.

RalfJung commented 4 years ago

LLVM has no concept of "padding" outside of "first-class aggregates", which we only use for multiple (well, 2) return values (via ScalarPair), and nothing else (no values in functions, no global initializers). And really, it's basically like loading/storing the fields individually.

I don't know what you mean. When I define a struct/union in LLVM, won't it add padding in between the fields?

LLVM is permitted to turn a copy of struct-type into a field-by-field copy, and AFAIK we had examples where it did that? This is the kind of transformation that becomes illegal with "zero-padding" (unless we somehow already know that the target memory is zero, e.g. because it is already valid at the given type).

bjorn3 commented 4 years ago

What would be the overhead of the actual zeroing of the padding? Especially for cg_clif, which barely performs any memory optimizations, I imagine this will have a significant overhead. The only memory optimization cg_clif performs yields a ~15% improvement, but that optimization is currently so dumb that I think zeroing padding would prevent it from applying at all. For cg_llvm it will just mean more IR to churn through with optimizations enabled (a compile-time slowdown), or a runtime slowdown without optimizations.

RalfJung commented 4 years ago

Assuming the backend does naive copying (i.e., it always copies the entire struct and doesn't try to be clever by only copying the fields and ignoring padding), then the only overhead is in the constructor: Struct { f1, f2 } would have to compile to code that also writes zeroes to padding between and after the fields.
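What that constructor lowering amounts to can be sketched by hand today with `MaybeUninit` (this is an illustration of the idea, not what rustc would actually emit; a real implementation would zero only the padding gaps rather than the whole allocation):

```rust
use std::mem::MaybeUninit;

struct S {
    a: u8,
    b: u16, // one padding byte exists somewhere (3 data bytes, align 2)
}

// Sketch of what `S { a, b }` would compile to under zero-padding:
// zero every byte of the slot first, then write the fields over it,
// leaving the padding byte guaranteed to be 0.
fn make_zero_padded(a: u8, b: u16) -> S {
    let mut slot = MaybeUninit::<S>::zeroed();
    let p = slot.as_mut_ptr();
    unsafe {
        std::ptr::addr_of_mut!((*p).a).write(a);
        std::ptr::addr_of_mut!((*p).b).write(b);
        slot.assume_init()
    }
}

fn main() {
    let s = make_zero_padded(1, 2);
    assert_eq!((s.a, s.b), (1, 2));
}
```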

bjorn3 commented 4 years ago

Assuming the backend does naive copying (i.e., it always copies the entire struct and doesn't try to be clever by only copying the fields and ignoring padding)

Yes

then the only overhead is in the constructor

That is exactly what I am afraid will have a >5% overhead on compile time and, when not using optimizations, on runtime. Especially for enums it may result in a lot more stores being generated.

bjorn3 commented 4 years ago

Also, in the MIR returned by instance_mir, the constructor is desugared into individual field assignments. This means that you either have to zero the full local in the prologue, or you have to do a complex analysis to find which bytes will be overwritten later. For the latter you can't just take the field offsets, as different enum variants may have different padding.

RalfJung commented 4 years ago

I was thinking of struct padding only so far, where it should be pretty trivial to write some zeroes between the fields.

But I also don't see the issue with enums. In a constructor, the enum variant is statically known. So you just look at the fields of the initial variant, and zero the gaps between them.

That is exactly what I am afraid of having a >5% overhead on compile time and, when not using optimizations, runtime.

That sounds like a wild guess to me -- do you have any foundation for it being 5% rather than, say, 0.5%? I'd expect the impact to be tiny, since most structs will have way more data bytes than padding bytes and most code is not running constructors.

bjorn3 commented 4 years ago

But I also don't see the issue with enums. In a constructor, the enum variant is statically known. So you just look at the fields of the initial variant, and zero the gaps between them.

I guess zeroing during the codegen of SetDiscriminant will work. That will unnecessarily duplicate zeroing in some cases, though.

That sounds like a wild guess to me -- do you have any foundation for it being 5% rather than, say, 0.5%? I'd expect the impact to be tiny, since most structs will have way more data bytes than padding bytes and most code is not running constructors.

Not much, but given the huge speed win of the very very dumb memory optimization I implemented (bail out on address taken, more than one store possibly preceding a load, ...) I expect the cost to be relatively high. I can try to benchmark it for cg_clif though.

upsuper commented 4 years ago

Assuming the backend does naive copying (i.e., it always copies the entire struct and doesn't try to be clever by only copying the fields and ignoring padding), then the only overhead is in the constructor: Struct { f1, f2 } would have to compile to code that also writes zeroes to padding between and after the fields.

Another potential concern: if reading the padding is undefined, the compiler may be able to vectorize some writes even when they would overwrite the padding area; but if padding is defined to be zero, such an optimization becomes less feasible.

Not sure whether that's something really happening, though.

eddyb commented 4 years ago

I don't know what you mean. When I define a struct/union in LLVM, won't it add padding in between the fields?

Only if you use those types as "first class aggregates", as in, by-value. Oh and Abi::Scalar should be fine since we use it only when there is no padding after the scalar, so only Abi::ScalarPair and Abi::Vector lose padding (outside of non-Rust call ABIs).

When used for field access only, like we do, it's just a silly way to specify a field offset, nothing more.

llvm.memcpy calls don't see a type, they just know how many bytes to copy, and we pass the full size (aka "stride"), not the smaller unpadded size, anyway (mostly because we don't track the latter).


Also, if you look at this example's LLVM IR, this is how we lower non-unions:

```rust
#[repr(C)] // To avoid field reordering.
pub struct S(i16, i32, i64);
```

```llvm
%S = type { [0 x i16], i16, [1 x i16], i32, [0 x i32], i64, [0 x i64] }
```

Note how we've made the padding fields explicit, and fill them in even when LLVM would compute the right offsets without them anyway? Well, it's because LLVM has no way to specify a custom alignment.

And without tracking "alignment from LLVM leaves" for every layout, we can't easily know in the general case whether LLVM will compute the correct offset or not, so we always force it with explicit padding.

Anyway, all that "padding" is just to make LLVM compute the right offset numbers, we should seriously consider just using one zero-length array to force an alignment and one byte array for offsets.


@RalfJung I suppose the only way LLVM knows something like S above has padding, especially in the local stack frame, is that some byte ranges are never written to.

If they were, it wouldn't matter how you expressed the offsets, and unless used by value (which you shouldn't do because support is dismal) LLVM struct (tuple) "types" are a red herring.

RalfJung commented 4 years ago

Note how we've made the padding fields explicit, and fill it in even when LLVM would compute the right offsets without it anyway? Well, it's because LLVM has no way to specify a custom alignment.

Oh I see, so LLVM itself couldn't even do the "replace copy of struct by copy of fields skipping padding" transformation, because it does not know which fields are padding. Nothing would have to change here for guaranteed-zero padding.


Btw, with nightly features we can already "manually" zero-pad a struct, and layout optimizations can exploit that:

```rust
#![feature(rustc_attrs)]

#[rustc_layout_scalar_valid_range_start(0)]
#[rustc_layout_scalar_valid_range_end(0)]
struct ZeroPad(u8);

impl ZeroPad {
    const ZERO: ZeroPad = unsafe { ZeroPad(0) };
}

struct Test(u8, ZeroPad, u16);
fn main() {
    println!("{}", std::mem::size_of::<Option<Test>>());
}
```

Playground

RustyYato commented 4 years ago

@RalfJung We can get even simpler, just use an enum!

```rust
#[repr(u8)]
enum ZeroPad {
    Zero = 0
}
```

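
Putting the two snippets together gives a complete stable-Rust demonstration (sizes are what current rustc produces; the niche optimization is a compiler behavior, not a language guarantee):

```rust
use std::mem::size_of;

// Stable-Rust "manual zero padding": an enum whose only valid value is 0,
// giving the compiler a niche of 255 invalid values in that byte.
#[repr(u8)]
#[derive(Clone, Copy)]
enum ZeroPad {
    Zero = 0,
}

// Use it to fill the byte that would otherwise be padding.
struct Test(u8, ZeroPad, u16);

fn main() {
    // `Option<Test>` needs no extra discriminant byte: `None` is encoded
    // as a nonzero value in the `ZeroPad` byte.
    assert_eq!(size_of::<Test>(), 4);
    assert_eq!(size_of::<Option<Test>>(), 4);
}
```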
RalfJung commented 4 years ago

@KrishnaSannasi indeed, that works just as well. Nice catch :)

bjorn3 commented 4 years ago

I am currently trying to implement zero padding on SetDiscriminant and I already got a miscompilation for FieldPlacement::Arbitrary. This suggests that it is non-trivial to write code to find the padding.

RalfJung commented 4 years ago

It should be easier for structs though, right? As mentioned before, I mostly had structs in mind, where padding is much simpler than for enums.

It may make more sense to say that Deaggregator actually has to respect zero-padding in its lowering -- with per-field initialization, the necessary information is already mostly gone and would have to be carefully reconstructed.

bjorn3 commented 4 years ago

The enum variant is as hard as the struct variant if you take the route of zeroing during SetDiscriminant. I just used layout.for_variant().

RalfJung commented 4 years ago

Not struct variants. Structs. As in struct Foo { field1: type1, field2: type2 }.

bjorn3 commented 4 years ago

Yes, I meant structs. By "struct variant" I meant the variant of the zeroing codegen that handles structs. I can understand the confusion though. By the way, it turns out I was zeroing scalars, as they get FieldPlacement::Arbitrary without fields.

RalfJung commented 4 years ago

But there is no SetDiscriminant for structs, is there? There is none here.

RalfJung commented 4 years ago

Also, for your LayoutDetails troubles, this could be helpful: https://github.com/rust-lang/rust/pull/69901

bjorn3 commented 4 years ago

I thought there was a SetDiscriminant to finalize the construction of every aggregate, even for structs. Something with partial initialization and borrowck. I guess I will try to zero during the function prologue instead.

RalfJung commented 4 years ago

I don't think this is a task for codegen. MIR has to make it clear when zeroing is necessary. That is the case when Aggregate initialization is used, but the Deaggregate path destroys that information.

eddyb commented 4 years ago

Part of the plan to get rid of the Deaggregate pass, btw, was to indicate full initialization with SetDiscriminant (which could be renamed to FinalizeVariant or something).

You could make Deaggregate keep SetDiscriminant to be able to access it in codegen.

bjorn3 commented 4 years ago

The cost of zeroing the padding seems to be less than I expected. Both during SetDiscriminant (enum only) and during the prologue (adt only) it doesn't seem to have a negative impact. In fact it seems to slightly improve performance. I don't know if this is just my computer warming up (the benches with padding happened to be sorted later alphabetically) or if it is caused by, for example, the store-to-load forwarding of the CPU. More benchmarking with the final implementation method will likely be necessary.

Zeroing code for cg_clif:

```rust
let variant_layout = place.layout().for_variant(fx, variant_idx); // Remove `for_variant` when zeroing during the prologue
let mut padding_ranges = vec![];
match &variant_layout.fields {
    layout::FieldPlacement::Union(field_count) => {
        padding_ranges.push(
            (0..*field_count)
                .map(|field| variant_layout.field(fx, field).size)
                .max()
                .unwrap_or(Size::ZERO)
                ..place.layout().size,
        );
    }
    layout::FieldPlacement::Array { stride, count } => {
        for field in 0..*count as usize {
            padding_ranges.push(
                variant_layout.fields.offset(field) + variant_layout.field(fx, field).size
                    ..variant_layout.fields.offset(field) + *stride,
            );
        }
    }
    layout::FieldPlacement::Arbitrary { .. } => {
        let mut last_field = None;
        for field in variant_layout.fields.index_by_increasing_offset() {
            if let Some(last_field) = last_field {
                padding_ranges.push(
                    variant_layout.fields.offset(last_field)
                        + variant_layout.field(fx, last_field).size
                        ..variant_layout.fields.offset(field),
                );
            }
            last_field = Some(field);
        }
        if let Some(last_field) = last_field {
            padding_ranges.push(
                variant_layout.fields.offset(last_field)
                    + variant_layout.field(fx, last_field).size
                    ..variant_layout.size,
            );
        } else {
            //padding_ranges.push(Size::ZERO..variant_layout.size);
        }
    }
}
if let ty::Adt(adt_def, _) = place.layout().ty.kind {
    //if adt_def.is_struct() {
    //} else {
    //    padding_ranges.clear();
    //}
} else {
    padding_ranges.clear();
}
for padding_range in padding_ranges {
    let addr = place
        .to_ptr(fx)
        .offset_i64(fx, i64::try_from(padding_range.start.bytes()).unwrap())
        .get_addr(fx);
    let size = padding_range.end.bytes() - padding_range.start.bytes();
    fx.bcx.emit_small_memset(
        fx.module.target_config(),
        addr,
        0,
        size,
        std::cmp::min(
            /*greatest_divisible_power_of_two*/ (size as i64 & -(size as i64)) as u64,
            8u64,
        ) as u8, /*FIXME*/
    );
}
```
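
The gap-finding logic of the `Arbitrary` arm can be sketched independently of any backend: sort the fields by offset, then every gap between one field's end and the next field's start (plus the trailing gap up to the padded size) is padding. A minimal self-contained sketch with a hypothetical `padding_ranges` helper:

```rust
use std::ops::Range;

/// Given `(offset, size)` pairs for each field and the total (padded) size
/// of the type, return the byte ranges that are padding. Hypothetical
/// stand-in for the `FieldPlacement::Arbitrary` arm above.
fn padding_ranges(mut fields: Vec<(u64, u64)>, total_size: u64) -> Vec<Range<u64>> {
    fields.sort_by_key(|&(offset, _)| offset);
    let mut ranges = Vec::new();
    let mut end_of_last = 0;
    for (offset, size) in fields {
        if offset > end_of_last {
            ranges.push(end_of_last..offset); // gap before this field
        }
        end_of_last = offset + size;
    }
    if total_size > end_of_last {
        ranges.push(end_of_last..total_size); // trailing padding
    }
    ranges
}

fn main() {
    // #[repr(C)] struct S(i16, i32, i64): offsets 0, 4, 8; total size 16.
    let ranges = padding_ranges(vec![(0, 2), (4, 4), (8, 8)], 16);
    assert_eq!(ranges, vec![2..4]); // two padding bytes between the i16 and i32
}
```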
bjorn3 commented 4 years ago

Full benchmark details:

```
[BENCH RUN] ebobby/simple-raytracer
Benchmark #1: ./raytracer_cg_clif_no_zero_pad
  Time (mean ± σ):     13.009 s ±  0.853 s    [User: 12.964 s, System: 0.016 s]
  Range (min … max):   12.061 s … 14.689 s    10 runs

Benchmark #2: ./raytracer_cg_clif_release
  Time (mean ± σ):     10.194 s ±  0.454 s    [User: 10.162 s, System: 0.012 s]
  Range (min … max):    9.523 s … 11.052 s    10 runs

Benchmark #3: ./raytracer_cg_clif_zero_pad
  Time (mean ± σ):     12.320 s ±  0.416 s    [User: 12.302 s, System: 0.006 s]
  Range (min … max):   11.850 s … 13.185 s    10 runs

Benchmark #4: ./raytracer_cg_clif_zero_pad_arbitrary
  Time (mean ± σ):     12.618 s ±  0.684 s    [User: 12.583 s, System: 0.011 s]
  Range (min … max):   11.933 s … 13.932 s    10 runs

Benchmark #5: ./raytracer_cg_clif_zero_pad_prologue
  Time (mean ± σ):     11.997 s ±  0.076 s    [User: 11.976 s, System: 0.006 s]
  Range (min … max):   11.906 s … 12.155 s    10 runs

Benchmark #6: ./raytracer_cg_clif_zero_pad_prologue_enum_too
  Time (mean ± σ):     12.205 s ±  0.152 s    [User: 12.192 s, System: 0.006 s]
  Range (min … max):   12.096 s … 12.574 s    10 runs

Benchmark #7: ./raytracer_cg_clif_zero_pad_prologue_enum_too_release
  Time (mean ± σ):      9.440 s ±  0.081 s    [User: 9.425 s, System: 0.007 s]
  Range (min … max):    9.374 s …  9.656 s    10 runs

Summary
  './raytracer_cg_clif_zero_pad_prologue_enum_too_release' ran
    1.08 ± 0.05 times faster than './raytracer_cg_clif_release'
    1.27 ± 0.01 times faster than './raytracer_cg_clif_zero_pad_prologue'
    1.29 ± 0.02 times faster than './raytracer_cg_clif_zero_pad_prologue_enum_too'
    1.31 ± 0.05 times faster than './raytracer_cg_clif_zero_pad'
    1.34 ± 0.07 times faster than './raytracer_cg_clif_zero_pad_arbitrary'
    1.38 ± 0.09 times faster than './raytracer_cg_clif_no_zero_pad'
```
- `arbitrary` => for the SetDiscriminant version: don't ignore padding for `FieldPlacement::Arbitrary`.
- `release` => run the memory optimization.
- `prologue` => zero in the function prologue instead of during SetDiscriminant.
- `enum_too` => for the prologue version: don't ignore enums.

workingjubilee commented 4 years ago

Possibly relevant prior art, approaching guarantees of zero-initialized padding from a security perspective, and avoiding various accidents that might occur due to zero-initialized padding: Solving Uninitialized Stack Memory on Windows

atagunov commented 3 years ago

In fact it seems to slightly improve performance

Could it be

cuviper commented 3 years ago

Struct { f1, f2 } would have to compile to code that also writes zeroes to padding between and after the fields.

What about MaybeUninit<Struct>, does that also need to be created with zeroed padding? If I write f1, then f2, then call any of the assume_init* methods, I expect to be in a good state.

edit: @comex also mentioned this in https://github.com/rust-lang/unsafe-code-guidelines/issues/174#issuecomment-516688526.
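
For reference, the pattern in question (field names are illustrative): field-by-field initialization through `MaybeUninit` deliberately leaves the padding bytes uninitialized, which is sound today but would violate a guaranteed-zero-padding repr.

```rust
use std::mem::MaybeUninit;

struct S {
    f1: u8,
    f2: u16, // one padding byte exists in the layout (3 data bytes, align 2)
}

fn main() {
    let mut m = MaybeUninit::<S>::uninit();
    let p = m.as_mut_ptr();
    let s = unsafe {
        std::ptr::addr_of_mut!((*p).f1).write(1);
        std::ptr::addr_of_mut!((*p).f2).write(2);
        // Sound today: padding may be uninitialized. Under a zero-padding
        // repr this `assume_init` would be UB, because the padding byte
        // was never zeroed.
        m.assume_init()
    };
    assert_eq!((s.f1, s.f2), (1, 2));
}
```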

RalfJung commented 3 years ago

If I write f1, then f2, then call any of the assume_init* methods, I expect to be in a good state.

This assumption would not be true for repr(GuaranteedZeroPadding).

Something has to give, we won't get more niches for free.

cuviper commented 3 years ago

This assumption would not be true for repr(GuaranteedZeroPadding).

I see, I guess that's what you meant here:

(either for all repr(Rust) types or with per-type opt-in)

So repr(Rust) probably isn't feasible, but a new repr could be the opt-in.

eddyb commented 3 years ago

Small nit: the #[repr(C)] and #[repr(Rust)] naming styles are exceptions because they name languages; it should be #[repr(guaranteed_zero_padding)] instead.

But also I would expect something like #[repr(padding(zeroed))] or maybe even a more general form of #[repr(padding(fill_with(0)))].

RalfJung commented 3 years ago

I totally meant this syntax as a strawman. :)