rust-lang / unsafe-code-guidelines

Forum for discussion about what unsafe code can and can't do
https://rust-lang.github.io/unsafe-code-guidelines
Apache License 2.0
669 stars 58 forks source link

What are the values of a union type? (in particular, what is the validity invariant of a union) #438

Open RalfJung opened 1 year ago

RalfJung commented 1 year ago

When loading data at union type from one place and storing it in another -- what exactly is being required about the bytes, and what is preserved? In other words, what is the representation relation of a union type, and what is even the set of values that can be generated by a union-typed load?

Originally I intended the answer to be "the values are just lists of raw bytes as large as the union". IOW, a union has no validity requirements whatsoever, and all data inside it is perfectly preserved. However, reality disagrees. With the C ABI, some of the bytes in a union value do not always get preserved -- there can be padding. Also there is some desire to make unions have validity invariants, so that e.g. a union { usize, *const T } is noundef.

On the other hand, it is certainly not the case that the value of a union is "a value of one of its fields". Apart from the problem about writing decode for such a specification, it can be violated in safe code.

Note that for the purpose of this, we do assume that the LLVM type iN will perfectly preserve N bytes of memory. LLVM code generated from Rust can never put poison in memory (as that would not be preserved on a per-byte level, the entire iN gets "infected" if any byte is poison), which is currently accurate (Rust code has UB for any situation where the generated LLVM code might produce poison). LLVM iN values carrying provenance is incoherent but mostly works in practice and LLVM does not offer us a better option. We need a "byte" type to properly express our intent to LLVM; until that exist, we make do with what we have. If you want to discuss LLVM issues, please open a new thread -- this one here is solely about the Rust AM semantics.

Possible designs

MiniRust decides that a value of union type is a list of "chunks" where the bytes are perfectly preserved, but there can be gaps between those chunks where data is lost of a copy. The extent of these chunks can be computed from a Rust union type layout. That is sufficient to model the reality of union padding. The effective validity invariant here is still "any list of bytes allowed".

The noundef property could be achieved if we refine this "chunk" idea a bit further: every byte in a union is either "discarded", "preserved", or "initialized". Bytes which are padding in all variants get "discarded"; bytes which are padding only in some variants get "preserved"; bytes which are "in the data part" for all variants get "initialized". A value of union type is a list of bytes the size of the union with the constraint that "discarded" bytes must be Uninit and "initialized" bytes must not be Uninit. The validity invariant requires that all "initialized" bytes must be initialized -- and that's it; "discarded" bytes get flushed to Uninit on a decode. Then the Rust type union { usize, *const T } would translate to a union type where all bytes are "initialized", and we could emit noundef attributes for such a type. (I don't want multiple distinct AM values to be equivalent, hence the 'value of union type' predicate demands that all "uninit" bytes are truly Uninit, but of course the validity invariant allows those bytes to be anything. Note that the validity invariant is a predicate over "list of bytes", whereas "what are the values of this type" is a predicate over Value. The chunks idea achieves this by just not having the padding bytes in the Value; we could do the same here but that would be formally more awkward to express I think. Maybe I should just link to a MiniRust patch here, that might be more clear than trying to say this in English.^^)

We could in principle go further and tro to achieve something along the lines of "if a byte is non-null in each variant, it is also non-null in the union", but that becomes increasingly complicated -- in particular it requires the "generate union type description from Rust type" logic to encode more and more knowledge about validity invariants, when the entire intent of this "value representation" business was that the validity invariant is implicitly encoded by the representation relation.

What other designs should we consider? And are they worth the complexity?

scottmcm commented 1 year ago

As I was trying to figure out how to evaluate the possibilities, I started thinking that unions might be serving too many masters. I wonder if they want to be multiple different things? It might be that the "sugar for transmuting between values" use and the "enum that's always unsafe because there's no stored discriminant" intents are just so different that they should be different things...

RalfJung commented 1 year ago

So... basically, let's imagine what could be done if we didn't have to support code like this?

I first thought "not much". While validity might be defined as "any of the variants is valid", it becomes harder to define what happens on a typed copy. Consider a union of (u8, u16) and (u16, u8) (or rather, their repr(C) equivalents): if we do a copy at any variant type, at least one byte will be lost to padding. But if the original value is [0, 0, 0, 0], then we must copy and preserve all bytes, since we can't know which is the "active variant".

But there's actually a way to achieve that: we could say that a value of union type is a non-empty set of values of possible variants. So in decode, we decode the data at all variant types, and we put all the values that were successfully decoded into the union value. If none of the variant decodes successfully, we trigger UB. In encode, we encode all of them, and then we take the bytewise "maximum" of these N encodings along the "definedess" order defined in this section to compute the resulting output. Since all these values were created from the same input bytes, such a maximum must exist. (So really a value of union type is a non-empty set of values that are all mutually coherent -- in the sense that when they are encoded, the byte lists are related by "definedness".)

Another option would be to angelically pick which variant to do the copy at, but I'd rather avoid angelic non-determinism if possible.

That definition should give us maximal niche power. However, we also probably don't want it for MaybeUninit: under that definition, MaybeUninit<u8> would not preserve provenance! It would strictly behave as "either () or u8". Was it a mistake to say that MaybeUninit<u8> preserves provenance? Or do we truly need both kinds of unions (even disregarding backwards compatibility)?

Also, implementing this in Miri will be quite terrible...

Jules-Bertholet commented 1 year ago

Could we have different reprs to make the distinction? For example, #[repr(Rust)] unions are "every byte, considered independently, must either be valid at that position for one of the union's fields, or be undef", but #[repr(init)] doesn't allow undef, and #[repr(strict)] must always be fully valid for at least one of the fields.

Lokathor commented 1 year ago

I think one goal is for union { f32, u32 } to be "always valid for either field", even without a declared repr. Which doesn't mean we can't support other reprs, just that the default Rust repr should be the one which leads to safe code being able to use unions (when the fields line up appropriately).

saethlin commented 1 year ago

What optimizations can we get from noundef and from passing through niches?

For noundef I think we already apply the attribute to scalar loads where the scalar type does not permit being uninit, so I'm not really sure what additional optimizations open up here. Are there optimizations permitted by noundef unions that are not also justified by validity on typed copy?

For niches, doing layout optimizations on unions does sound like it would have some plausible value in general, but I expect a significant fraction of union use either doesn't have niches or the union is implementing tag packing itself in which case the user doesn't mind if the union never exposes a niche. Is there more value to exposing niches than permitting enum layout optimizations?

RalfJung commented 1 year ago

I think one goal is for union { f32, u32 } to be "always valid for either field", even without a declared repr.

Is it? I don't think so. It's not my goal, anyway. ;)

the default Rust repr should be the one which leads to safe code being able to use unions (when the fields line up appropriately).

When and how do you want safe code to be able to read union fields?

For noundef I think we already apply the attribute to scalar loads where the scalar type does not permit being uninit, so I'm not really sure what additional optimizations open up here. Are there optimizations permitted by noundef unions that are not also justified by validity on typed copy?

Hm, good question. That would depend on how smart LLVM is about these attributes, but of course we can always look to make it smarter. The problem @scottmcm ran into that made him ask for union nonnull is blocked on https://github.com/rust-lang/rust/issues/114383, which doesn't need any changes to the representation relation.

Lokathor commented 1 year ago

XD I've mentioned this to you before Ralf. I thought that's what you were already talking about when you said "there is some desire to make unions have validity invariants".

Many people want a future rust to support safe union field access when the fields of the union satisfy some sort of specific case that the compiler can automatically check by itself. Whatever is decided for unions now should hopefully allow for that to happen some day.

Being intentionally vague, and specifically expecting the reader to interpret the following words in the most positive light, the assumption is usually something like: "if the safe-transmute project completes, and safe-transmute could convert all field types to each other, then that union should obviously be fine to use from safe code."; So something like union {f32, u32} or union {u64, [u32; 2]} or similar things like that.

scottmcm commented 1 year ago

Are there optimizations permitted by noundef unions that are not also justified by validity on typed copy?

That would depend on how smart LLVM is about these attributes, but of course we can always look to make it smarter.

I tried to explore this a bit, but I don't know how successful I was. https://llvm.godbolt.org/z/4K9q9d3Yq

If I put a typed load in a function,

define i32 @_ZN7example3foo17h02b60a93fb12ffb0E(i32 %0) {
start:
  %x = alloca i32, align 4
  store i32 %0, ptr %x, align 4
  %_2 = load i32, ptr %x, align 4, !noundef !1
  ret i32 %_2
}

That just gets completely optimized out today, as far as I can tell

define i32 @_ZN7example3foo17h02b60a93fb12ffb0E(i32 returned %0) {
  ret i32 %0
}

Though Alive says it would be legal for it to mark the return type as noundef: https://alive2.llvm.org/ce/z/s5ttA7

So I don't know if the typed load would be sufficient in practice to get noundef-related optimizations. Probably depends on evil phase ordering questions :/

RalfJung commented 1 year ago

@Lokathor IIRC my stance in that discussion was that that isn't even a validity invariant question, it is primarily a safety invariant question. Or at least, it is certainly the case that safety invariants are sufficient to achieve what you are asking for, since it's all about "how can we soundly expose operation X to safe code".

So to make this a validity invariants requires stronger motivation.

I thought that's what you were already talking about when you said "there is some desire to make unions have validity invariants".

I was mostly thinking of @scottmcm, and of people asking for niches in unions.


@scottmcm that's not quite how the code would look, would it? If this is a getter fn get_int(Union) -> usize then rustc will already add noundef to the return value. I would expect the !noundef on the load to be sufficient for optimizations that come later and that work on %_2?

Was there a specific optimization you were looking for in the iterator code that went missing, or was it more a general concern without concrete examples?

joshlf commented 2 months ago

Am I understanding correctly that all of the proposals on the table satisfy the following property? Given a #[repr(C)] union where no value of any of the union's fields has padding, it is guaranteed that all valid values of that union will be free of uninitialized bytes?

E.g.:

#[repr(C)]
union Foo {
    a: u8,
    b: i8,
}

// It is guaranteed that `foo` contains no uninitialized bytes.
let foo: Foo = get_foo();

#[repr(u8)]
enum Bar {
    A(u8),
    B(u8, u8),
}

#[repr(C)]
union Bar {
    a: [u8; 3],
    b: Bar,
}

// `bar` might have uninitialized bytes if it was initialized as `b: Bar::A(...)`
let bar: Bar = get_bar();

The reason that this guarantee is useful is that it supports an oft-requested feature from zerocopy, which is to expose the underlying bytes of a union value. For example, one such use case is to permit safe field access. Another is to permit a kernel emulator to write a union into an untyped shared memory buffer (which requires viewing that union as a &[u8]).

If my understanding is correct, would folks be open to at least codifying this minimal guarantee in the Reference so that we can rely on it in unsafe code?

EDIT: cc @kupiakos, whose code needs this property

RalfJung commented 2 months ago

No that is definitely not the case. The proposal currently implemented in Miri and MiniRust does not satisfy this property: we say that all byte patterns are valid for all unions. (And furthermore, these bytes are exactly represented in the union value, except for bytes that are padding in all union variants -- those bytes are discarded in the value, i.e. they constitute padding for the union itself.)

I am also not convinced this is a guarantee we want. This is related to safe union field access which I think should involve some sort of opt-in.

chorman0773 commented 2 months ago

Also, you can make a union have a safety invariant of being initialized/storing one of its variants. I make a declarative macro in my emulator that does this and adds both safe access and bytemuck::Pod impls to the union. The language doesn't need to prohibit something for a library to.

joshlf commented 2 months ago

Also there is some desire to make unions have validity invariants, so that e.g. a union { usize, *const T } is noundef.

Quoting from the original comment here - is this still a goal? I understood the various designs that you outline in the first few comments in this thread as attempting to satisfy this goal.

IIUC, this is articulating the same goal that I have. Maybe I'm misunderstanding what this example was meant to convey?

scottmcm commented 2 months ago

Personally, I very much want a way to accomplish the goal of being able to store a conceptual-union of things without pessimizing to "yolo, might be anything including uninitialized". Being able to have something that's either u32 or f32 without needing to transmute or pointer-cast every where would be very nice, but isn't worth it if it costs a bunch of optimizations.

That said, if it helps opsem I think it'd also be ok to have that be something other than an ordinary union -- maybe we add an strict union/enum union which has precisely the "exactly one active field" rule, where any read through a variant other than the initialized one is immediate UB, and thus it disallows all the punning behaviours, as well as disallows the set-copy-subfields-in-safe-code kinds of things.

Lokathor commented 2 months ago

I think that it would be very useful to have unions that limit how much uninit they are without having to go all the way to having an active field.

joshlf commented 1 month ago

For those of you who have a similar use case: I filed https://github.com/rust-lang/unsafe-code-guidelines/issues/533, which would permit us to at least treat the desired initialization property as a safety property if not a soundness property (in other words, safe code could not violate it, although unsafe code could do so without being unsound).

RalfJung commented 1 month ago

I think that it would be very useful to have unions that limit how much uninit they are without having to go all the way to having an active field.

Yeah, agreed. And I also view "require init" as being a statement in the same category as "requires non-null".

So one option here is to let people annotate an attribute and the union that makes such validity requirements explicit, rather than implicitly inferring them from the field types.

But of course that would not work for generic code. Not sure if that s a concern that comes up here?