Open Centril opened 5 years ago
Padding bytes in a value are always undefined. If I read a value from a union, the padding bytes in the value are still undefined even if those bytes were well defined in the union through a write from a different variant. Reading that value does not change the status of whether a given byte is defined in a union.
Do you mean "from a union field"? Then I agree -- that is a "copy at the field type", the fact that we are reading from a union does not have any effect.
When writing a value to a union, any bytes in the union that correspond to padding bytes in the value will be set to undefined. Even if they were earlier set to a well defined value through a different variant, they are now undefined.
Again, do you mean "to a union field"? If yes, I agree again because unions are not really involved. The assignment happens at the type of that field, and only that type's rules apply.
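For illustration, here is a minimal sketch of the two field-access rules above. The names `Padded`, `Bag`, and `demo` are made up for this example, and the layout comments assume a typical target where `u16` is 2-aligned, so `Padded` has exactly one padding byte at offset 1.

```rust
use std::mem::size_of;

// Hypothetical type: under repr(C), one padding byte sits between `a` and `b`
// (assuming `u16` is 2-aligned, as on common targets).
#[repr(C)]
#[derive(Clone, Copy)]
struct Padded {
    a: u8,
    b: u16,
}

// Hypothetical union: `raw` covers every byte, `padded` has padding at offset 1.
#[repr(C)]
union Bag {
    raw: [u8; 4],
    padded: Padded,
}

fn demo() -> (u8, u16) {
    assert_eq!(size_of::<Padded>(), 4);
    // Every byte of the union is initialized through `raw`.
    let mut bag = Bag { raw: [0xFF; 4] };
    // Assignment at type `Padded`: by the rule discussed above, the byte at
    // offset 1 has to be treated as undefined from here on.
    bag.padded = Padded { a: 1, b: 0x0203 };
    let p = &bag as *const Bag as *const u8;
    // Read only the bytes that back `a` and `b`. Reading offset 1 as a `u8`
    // is exactly what the rule says we cannot rely on.
    unsafe {
        let a = p.read();
        let b = u16::from_ne_bytes([p.add(2).read(), p.add(3).read()]);
        (a, b)
    }
}

fn main() {
    println!("{:?}", demo());
}
```

The sketch deliberately never looks at the padding byte: the earlier `0xFF` written through `raw` is not something the rules let us observe after the write through `padded`.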
Is there any way that these rules could be stricter?
Yes: in your last clause, we could just say "when copying a union wholesale, all of its bytes are preserved". So if you e.g. zero-initialized the memory storing a union, and then copy that union, you could rely on the copy still being all zeroes, even at its padding bytes.
But that seems to not be what C does, and it seems there are calling conventions where we do not have the option of preserving all bytes.
@RalfJung
Yes, I did mean union fields.
Yes: in your last clause, we could just say "when copying a union wholesale, all of its bytes are preserved". So if you e.g. zero-initialized the memory storing a union, and then copy that union, you could rely on the copy still being all zeroes, even at its padding bytes.
I meant stricter in the sense that less code is legal under those rules. Not preserving any byte which is padding or undefined results in less code that is legal.
I meant stricter in the sense that less code is legal under those rules.
Oh I see.
Yes: at https://github.com/rust-lang/unsafe-code-guidelines/issues/73 some people propose that with a union like
#[repr(C)]
union U {
    f1: (bool, u8),
    f2: bool,
}
it should be UB to have the first byte of the union be anything but 0x00 or 0x01. Basically "if some byte of the union is subject to a restriction for all fields, the restriction also applies to the union as a whole". That would allow enum optimizations for unions. (When accessing a field, that is already UB because then the field type's rules apply, but we are talking about handling the union wholesale.)
But I think defining this precisely is too complicated and we should rather provide people with the tools to define their own "niches" for enum optimizations.
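As a rough illustration of what "enum optimizations" means here, the sketch below compares sizes with and without a usable niche. The `bool` and `MaybeUninit<bool>` sizes are the ones given in the `MaybeUninit` documentation; `sizes` is a made-up helper, and the sizes printed for `U` are whatever the current compiler picks, not a guarantee.

```rust
use std::mem::{size_of, MaybeUninit};

// The union from the comment above; every field starts with a `bool`.
#[allow(dead_code)]
#[repr(C)]
#[derive(Clone, Copy)]
union U {
    f1: (bool, u8),
    f2: bool,
}

fn sizes() -> (usize, usize, usize, usize) {
    (
        size_of::<bool>(),
        // `Option<bool>` stores `None` in one of the invalid `bool` values.
        size_of::<Option<bool>>(),
        size_of::<MaybeUninit<bool>>(),
        // A union exposes no niche, so `Option` needs a separate tag.
        size_of::<Option<MaybeUninit<bool>>>(),
    )
}

fn main() {
    println!("{:?}", sizes());
    // Not asserted: this only shows what the compiler currently does for `U`.
    println!("U: {}, Option<U>: {}", size_of::<U>(), size_of::<Option<U>>());
}
```

Under the proposal quoted above, `Option<U>` could in principle reuse the invalid values of the first byte the same way `Option<bool>` does.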
We discussed this briefly on the language team meeting today. Attendance was low. However, those present agreed that it would be OK to move towards stabilizing repr(transparent) on univariant enums specifically, while leaving the hint on unions unstable for the time being. I will write up a short report & PR.
Stabilization report & PR for enums is up at https://github.com/rust-lang/rust/pull/68122.
This is now just tracking transparent unions.
I think I just realized there is a problem with transparent unions if we want to provide "bag of bytes" semantics for unions:
#[repr(transparent)]
union U { f: u32 }
As a union, U should be just a bag of bytes. If I make one of the bytes poison, then after copying the union around, that should properly preserve which byte is poison and which one is not.
However, to my knowledge, LLVM actually makes an i32 either fully poison or not poison at all -- so the moment such a partially poisoned U gets loaded as an i32 in LLVM, the remaining bytes would lose their content and reset to poison as well. Ouch.
I have two questions:
1. Why is that not a problem if you omit #[repr(transparent)]?
2. Why is this so bad? You just poisoned f; I'm not sure what else you'd expect to happen, with or without #[repr(transparent)].
I'm still not sold on the "bag of bits" idea. I've tried to put an alternative forward, but instead of receiving a good rebuttal for why my alternative is inferior, it got sidetracked by "let's decide what padding bits are" when I think we're pretty much already in agreement on what "padding bits" means. Is this something that a chat on Zulip or something could solve? I think some real-time vocal communication could resolve this in minutes, whereas asynchronous text comms will take days (or weeks) (and GitHub is a suboptimal forum due to it hiding replies and whatnot).
Why is that not a problem if you omit #[repr(transparent)]?
Since repr(Rust) unions have no ABI commitments, we can just represent them as a literal byte array. That allows each byte to be poison-or-not independently.
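A small sketch of that "literal byte array" representation, written with `MaybeUninit<u8>` so that it is ordinary stable Rust. `ByteBag` and `copy_keeps_initialized_bytes` are made-up names for this example, standing in for how a repr(Rust) union could be represented. They are not something the compiler exposes.

```rust
use std::mem::MaybeUninit;

// Hypothetical stand-in for "a union represented as a literal byte array".
#[derive(Clone, Copy)]
struct ByteBag([MaybeUninit<u8>; 4]);

fn copy_keeps_initialized_bytes() -> [u8; 3] {
    // Start fully uninitialized, then initialize bytes 0, 2 and 3 only.
    let mut bag = ByteBag([MaybeUninit::uninit(); 4]);
    bag.0[0] = MaybeUninit::new(0xAA);
    bag.0[2] = MaybeUninit::new(0xBB);
    bag.0[3] = MaybeUninit::new(0xCC);

    // Wholesale copy: each byte is carried over on its own, so byte 1
    // staying uninitialized does not affect its neighbours.
    let copy = bag;

    // Only the initialized bytes are read back.
    unsafe {
        [
            copy.0[0].assume_init(),
            copy.0[2].assume_init(),
            copy.0[3].assume_init(),
        ]
    }
}

fn main() {
    println!("{:x?}", copy_keeps_initialized_bytes());
}
```

This per-byte independence is what a transparent union over `u32` cannot obviously keep once it has to be loaded as a single `i32`.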
@hanna-kruppe That's true, but I'm not sure it's useful to allow such granularity for poison because you still can't read f. You can only memcpy the non-poison bytes of f, which I'm not sure Rust should guarantee you can do because it's such a big footgun (and I can't think of any real value it provides).
That goes more into your second question (why/how this observation matters), which I deliberately didn't go into because it's a more complex topic and my time is currently very limited.
Why is this so bad? You just poisoned f; I'm not sure what else you'd expect to happen, with or without #[repr(transparent)].
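#![feature(transparent_unions)] // nightly-only: repr(transparent) on unions is unstable
use std::mem::MaybeUninit;
#[repr(transparent)]
union U { f: u32 }
fn main() {
    let mut u = U { f: 0 };
    unsafe {
        // De-initialize only the second byte of `u`.
        (&mut u as *mut _ as *mut MaybeUninit<u8>).add(1).write(MaybeUninit::uninit());
        let u2 = u; // move the union wholesale
        let v = (&u2 as *const _ as *const u8).read(); // read the first byte, which was never de-initialized
        println!("{}", v);
    }
}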
The Rust semantics as they exist in my head and as it is drafted here would guarantee that this program prints 0. This is what "unions are just bags of bytes" means.
But it turns out LLVM actually says "nope this is UB as all of u2 is poisoned". So, the desired semantics is not implementable with LLVM while upholding repr(transparent) guarantees.
Of course we could try to adjust our semantics, but (a) that will make the semantics of unions significantly more complicated, and (b) it seems like a shame that LLVM would force us to cripple our semantics like that, for no good reason. If LLVM's type system was not quite so restrictive, we could just tell LLVM to load 4 bytes at once and preserve which byte is poison and which is not. But LLVM conflates "bytes" with "integers that have arithmetic operations", and then a lot of sadness ensues.
Thanks for the demo, @RalfJung. I personally think this should have the same behavior as #[repr(transparent)] struct S { f: u32 }. I'll take some time to read through the draft you linked to, though.
@mjbshaw do you still think that if we replace U by MaybeUninit<u32>? Since MaybeUninit is repr(transparent), that's basically the same type.
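For reference, a sketch of the earlier demo with U swapped for MaybeUninit<u32>, which builds on stable. `first_byte_after_copy` is a made-up name. Whether the first byte is guaranteed to survive the copy is precisely the question in this thread, so the sketch does not state an expected output. The size and alignment checks, by contrast, are documented guarantees of `MaybeUninit`.

```rust
use std::mem::{align_of, size_of, MaybeUninit};

fn first_byte_after_copy() -> u8 {
    let mut u = MaybeUninit::<u32>::new(0);
    unsafe {
        // De-initialize only the second byte.
        (u.as_mut_ptr() as *mut MaybeUninit<u8>)
            .add(1)
            .write(MaybeUninit::uninit());
    }
    // Typed copy of the whole `MaybeUninit<u32>`.
    let u2 = u;
    // Read the first byte, which was initialized to 0 and never overwritten.
    unsafe { (u2.as_ptr() as *const u8).read() }
}

fn main() {
    // Documented guarantee: same size and alignment as `u32`.
    assert_eq!(size_of::<MaybeUninit<u32>>(), size_of::<u32>());
    assert_eq!(align_of::<MaybeUninit<u32>>(), align_of::<u32>());
    println!("{}", first_byte_after_copy());
}
```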
Doesn't writing MaybeUninit::uninit() still write undef and not poison? Unless I'm quite mistaken, in this playground, it's undef and not poison that gets passed to write_to_float. And since undef is tracked bitwise, this all works out, doesn't it?
Of course, if LLVM does change uninitialized memory to poison, then we would run into this problem, without bitwise semantics.
Indeed, this was written assuming LLVM would switch to poison eventually.
Is the concern in this comment about poison spreading to entire union fields not equally valid for #[repr(C)] unions that must be passed as integers per platform ABIs?
Yeah sure, that's the same situation.
Judging from this discussion, it seems like LLVM will get a "freezing load" operation. Whenever we have a load that would allow partially uninit data, we could use the freezing load and be sure that the data is preserved correctly. This does lose some information, but at least it would resolve the concern about spreading poison to neighboring bytes.
So, after poking around the various open issues and PRs, this looks like the best place to mention this.
I mentioned in #101179 that I think that allowing DSTs in MaybeUninit might be a way to improve the API for slices (by allowing MaybeUninit<[T]>), but it appears that DSTs are still not allowed in unions, even #[repr(transparent)] ones like MaybeUninit. The closest this was to being implemented was #47650 which got closed because the author didn't have time to finish it.
Was there ever an explicit reason to disallow this, or was it just not implemented/considered since no one had discussed it much?
The problem is custom DST. If CStr becomes a DST, then size_of_val might have to actually read the data behind the reference to determine the length of the value -- but clearly with a MaybeUninit<CStr> that would be bad.
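For contrast, a short sketch of why today's slice-like DSTs do not have this problem: their length is carried in the wide reference, so size_of_val never has to look at the pointee. `sizes_from_metadata` is a made-up name for this example.

```rust
use std::mem::{size_of_val, MaybeUninit};

fn sizes_from_metadata() -> (usize, usize) {
    // The contents are never initialized and never read.
    let buf: [MaybeUninit<u32>; 4] = [MaybeUninit::uninit(); 4];
    let slice: &[MaybeUninit<u32>] = &buf;
    // For slices the length lives in the wide reference, so computing the
    // size does not touch the (uninitialized) data.
    let slice_bytes = size_of_val(slice);

    // Same story for `str`: the byte length is part of the reference.
    let s: &str = "abc";
    let str_bytes = size_of_val(s);

    (slice_bytes, str_bytes)
}

fn main() {
    println!("{:?}", sizes_from_metadata());
}
```

A thin CStr would have to scan for the terminating NUL instead, which is what makes it incompatible with uninitialized contents.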
Ah, right -- we haven't fully eliminated the possibility of thin DSTs. I guess that we're still unsure what a proper custom DST RFC would look like, although my guess is that CStr will be a thorn in the side for most of them. C strings really are the worst. (I say this sarcastically, although I'm not surprised to find another reason why they're problematic.)
My gut feeling is that for any RFC which would permit thin DSTs (which, as demonstrated by CStr, would be something we want), we'd also want a mechanism to filter them out, precisely for cases like this. But I guess that this isn't a strong enough argument to be able to simply allow them now and deal with the consequences later.
This is a tracking issue for the RFC "Transparent Unions and Enums" (rust-lang/rfcs#2645).
Steps:
Unresolved questions:
Also it is not clear if transparent unions can even be implemented on LLVM without seriously restricting our semantics for unions overall.