Closed Thinkofname closed 4 years ago
cc https://github.com/rust-lang/rust/issues/36246 @arielb1
Should this drop test.b
or not? I can't figure that out from the RFC.
That println is distracting, so I reworded the example a bit:
{
let mut test = Test { a: A(3) }; // test.a is initialized -> test.b is initialized, test is initialized
{
let tmp_b = test.b; // test.b is uninitialized (moved out) -> test.a is uninitialized, test is uninitialized
// tmp_b is initialized
} // tmp_b is dropped, B is printed
test.b = B(0.5); // test.b is initialized -> test.a is initialized, test is initialized
// test.b is not dropped on assignment because it was uninitialized
} // test is dropped (noop)
And if you move out a subfield?
#![feature(untagged_unions)]
union U {
x: (Box<u32>, Box<u32>),
y: Vec<u32>,
raw: [u64; 4]
}
fn main() {
let mut u = U { raw: [0; 4] };
unsafe {
drop(u.x.1);
if true {
u.y = Vec::new(); // what does this drop?
} else if true {
u.x = (Box::new(0), Box::new(1));
} else {
u.x.0 = Box::new(0); // and this?
}
}
}
@arielb1
#![feature(untagged_unions)]
union U {
x: (Box<u32>, Box<u32>),
y: Vec<u32>,
raw: [u64; 4]
}
fn main() {
// init u.raw -> init (u.x, u.y) -> init u
let mut u = U { raw: [0; 4] };
unsafe {
// uninit u.x.1 -> uninit u.x -> uninit (u.y, u.raw) -> uninit u
// u.x.0 is still initialized
drop(u.x.1);
if true {
// init u.y -> init (u.x, u.raw) -> init u
// u.y was uninitialized, not dropped on assignment
u.y = Vec::new(); // what does this drop?
} else if true {
// init u.x -> init (u.y, u.raw) -> init u
// u.x was uninitialized, not dropped on assignment
u.x = (Box::new(0), Box::new(1));
} else {
// assign u.x.0
// u.x.0 was initialized, dropped on assignment
u.x.0 = Box::new(0); // and this?
}
}
}
I tried to describe the scheme in https://github.com/petrochenkov/rfcs/blob/e5266bd105f592f7408b8592c5c3deaccba7f1ec/text/1444-union.md#initialization-state and https://github.com/petrochenkov/rfcs/blob/e5266bd105f592f7408b8592c5c3deaccba7f1ec/text/1444-union.md#move-and-initialization-checking
@petrochenkov
That... sounds like something that would be fun to implement.
I think people will be confused by these run-time semantics. Should we prohibit replacing into partially-initialized unions?
Quite likely a duplicate of https://github.com/rust-lang/rust/issues/48962
cc @retep007 -- can you quickly test if this is a duplicate of #48962 ?
@nikomatsakis It seems not to be related to #48962 . I could take a look at this
When I tried to minimise example I found out that if either of println!("{:?}", unsafe { test.b });
, unsafe { test.b = B(0.5); }
is removed it does not ICE. Any ideas?
What is the backtrace where the ICE occurs?
Though reading the back-and-forth between @petrochenkov and @arielb1, it sounds like there is some confusion also as to what we should do upon assignment to a union field. I guess this has to do with whether to drop the pre-existing data or not (in part because the compiler doesn't have visibility into whether or not the union is initialized). I don't know that we can blame this ICE on NLL per se.
@nikomatsakis
This is not an NLL problem - it occurs even without #[feature(nll)]
, but rather is a problem with the unstable #[feature(untagged_unions)]
. The problem is that the spec for when union fields are valid had never been implemented in dropck. Reading the old RFC 1444 move semantics @petrochenkov linked to (which are what is currently implemented in AST borrowck - I'm not sure this is specified clearly enough in MIR borrowck) again, I find it implementable enough (although I haven't looked at all edge cases).
Note that the move semantics were removed from the newest version of the union spec.
However, I'm not sure that implementing the specification as written is the best idea - these move semantics interact in a very weird and inconsistent-seeming way with assignments.
union U {
a: (Box<u32>, Box<u32>),
b: (Box<u32>, Box<u32>),
}
let mut u = U { a: (Box::new(0), Box::new(1)) };
// u.a & u.b are both initialized
u.b.0 = Box::new(2); // this would call a destructor
drop(u.a.0);
// u.a.0 is uninitialized, u.b is also now uninitialized,
println!("{:?}", u.a.1); // prints `1`
u.a.1 = Box::new(3); // this would call a destructor
// u.b is *still* uninitialized, same as before.
println!("{:?}", u.b.1); //~ ERROR use of moved value
u.b.1 = Box::new(4); // this would *not* call a destructor, and would leak a `Box`
I feel that implementing this "as specified" will lead to a situation where it will be hard to predict whether writing to a union field will call a destructor or not.
I can think of 2 ways of fixing this inconsistency:
cc #32836 - this has to resolved one way or another before untagged_unions
is fully stabilized.
@arielb1 I'm looking at the example you gave, and that seems like unsafe code doing something wrong, by accessing u.b
after manipulating u.a
. I do agree that it seems error-prone, though, and it seems quite possible that people would make invalid assumptions about what they can do safely based on how fields of the same type overlap.
That said: this seems like it only applies to union fields that implement Drop
. We've stabilized union fields without Drop
in 1.19, and stabilized ManuallyDrop
(which has only one variant). We don't have stable unions with Drop
fields, as far as I know. So, this doesn't seem like it affects anything we have stabilized, right?
@arielb1
I can think of 2 ways of fixing this inconsistency:
I'd rather prefer something like 1 with more errors to something like 2 with double frees, to keep "union newtypes" with a single field like NoDrop
completely safe and to keep the "enum interpretation" of unions in general.
FWIW, I still see the behavior of code in https://github.com/rust-lang/rust/issues/41073#issuecomment-380248006 as logical and "following from the first principles", but I agree that nobody would want to deal with such puzzles in practice, so I'm all for generating more errors for sanity's sake even if they are "artificial" in some sense.
to keep "union newtypes" with a single field like NoDrop completely safe and to keep the "enum interpretation" of unions in general.
I don't think union "newtypes" with a single field should behave like newtypes at all. Unions are inherently unsafe and only when accessed can we make any assumptions about what they contain. I think an "enum interpretation" is incorrect.
Unions were added primarily for FFI purposes, and in C they most certainly do not have an "enum interpretation". For example, there are libraries that use unions as a way to be extensible -- as the library evolves, new union variants are added, and this is binary compatible with old clients that know about fewer variants. (IIRC it was @joshtriplett who brought this up.) A single-variant union just means that this is the only variant we know of, but unless we actually use that variant, we should not assume that this is the variant we are in.
TBH the most reasonable behavior for assigning to union fields that I can come up with is to never drop anything. That's at least predictable, and assignment is unsafe after all. I think any scheme that attempts to track union initialization state is bound to backfire. And I am certainly extremely surprised that initializing a field makes its siblings initialized. I would have expected the exact opposite---initializing a union field deinitializes its siblings. Why does the RFC make this strange choice? That's even more surprising if I consider that you want to follow an "enum interpretation", where certainly initializing one variant deinitializes the others.
One point I am missing from the RFC: What is the initialization state when a union goes into scope, e.g., when it is passed to a function?
@RalfJung
And I am certainly extremely surprised that initializing a field makes its siblings initialized. I would have expected the exact opposite---initializing a union field deinitializes its siblings
Well, to support transmuting with unions
union U { f: f32, i: i32 }
let u = U { f: 1.0 }; // Initialize one field, another becomes initialized too
let i = unsafe { u.i }; // OK, `i` became initialized together with `f`, so we don't get an error here
https://github.com/rust-lang/rfcs/pull/1897 contains a number of examples that serve as motivation for the currently implemented rules, but I never got enough time to push it over finish line. It's sad that discussion of union semantics is now spread over various random issues and mostly ignores that RFC.
What is the initialization state when a union goes into scope, e.g., when it is passed to a function?
To pass something (not necessarily a union) to a function you have to borrow or move/copy it and you can't borrow/move/copy something uninitialized, so all function parameters are in initialized state initially.
(EDIT: here and above I'm talking about initialization state as it's tracked by the compiler statically (with move checker) and dynamically (with drop flags), not stuff like let x = mem::uninitialized()
of course).
To pass something (not necessarily a union) to a function you have to borrow or move/copy it and you can't borrow/move/copy something uninitialized, so all function parameters are in initialized state initially.
Ah, that's a good point. I hadn't thought of this because I did not really accept the "writing to a union field initializes the siblings" rule, but with that rule, this actually makes sense.
However, doesn't that mean if I have two types that implement Drop
, say
union Foo {
x: String,
y: Box<i32>
}
and then I do something like
let u = Foo { x: "abc".to_owned() };
u.y = Box::new(42);
then everything explodes because this will cause drop
to run on u.y
prior to the assignment?
The more I look at this, the more I think that maybe having Drop
fields in unions at all is just a mistake. People should write
union Foo {
x: ManuallyDrop<String>,
y: ManuallyDrop<Box<i32>>
}
It's sad that discussion of union semantics is now spread over various random issues and mostly ignores that RFC.
Yeah, it's a huge mess :/ Also, I wasn't even ware of the extent of this RFC. Clearly I didn't do my research. Sorry for that! I hope to find some time to read it soon, that's lots of text.
everything explodes because this will cause drop to run on
u.y
prior to the assignment?
Yes, that's why assignments to Drop
fields require an unsafe block so you can explicitly promise to the compiler that dropping that field won't cause any bad consequences. (So the author of unsafe { ... }
would be responsible for the explosion in this case.)
u.y = Box::new(42); //~ ERROR assignment to non-`Copy` union field requires unsafe function or block
Yes, that's why assignments to Drop fields require an unsafe block so you can explicitly promise to the compiler that dropping that field won't cause any bad consequences.
Fair enough. This is still a footgun though, e.g. all this may happen in an unsafe fn
so the author doesn't even realize that the assignment is unsafe. TBH this reminds me of C making integer addition unsafe: There should be some context-free syntactic way to easily tell that an operation is unsafe. People thinking "unions are for FFI so they work like they do in C" will have a hard time, and probably write buggy code.
I do think that unions with Drop fields are a massive footgun regardless. They do require unsafe
code, at least.
I am reminded of what I recently said in the context of references to fields of a packed struct:
I also feel this violates the usual pattern in Rust that you should not make unsafe operations accidentally. The only other non-function-call operations we have that are unsafe involve dereferencing a raw pointer, and everyone kind-of knows why that is unsafe (though they may forget about alignment) [and I should have added reading a union field here]. That the mere act of
taking a referenceassigning to a field can be UB will probably be surprising to many. If they are doing this e.g. in an unsafe fn, they will never even see the warning/error, no matter how much we improve it.
Notice that assigning to a union field is always safe in C; only reading it can yield UB.
@RalfJung
Unions were added primarily for FFI purposes, and in C they most certainly do not have an "enum interpretation".
Unions in (standard) C/C++ have much stronger enum interpretation than in Rust, C and C++ have a notion of "active field" of a union and you can't even access non-active fields (i.e. transmute with unions) in general.
(Of course, C/C++ have much looser static analysis so you can create and pass around uninitialized unions, but accessing them is still UB.)
C and C++ have a notion of "active field" of a union and you can't even access non-active fields (i.e. transmute with unions) in general.
AFAIK the following is legal:
union int_or_short { int x; short y; } u = { .x = 3 };
printf("%d\n", u.y);
You cannot access non-active fields through pointers, but you are generally allowed to access it "through the union type". (The standard is somewhat ambiguous about this, but Wikipedia agrees and GCC explicitly says it is okay.)
Above, we were talking about accessing the union directly, i.e. through the union type, in which case C/C++ does not apply restrictions based on the "active variant".
The restriction for accessing the "wrong" field through a pointer is necessary to make type-based alias analysis work. I didn't check the old standards, but I'd guess before the aliasing restrictions were introduced, unions were purely access-based as well.
@RalfJung Yeah, that's why I said "in general", there are limited ways to do type punning through unions in standard C (depending on version, IIRC) but not in C++. My point was that the "base" model in those languages is the enum one, limited type punning is an extension out of necessity.
https://github.com/rust-lang/rfcs/pull/1897 proposes roughly the same model, i.e. "enum" interpretation is the base, so there's always an active field that's guaranteed to be valid, but you can also access inactive fields at your own risk as an extension.
My point was that the "base" model in those languages is the enum one, limited type punning is an extension out of necessity.
I actually disagree. I think the enum one exists solely to enable type-based alias analysis (which we decisively do not want in Rust); if it weren't for that, C would be just fine without any notion of an "active variant".
rust-lang/rfcs#1897 proposes roughly the same model, i.e. "enum" interpretation is the base, so there's always an active field that's guaranteed to be valid, but you can also access inactive fields at your own risk as an extension.
I'm not at all happy about this proposal. This extra hidden state makes things much more complicated. I think a model where unions are just bytes, and then on every read/write we decide what to do with these bytes, is much simpler.
For example, imagine you'd want to extend miri to actually be able to fully check violations of these rules for accessing union fields. I think that is a very desirable feature to have---it's not just practically useful (we could use miri to test for UB violations in unsafe code), it also means that it is even possible to have such a checker. IMO the fact that such a checker is likely impossible for C is one of the biggest faults of the language, and a mistake I'd like to not repeat in Rust. How would such a miri extension work in an "active field" situation? It seems it'd have to somehow track this extra state on the side, and be very careful about updating that state when pointers are used to access the union. It all seems rather messy to me. In contrast, if every access is just based on the bytes in memory and the type and offset of the field that is accessed, then we already have the full checker implemented in miri -- this is literally how every other access in Rust works as well. I am strongly opposed to unions introducing vast amounts of complexity in this machinery.
Given that we don't want TBAA, I see no reason to adapt an "enum interpretation". It's a much more complicated model, and I don't see any tangible benefits spelled out in the RFC.
Now that RFC 2514 has been accepted (tracking issue #55149), the plan of record is to not allow unions to have a field with drop glue.
(Thus, the main issue here, once #55149 is implemented, will be to just make sure we signal a proper error in this case, rather than ICE.)
Full backtrace can be seen on the playground
https://play.rust-lang.org/?gist=bc93f5b222b09b31d3fd286cfc10858a&version=nightly&backtrace=2