rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.23k stars 12.7k forks source link

Decide on when MIR Discriminant() operation is UB #91095

Open RalfJung opened 2 years ago

RalfJung commented 2 years ago

We do not currently have a clear description of what the semantics of the Discriminant() MIR operation, and the corresponding intrinsic (exposed via mem::discriminant()), are -- specifically, what are the safety preconditions of this operation, and when is it UB?

Note that this operation works on all types, not just enums. For valid values of non non-enum types it returns some valid integer value (currently, 0).

The implementation in Miri (to be restored with https://github.com/rust-lang/rust/pull/91088) does the minimum amount of work necessary to determine the discriminant: if the type has no discriminant (since there are not at least 2 variants), the operation is always defined; otherwise it reads the tag (which encodes the discriminant) and causes UB if that is uninitialized or does not encode a valid discriminant. (There are some thorny question here around what happens if the discriminant has provenance; I would like to keep that out of scope for this issue -- it should likely be treated like a ptr-to-int transmute, whatever we end up doing with that: https://github.com/rust-lang/unsafe-code-guidelines/issues/286.)

The codegen backend adds some extra UB for the case where the type is uninhabited: https://github.com/rust-lang/rust/blob/81117ff930fbf3792b4f9504e3c6bccc87b10823/compiler/rustc_codegen_ssa/src/mir/place.rs#L206-L215

We also have a related MIR optimization in https://github.com/rust-lang/rust/blob/93542a8240c5f926ac5f3f99cef99366082f9c2b/compiler/rustc_mir_transform/src/uninhabited_enum_branching.rs. I am not quite sure what this does though, it seems to be more about assuming that if a particular enum variant is uninhabited then we will never see the discriminant for that variant, and can hence remove it from the SwitchInt?

An 'obvious' choice is to say that the value passed to the Discriminant operator must fully satisfy its validity invariant -- that would certainly justify both the MIR optimization and what the codegen backend does. However, this also has problems:

These observations make me doubtful that requiring full validity is the right thing. Making the fewest assumptions is appealing IMO, but not compatible with our codegen backend nor with the MIR optimizations -- the optimization seems to kick in even for operations of the form Discriminant(*ptr), so the validity invariant of ptr itself does not help either. It could be possible to strike some middle ground, but that feels like a rather ad-hoc adjustment to the current set of optimizations.

To summarize:

Cc @wesleywiser @tmandry @rust-lang/wg-mir-opt

tmiasko commented 2 years ago

The implementation in Miri ... reads the tag (which encodes the discriminant) and causes UB if that is uninitialized or does not encode a valid discriminant.

There seems to be an additional mismatch in terms of which discriminants are valid. The code generation considers discriminants corresponding to uninhabited variants to be invalid, but they do not seem to cause any errors in Miri. Does Miri validate a scalar range?

RalfJung commented 2 years ago

The code generation considers discriminants corresponding to uninhabited variants to be invalid,

How concretely does that assumption look like -- can you link to the code that does this?

Does Miri validate a scalar range?

Miri checks that a variant with the given tag exists. But Miri does not do further special casing of uninhabited variants.

For Niche-encoded tags, AFAIK uninhabited variants are not even necessarily assigned a tag, so Miri can never return such a variant. Similarly, if all but 1 variant are uninhabited, rustc will not use a multi-variant layout for this type -- so the other variants cannot possibly be returned either. (This cannot give rise to UB though, it simply means that there exists no sequence of bits that would be interpreted as encoding that variant.)

tmiasko commented 2 years ago

The code generation considers discriminants corresponding to uninhabited variants to be invalid,

How concretely does that assumption look like -- can you link to the code that does this?

The layout for tag describes a valid range, which is used to emit LLVM range metadata when loading the tag. For example, for a direct encoding:

https://github.com/rust-lang/rust/blob/81117ff930fbf3792b4f9504e3c6bccc87b10823/compiler/rustc_middle/src/ty/layout.rs#L1146-L1160

https://github.com/rust-lang/rust/blob/81117ff930fbf3792b4f9504e3c6bccc87b10823/compiler/rustc_codegen_llvm/src/builder.rs#L483-L485

RalfJung commented 2 years ago

which is used to emit LLVM range metadata when loading the tag

Ah, and then that load_operand function is called at https://github.com/rust-lang/rust/blob/5b2f757dae374e22a7733f90af482f405bd426e9/compiler/rustc_codegen_ssa/src/mir/place.rs#L232

Yeah that seems like there are a lot of special cases for uninhabited variants / enums built into the codegen backend that do not seem to fall out of any more general underlying principle -- except for "the entire value must be valid", but that is incompatible with things we do elsewhere.

JakobDegen commented 2 years ago

I don't really see how there's more than one option here. For Option<NonZeroU8> and any other enum for which we may want to do layout optimizations in the future (ie all of them), we have to ban calling discriminant on Some(uninit), since that's just uninit. That means that at least for the stable mem::discriminant of user code, I think the reasonable rule to choose is this:

With the exception of any fields, or parts of fields, that are behind an UnsafeCell, all validity invariants for the value and its fields must be upheld prior to calling this function.

For SB this can then be a read of those places. This lets this example continue to be DB.

Despite requiring near full validity, I do think this is effectively encoding the idea of "doing minimal work" - just in a way that is cognizant of enum layout optimizations. Now of course we could instead try and define the safety requirements of mem::discriminant in terms of the layout that is chosen, ie something like "all those fields for which at least a part of their niche is occupied must be valid" but I don't see how this benefits either us or users.

For the associated MIR rvalue, I suppose we could choose a weaker requirement (since layout information is actually available there), but I'm also not sure what the benefit of that is.

JakobDegen commented 2 years ago

Note that this doesn't necessarily have to make the elaborate drops code wrong, if we decide that moving out leaves the value initialized (not that I'm advocating for that)

RalfJung commented 2 years ago

Now of course we could instead try and define the safety requirements of mem::discriminant in terms of the layout that is chosen, ie something like "all those fields for which at least a part of their niche is occupied must be valid" but I don't see how this benefits either us or users.

That's not actually so bad -- it is, I think, basically what Miri currently does. One "just" implements the logic of determining the discriminant in the obvious way, and if any of the values considered in that process are uninit, we raise UB.

JakobDegen commented 2 years ago

Now of course we could instead try and define the safety requirements of mem::discriminant in terms of the layout that is chosen, ie something like "all those fields for which at least a part of their niche is occupied must be valid" but I don't see how this benefits either us or users.

That's not actually so bad -- it is, I think, basically what Miri currently does. One "just" implements the logic of determining the discriminant in the obvious way, and if any of the values considered in that process are uninit, we raise UB.

What are the benefits of this though? Users can't stably rely on this anyway, so the only way that they can make use of this is to do some cursed thing where they try and analyze the chosen layout - I'm not even convinced though that it's actually possible to do this for enums the same way it is for structs. Do we expect MIR optimization passes to do this sort of analysis? Do we even want them to? This is one of those instances where I expect that adding more UB to the language is going to decrease the amount of UB in the ecosystem, because it will reduce the chance that people do dangerous things for which there are probably better solutions.

I do recognize that this might be useful if we allow #[repr(u16)] or whatever on non-fieldless enums in the future (and make some layout guarantees along with that). In that case though, I think if/when T-Lang decides to do that, we should revisit these rules and loosen them for those cases

JakobDegen commented 2 years ago

oh, another relevant point here @RalfJung : Types don't have to have niches in order for us to want to read some of their bytes in relation to a discriminant call. Specifically, imagine that I have an enum like the following:

enum E {
    A(NonZeroU8, u8),
    B,
    C,
}

rustc could choose a layout like this:

A(x, y) => [x, y],
B => [0, 1],
C => [0, 0],

Now we could imagine a future in which we get enough features (eg a byte type) to implement mem::discriminant in MIR. In that case, we might want to optimize

x == E::B

into

(x as [byte; 2]) == [0, 1]

But if we guarantee that u8 in A is not read by the mem::discriminant call, then this optimization is incorrect. LLVM also won't be able to do it either, because it won't know that the read of the second byte can be hoisted out of first_byte == 0 condition. This, by the way, is why the UnsafeCell distinction is important. Because if we replace u8 with UnsafeCell<u8> we can no longer do this optimization

What all this means is that "the niche is partially occupied" would be an insufficient description of the requirements (assuming we want to enable this optimization). What we'd instead need is some condition that talks about the use of a field's bytes in conjunction with the niche in another field

RalfJung commented 2 years ago

What are the benefits of this though?

It's the least amount of UB we can have. That's IMO the default state and any extra UB needs justification. :)

I am not convinced that special-casing UnsafeCell in the safety contract of this operation is a good idea -- it is quite complicated, and it is still not enough ti justify what current codegen does. (UnsafeCell<!> is still uninhabited, but if I understand your proposal then it would be allowed to ask for its discriminant.)

JakobDegen commented 2 years ago

What are the benefits of this though?

It's the least amount of UB we can have. That's IMO the default state and any extra UB needs justification. :)

I am not convinced that special-casing UnsafeCell in the safety contract of this operation is a good idea -- it is quite complicated, and it is still not enough ti justify what current codegen does. (UnsafeCell<!> is still uninhabited, but if I understand your proposal then it would be allowed to ask for its discriminant.)

This last part is a good point, and I'm not sure what the best strategy for proceeding with it is. But my point above still stands. There are real (and probably non-trivial) optimizations that require us to read possibly anything that is not behind an UnsafeCell. What would the alternative rule that allows the above optimization even be?

RalfJung commented 2 years ago

LLVM also won't be able to do it either, because it won't know that the read of the second byte can be hoisted out of first_byte == 0 condition.

Actually it should be able to do that, since all our references are dereferencable.

JakobDegen commented 2 years ago

LLVM also won't be able to do it either, because it won't know that the read of the second byte can be hoisted out of first_byte == 0 condition.

Actually it should be able to do that, since all our references are dereferencable.

dereferenceable is not strong enough, since it's not enough to guarantee that the read doesn't race. See this example, where LLVM has to keep the select because the second load might have yielded poison (llc opts don't figure this out).

Edit: After more experimentation, this example is inconclusive because LLVM doesn't figure it out even with more metadata. Still though, I do think that dereferenceable is not strong enough, since it does not guarantee that the read isn't racey

RalfJung commented 2 years ago

Races are not a concern, in LLVM a read-write race just means the read returns undef. And yes LLVM might have to add a freeze to get around poison, but that's still very possible. (But I have no idea if it will actually do that. All I am saying is that it legally could.)

JakobDegen commented 2 years ago

Races are not a concern, in LLVM a read-write race just means the read returns undef. And yes LLVM might have to add a freeze to get around poison, but that's still very possible. (But I have no idea if it will actually do that. All I am saying is that it legally could.)

Oh this is a good point about the races, I thought they returned poison instead of undef. Still though, ideally LLVM would be able to combine the two loads into one, and it can't do that by freezing the result of the places. It might be able to do it by freezing the load itself (or something equivalent)

RalfJung commented 2 years ago

Yeah, it can do the 2-byte load, freeze that, and then compare with [0, 1].