I agree with @rkruppe that in many cases UB is the safest choice, since it allows us to define the behavior later.
I am not sure I fully understood @gereeter's argument (maybe they can elaborate), but it appears to state that there are other options beyond defining and not defining some program behavior.
AFAICT, we can choose not to define the behavior of something: e.g., if a null pointer is dereferenced, the program is illegal (the guidelines do not speak about the behavior of illegal programs). Anything else we do makes it defined behavior (e.g., `panic!`s).
Maybe @gereeter's point is that we should prefer defined behavior to undefined behavior? If so, I fully agree, but in many cases, like the pointer-dereference case, we can't, and that's OK (e.g., those who want to avoid it can stick to safe Rust; those using raw pointers are explicitly opting into that for "reasons").
Also, because dereferencing a null pointer is undefined behavior, we could have a "fortified" build mode that `panic!`s on null pointer dereference, miri can offer a stack trace, etc. I'm with @RalfJung here that no matter where we set the bar for UB, it should be possible to instrument a binary such that undefined behavior is never invoked.
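To make that concrete, here is a minimal sketch (in current LLVM IR syntax, using opaque pointers) of how such a hypothetical fortified mode could lower a raw-pointer dereference `*p`; the `@rust_panic` hook is an illustrative name, not an existing entry point:

```llvm
; Hypothetical "fortified" lowering of `*p`: check for null before the
; load and divert to a panic instead of invoking UB.
define i32 @fortified_deref(ptr %p) {
entry:
  %is_null = icmp eq ptr %p, null
  br i1 %is_null, label %panic, label %ok

panic:
  call void @rust_panic()   ; illustrative panic hook
  unreachable

ok:
  %v = load i32, ptr %p
  ret i32 %v
}

declare void @rust_panic()
```

This instrumentation is only legal to insert because the unguarded dereference would have been UB: it cannot change the behavior of any program that had defined behavior to begin with.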
From the other thread:
there are lots of other notions of "incorrect": undef/varying value, poison, frozen poison/arbitrary value, some sort of sticky poison that on creation also marks some memory locations as permanently inaccessible, etc.
It seems likely that LLVM will move to just one kind of "incorrect value", and that is poison. The "floating" undef has led to no end of problems, and many optimizations are unsound in its presence, some of them leading to real miscompilations.
So, I think we should restrict ourselves to two possibilities when we want to declare an operation illegal:

- UB
- `poison` (which miri unfortunately calls `Undef`)

LLVM basically uses the latter whenever it can get away with it, as that leads to easier code-motion optimizations. However, when specifying a surface language, that is not our concern, and hence it is usually preferable to declare UB -- the optimizer then still has the choice of actually "just" considering this as returning `poison`. The only reason we even have `Undef` in our "abstract machine" at all is that we have to handle reading uninitialized memory, and for practical reasons we cannot make that UB.
@gereeter proposes another form that's somewhat "in the middle": weak UB taints the "future" of the execution, but cannot "travel back in time" like C's strong UB can. Adding another kind of UB complicates the specification (and constrains optimizations and codegen to LLVM, because LLVM does not have anything resembling weak UB), so I'd like to see strong reasons for why the two kinds we have are not enough.
Now, @gereeter provides some good arguments, but I think those goals are fairly unachievable. For example:
Security. If secrets like passwords and keys are stored in some volatile memory and properly cleared after use, it would be useful to know that some bug in unrelated code long after the sensitive computation is complete can't expose the secrets.
It is an open problem, as far as I know, how to specify a language that can be both optimized and provide reasonable guarantees for "cleaning" memory. Our work here is hard enough without adding unsolved problems to it, so I am worried that we are putting too many things on our plate here.
Also, this argument, as well as the others, only works out if all UB is weak. So I don't think this adds another choice per illegal operation; it just adds one global option to our semantics: whether "UB" means strong UB or weak UB.
If we restrict ourselves to strong UB now, we can always offer a weak UB mode later, once we actually have codegen that can support this. That is something which might be possible with the cranelift codegen backend for debug builds, but as long as we use LLVM, I think that is a ways off.
@RalfJung can you say a bit more about what "poison" means in LLVM (vs "floating undef" or whatever the alternatives are)?
"poison" is like the Uninit
in my blog post: Any byte (or maybe bit, that detail is not determined yet) is either some "normal" value, or "poison". Copying data with "poison" in it around is fine, storing it to memory and loading it from memory happens as expected and preserves "being poison". (It is an open question what happens when you load an i32
and some bytes/bits are poison -- does the entire i32
become poison, or does it preserve byte/bit-level precision?) All other operations "propagate" poison where possible: poison * 0
is poison, posion + 5
is poison, and so on. If poison
is used for a conditional jump, we have UB.
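A minimal LLVM IR sketch of those rules, assuming for illustration that an uninitialized load yields poison (as in the poison-only model discussed here; current LLVM actually yields `undef` for such loads):

```llvm
define void @poison_demo() {
  %slot = alloca i32
  %x = load i32, ptr %slot       ; uninitialized load: %x is poison (by assumption above)
  %y = mul i32 %x, 0             ; poison * 0 is poison, not 0: operations propagate poison
  %c = icmp eq i32 %y, 0         ; comparing poison also yields poison
  br i1 %c, label %a, label %b   ; branching on poison: immediate UB
a:
  ret void
b:
  ret void
}
```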
Together with poison comes an operation `freeze`, which takes in any value and returns a non-poison value. It does this by non-deterministically choosing anything when the input is `poison`. So while `if poison` is UB, `if freeze(poison)` is non-deterministic. `freeze` is a way to "decontaminate" a possibly poisoned value (with no side-effect on data that was not poisoned to begin with). This is very useful for some optimizations that turn arithmetic into conditional jumps and must not introduce UB. It is exactly the ingredient that LLVM was missing, and whose absence led to it having `undef`.
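In LLVM IR syntax (where `freeze` has since landed as an actual instruction), the contrast looks roughly like this:

```llvm
define i32 @launder(i32 %maybe_poison) {
  ; freeze returns its input unchanged if it is a normal value; if it is
  ; poison, it non-deterministically picks some non-poison bit pattern.
  %frozen = freeze i32 %maybe_poison
  %c = icmp eq i32 %frozen, 0
  ; Branching on the frozen value is fine: at worst a non-deterministic
  ; choice, never UB. Branching on %maybe_poison directly would be UB
  ; whenever it is poison.
  br i1 %c, label %zero, label %nonzero
zero:
  ret i32 0
nonzero:
  ret i32 %frozen
}
```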
"floating undef" (the undef
of LLVM) is a whole different beast. LLVM with "floating undef" actually has an even more complicated notion of a byte or value: A value is a set of options. "Normal" values are singletons sets, but undef
is the set of all values. undef * 2
is the set of all even values. undef == undef
is the set {0, 1}
. Basically, for all operations, the result is the set of values that can be obtained by performing the operation with any of the possible choices from either operand.
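Transcribed into IR syntax for concreteness (illustrative only; in a real module these would simply get constant-folded):

```llvm
define void @undef_sets() {
  ; Under "floating undef", each value denotes a set of options:
  %a = mul i32 undef, 2          ; the set of all even i32 values
  %b = icmp eq i32 undef, undef  ; the set {0, 1}: each operand picks
                                 ; independently from the set of all values
  ret void
}
```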
Note that what I said so far is fully deterministic! At no point are we making a choice. Instead, we track all possible choices in one execution. Non-determinism enters the picture once you do a conditional jump on a non-singleton value: Then you non-deterministically pick an element from the set and use that.
This model has a host of problems, and some optimizations that LLVM performs are unsound for it. It is also unclear what happens when you store a multi-byte non-singleton value into memory: do we somehow track the correlation between those bytes (say the value is `{0^32, 1^32}`, i.e., it is either all-0 or all-1)? None of these questions have ever been answered in a consistent way, to my knowledge. See this paper for more details. It has long been proposed to remove `undef` from LLVM, and there seems to be general agreement, except that some of the questions around `poison` that I mentioned above still need to be worked out.
@RalfJung thank you for writing this. When talking about `freeze`, you mention:
This is very useful for some optimizations that turn arithmetic into conditional jumps, [...] It does this by non-deterministically choosing anything when the input is poison
Could you maybe elaborate a bit more on how `freeze` works? I wonder whether it just "forgets" poison, whether it preserves attributes, what guarantees I get on the result value, etc. For example, say I have a nonnull poison pointer and I launder it with freeze: do I just get a non-poison pointer back, or does it preserve the nonnull attribute so that I get a nonnull pointer back? If I get a nonnull pointer back, am I guaranteed that dereferencing this pointer won't introduce undefined behavior of the null-pointer-dereference kind? That is, does `freeze` make sure that the bit pattern of the non-poison pointer is not null? Also, I wonder whether it is possible to "hook" freeze to, e.g., panic if the value passed to it was poison.
Poison is a compile-time fiction: it is used in the language semantics to justify some code transformations, but it is not a distinct value at run time (unless one deliberately builds a VM like miri that detects UB rather than exploiting it). I wouldn't go as far as saying the choice between poison, undef, and instant UB has no impact on the compiled machine code at all, but it certainly isn't a real value the program could detect and handle specially.
For example, say I have a nonnull poison pointer and I launder it with freeze, do I just get a non-poison pointer back, or does it preserve the nonnull attribute so that I get a nonnull pointer back
It is not immediately clear to me whether it is or should be instant UB if a value annotated as `nonnull` (or as being in a certain range, or not aliasing other pointers, or other such properties) turns out to be poison, but placing restrictions on the possible outcomes of `freeze` based on attributes added to it elsewhere seems nonsensical to me.
Attributes such as `nonnull` are intended to enable optimizations, not hinder them, but this would force worse codegen (e.g., for nonnull, the code generator couldn't just map the `freeze` value to any register or stack slot without initialization; it would have to emit runtime code to ensure that location contains a non-zero value). Furthermore, these attributes can freely be dropped, ignored, or moved around by the compiler (because they're only intended to enable optimizations), so "a nonnull pointer" is not even something the compiler can, or even wants to, reliably detect in general.
Another angle: if you handle poison values, you're already hovering one centimeter over the pit of lava that is UB, and the only reason you don't fall in is that you promised not to get too smart with the garbage value. If this clashes with wanting to assume the value is well-behaved in certain ways, well, maybe stop assuming that, or don't handle garbage values, or manually fix up the result of `freeze` to conform to the assumption (e.g., see if it's null and change it if it is).
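A sketch of that last option in IR (the sentinel address `1` is an arbitrary illustrative choice):

```llvm
define ptr @freeze_then_fixup(ptr %maybe_poison) {
  ; freeze guarantees a non-poison pointer, but that pointer may still be
  ; null; if non-null is required, check and patch it up manually.
  %frozen = freeze ptr %maybe_poison
  %is_null = icmp eq ptr %frozen, null
  %fixed = select i1 %is_null, ptr inttoptr (i64 1 to ptr), ptr %frozen
  ret ptr %fixed
}
```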
Not sure what a "nonnull poison value" is. Attributes are attached to variables/function arguments, not values -- and poison is a value.
The semantics of `freeze(poison)` is to non-deterministically pick a non-poison bit pattern (the size of which is determined by the type of the operation). Notably, this is the only non-determinism in this scheme: there is no non-determinism (in the abstract machine!) when allocating new memory (it is deterministically initialized to be all-poison), and there is no non-determinism when performing an arithmetic operation on poison (the result is deterministically poison).
Wrt. attributes, I do not know whether it is UB to pass poison for an argument with the `nonnull` attribute. This is something the LLVM devs would have to answer. But my guess is "yes" -- basically, I would expect that poison satisfies no attribute, and @rkruppe spelled out all the reasons why :)
Discussed in backlog bonanza, closing. The answer to the question in the title is probably "yes", but there's no super clear compelling motivation. People are welcome to open new issues if there are specific problems they are facing.
Note that C is proposing it: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3128.pdf
We discussed that in the meeting (not the C proposal itself, but the Rust equivalent); there seems to be some disagreement about to what extent UB honors observable behaviors. (In particular, if one annotates a syscall or other external function call with a `mustreturn`-style annotation, it is possible that UB after it may propagate back past the IO.) My own interpretation would favor what is called "concrete UB" there: any IO events prior to UB are part of the observable behaviors and will be preserved; it is only the last stretch of unobservable computations prior to UB that is compromised. The answer to this question is tied up in our nondeterminism model. It is probably best to use a new issue.
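For concreteness, LLVM's existing `willreturn` and `nounwind` attributes are roughly such a "mustreturn style annotation"; the situation in question looks like this (sketch):

```llvm
; @print is declared as always returning and never unwinding.
declare void @print(i32) willreturn nounwind

define void @example() {
  call void @print(i32 42)   ; observable IO
  %x = load i32, ptr null    ; UB, strictly after the IO
  ret void
}
```

Under "time-traveling" UB, the annotations let the compiler conclude that UB is unavoidable once `@example` is entered, so the `print` call itself is up for grabs; under "concrete UB", the IO event must still be observed.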
That C proposal seems rather limiting, e.g., one can no longer move memory accesses up across `print` calls that definitely return. I am curious whether C compilers will really implement that.
For Rust it's probably less bad, since references give us stronger guarantees and earlier UB. But anyway, I agree with Mario: we shouldn't have different kinds of "UB", but we might want to have an issue tracking whether our UB can exhibit "time travel" or not.
Ah, it wasn't clear to me that there was actual disagreement here. My impression is that we can't actually consider anything other than time-travel UB because LLVM won't let us (although I'm having trouble getting it to do even obviously correct optimizations, so who knows).
Well, UB "time-travel" is partially a result of colloquial misinterpretation of the execution semantics. UB time-travels past unobservable computations because those computations are not temporally sequenced in the first place, they are more like nodes in a data dependence graph. The kind of time-travel that is actually observable would be traveling past IO, and I would say that for the most part we don't allow that, except in specific (TBD) enumerated circumstances. If you call an opaque function and then do a UB the function call will definitely happen, because it might just abort the program.
Right, but not all observable behaviour is entirely opaque -- and as discussed, if the thing being implemented is the abstract machine (and not an emulator of the abstract machine, so something closer to miri), then it knows exactly when the program terminates or reaches a UB state (and the AM doesn't necessarily track the prior observable behaviour and may simply discard it when it halts with an error).
I opened an issue for time-traveling UB: https://github.com/rust-lang/unsafe-code-guidelines/issues/407
In #5, @gereeter raised the point that defining all manner of errors as yielding "undefined behavior" is an awfully strong statement. It theoretically permits the compiler to "change the past" and may not mesh so well with the way that users think. On the other hand, weaker definitions may not permit the kinds of optimizations we want and may not be supported by LLVM.
In a sense, this is a cross-cutting concern: we need to figure out what's allowed and not allowed, but separately, we should consider if there are cases where we can contain the repercussions.
I'm not sure the best way to handle this, but I'm opening this up as its own potential discussion topic for the future.