rust-lang / unsafe-code-guidelines

Forum for discussion about what unsafe code can and can't do
https://rust-lang.github.io/unsafe-code-guidelines
Apache License 2.0
660 stars 57 forks source link

Validity of integers and floating point #71

Closed RalfJung closed 1 year ago

RalfJung commented 5 years ago

Discussing the validity invariant of integer and floating point types.

Clearly, every possible bit pattern is allowed. For integers they all have a distinct and meaningful interpretation, and we have a safe NOP-conversion between f32 and u32, and f64 and u64, through to_bits and from_bits.

The remaining open question is: is it ever allowed to have an uninitialized bit in an integer or floating point value? We could reasonably decide either way. Also, when an integer is partially uninitialized, does that "infect" the entire integer or do we exactly preserve which byte is initialized?

2022-09-07: This has now pretty much been answered.

RalfJung commented 5 years ago

Arguments for allowing uninitialized bits:

Arguments for disallowing uninitialized bits:

RalfJung commented 5 years ago

Here's an example of a piece of code that relies on uninitialized integers (and raw pointers, and AtomicPtr) being valid: https://github.com/carllerche/bytes/blob/456221d16521cf54cea0e6569669e47120a1b738/src/bytes.rs#L1810

RalfJung commented 5 years ago

An interesting use of uninitialized bits in integers is in crossbeam's AtomicCell: for types of certain sizes, it will use the AtomicU* types.

But this means that with a type like (u8, u16), it will use AtomicU32 and thus carry uninitialized bits in a u32. Also see https://github.com/crossbeam-rs/crossbeam/issues/315.

Cc @stjepang

gnzlbg commented 5 years ago

If we allow uninitialized bits, we have to define what happens with arithmetic operations when an input contains an uninitialized bit. Is that insta-UB? Is the result fully uninitialized? Does uninitializedness somehow propagate bitwise? These are hard choices, because it is easy to break arithmetic equations here. By disallowing uninitialized bits, we successfully avoid the question.

If we were to allow uninitialized bits, it might be reasonable to say, at least initially, that any operation that's not a read or a write is UB. That would allow us to define those operations later on. For example, that would mean that adding an initialized integer with an uninitialized one is UB, but later on, we could define that to something else, like the result of that operation being uninitialized.

That is, we wouldn't need to answer these hard questions right now.


I wonder whether it is possible to allow uninitialized bits later on in a backwards compatible way or whether we do have to make this decision right now ?

I think we have to make this decision right now, because, e.g. if we forbid uninitialized bits, all unsafe code assumes that integers are always initialized, and we can't change that later to support uninitialized bits without breaking that assumption.

In the same way, if we allow uninitialized bits, we can't later on disallow them without breaking code that uses them.


"No uninitialized data outside of unions" is a nice and and easy to teach principle.

Agreed.

A bug-finding tool can flag any occurrence of uninitialized memory in integers/floating points as a bug immediately.

I don't think it can flag every occurence, for example, extern fn foo() -> MaybeUninit<i32>, but it would be able to do so for all code that the tool is able to instrument at least.

hanna-kruppe commented 5 years ago

I think we have to make this decision right now, because, e.g. if we forbid uninitialized bits, all unsafe code assumes that integers are always initialized, and we can't change that later to support uninitialized bits without breaking that assumption.

I don't see how. If a type's validity invariant rules out certain initialized bit patterns, unsafe code can use those bit patterns for its own purposes, which will clash with later allowing those bit patterns in the validity invariant. However, uninitialized bits -- even if they are valid -- cannot be detected by the program: reading them is either UB or perhaps produces some string or zeros or ones (possibly a non-deterministic one). So that kind of counter-example is right out.

Moreover, because uninitialized bits will never be safe, unsafe code can't be fed them from unknown/external sources. IOW the only way a library/module/etc. will ever see uninitialized bits is if it produces them itself or obtains them from its "trusted base" (e.g., a function whose functional correctness is relevant to the soundness of the library/module/etc.), in which case it's UB today and its problem is an internal bug, not the interface with the rest of the world.

nikomatsakis commented 5 years ago

I would be inclined to permit uninitialized data in integers.

My reasoning is as follows:

I think that backwards compatibility around mem::uninitialized is a real concern. It is a very common pattern to allocate an uninitialized [u8] array in practice, and I am reluctant to declare all of that pre-existing code UB (even if we deprecate uninitialized).

Second, I think that the crossbeam example which @RalfJung raised earlier feels like the kind of tricky thing that people shouldn't have to fret about. In particular, if you have uninitialized padding bits in structs and things like that which are known to be less than word size, I think you should be able to treat them as integers for convenience, and it seems like that would be Insta-UB under this proposal.

Basically, I think people are likely to do things (like crossbeam) that wind up using uninitialized bits in integers but which don't necessarily "feel" like cases where MaybeUninit should be required. I also think that people will commonly reason about uninitialized data informally and hence write code that is UB -- of course, a tool that detects such usage can help, but it's also nice if "common things" people write are not UB even if they neglect to run the tool.

That said, I find the most compelling argument against permitting uninitialized bits to be that it would allow us to declare such uses an error. But it seems like we could still have a sort of "lint", where we say "ah, you are using uninitialized data outside of a MaybeUninit struct -- while not technically UB, it is recommended to use MaybeUninit.

One meta-question:

Suppose that I do have a clever algorithm that makes use of uninitialized integers. For example, a trick I have used in a past life was to have an integer set that had O(1) initialization cost regardless of its capacity. This worked by having two vectors of integers. One of them started out with size N but had uninitialized data. The other started out as size 0 but had initialized data. To add to the set, you checked one against the other (I can give details if desired). The key point is that you had to read and use an uninitialized integer in the process and compare it against initialized data -- is reading such an integer UB?

hanna-kruppe commented 5 years ago

The crossbeam example and the "O(1) initialization" integer set, as well as all other clever uses of uninitalized memory that I'm aware of (i.e., not just allocating uninitialized memory and initializing it at your own pace before using it) require operating on uninitialized bits. So if we want to allow them, we need to not only allow reading uninitalized memory but also make arithmetic and comparisons on them defined (and reasonably deterministic! undef results are worse than useless). That is not possible at all with current LLVM (except by initializing all memory, or pretending to do so in ways that will block lots of unrelated optimizations) and even if/when freeze and poison is adopted, it'll have some negative codegen impact and I see no way to restrict that impact only to modules that really need it.

So even though I agree that it would be best to support these things, I do not see a reasonable way to achieve that.

RalfJung commented 5 years ago

The crossbeam example and the "O(1) initialization" integer set, as well as all other clever uses of uninitalized memory that I'm aware of (i.e., not just allocating uninitialized memory and initializing it at your own pace before using it) require operating on uninitialized bits.

Notice that this only applies to the compare_exchange part of crossbeam. Loads and stores do not operate on uninitialized data and hence are mostly fine.

So we have a two-level discussion here:

I'd prefer the answer to the first question to be "no" so that the second question doesn't even need answering, but unfortunately @nikomatsakis has some pretty good arguments. ;) Just one word on these: at least under some proposed poison semantics for LLVM, poison acts on a value. That means if you load a (u8, u16) into a u32, if the padding was indeed uninitialized, you now have a fully uninitialized u32 -- if one byte of the memory is poison, the entire u32 becomes poison. So that would still not allow was crossbeam does. This behavior is, from what I know, one of the reasons why the proposal isn't officially adopted yet -- but if you don't do this, you lose some other optimizations. However, this question of the "granularity" of poison/uninit (per-value, per-byte, per-bit) is somewhat orthogonal to whether or not a u32 must be initialized, and there are uses of uninitialized integers that never run into this.

Coming to the second question, as @rkruppe said these patterns (I think you are talking about https://research.swtch.com/sparse, right?) are still not legal. Comparing an uninitialized integer with anything is either UB or produces an uninitialized boolean, branching on which is UB. But once LLVM has freeze, we could make them legal very easily and (I think) without bad impact elsewhere by adding a freeze intrinsic and telling people to use that.

hanna-kruppe commented 5 years ago

But once LLVM has freeze, we could make them legal very easily and (I think) without bad impact elsewhere by adding a freeze intrinsic and telling people to use that.

Hm, if getting people to use such an intrinsic is acceptable (I think the biggest source of worry is people reasoning naively about uninitalized as "initialized to an arbitrary bit string" and not even checking), then we can already build such an intrinsic today, it just has to do something that all LLVM optimizations have to assume could initialized the memory (e.g., some inline asm). This will have some unfortunate impact on optimizations unrelated to uninitialized memory, but it will still be localized to uses of that intrinsic.

RalfJung commented 5 years ago

I think the biggest source of worry is people reasoning naively about uninitalized as "initialized to an arbitrary bit string" and not even checking

Fair, but I see no model that makes this work.

This is actually also a reason why I'd prefer to not allow uninitialized data in integers -- that may be more surprising, but it is easier to explain and very concise: "No uninitialized data outside MaybeUninit".

If we allow uninitialized bits but then almost all operations are UB, it becomes something more like "No uninitialized data outside MaybeUninit, except if it is an integer or a raw pointer and you are not actually looking at the data, just loading and storing" and then we still have to explain that x * 0 is also UB even though it doesn't really have to "look at" x -- and then people will be "screw this, that's too complicated, I will just do whatever and write some tests".

Basically, we are breaking expectations anyway, and maybe it is better to break them more but in simpler ways, than to figure out how to break them as little as possible while still breaking them in ways that are much more complicated to explain. That seems plausible to me. Not sure if it makes any sense.^^

CAD97 commented 5 years ago

If uninitialized bits in an integer are made instantly invalid, is it possible to do the crossbeam::AtomicCell trick of making an atomic (u8, u16) by treating it as AtomicU32 at all? Is it possible to do that minus the CAS operation?

It seems like it should be possible to use an atomic (u8, u16) on a platform supporting 32 bit atomics. The only way I can see without requiring valid(T) \in initialized(u32) (and thus a very unsafe AtomicCell) is to provide some sort of mem::initialize_padding(&T).

I'm personally on the side of making uninitialized integers invalid so long as we don't lose anything (but mem::uninitialized) for it.

Amanieu commented 5 years ago

The case of AtomicCell<(u8, u16)> does indeed depend on UB at the moment. C++ solved this by essentially making it the responsibility of std::atomic<T> to magically ignore padding bytes (see this page, at the bottom). Note that this magic only works for structs: compare_exchange_strong is not guaranteed to ever converge with unions.

I've been toying with the idea of a clear_padding_bytes intrinsic that would support this use case, as well as things like writing a struct to a file without leaking information through the padding bytes:

/// Writes 0 to all padding bytes in `val` and returns a mutable slice of the bytes in `val`.
fn clear_padding_bytes<T>(val: &mut T) -> &mut [u8];
RalfJung commented 5 years ago

If uninitialized bits in an integer are made instantly invalid, is it possible to do the crossbeam::AtomicCell trick of making an atomic (u8, u16) by treating it as AtomicU32 at all? Is it possible to do that minus the CAS operation?

No.

Notice though that the state of this trick (even the load/store part) is dubious in LLVM as well -- poison is infecting the entire value when just one of the bytes loaded from memory is poisoned, and there are proposals to replace undef by poison.

C++ solved this by essentially making it the responsibility of std::atomic to magically ignore padding bytes

Well, "solved". As usual with C++, it is somewhat unclear what this actually means in an operational way. I would not call this a solution. One possible solution is to say that values get frozen before being compared, that at least removes the UB -- but it doesn't guarantee that a compare-exchange loop will ever terminate as they could get frozen to a different value each time. Or maybe they actually get frozen in memory, but that conflicts tons of real optimizations.

Writes 0 to all padding bytes in val and returns a mutable slice of the bytes in val.

I don't think this is implementable -- at run-time there is no way to distinguish padding bytes from initialized bytes. But a version of this which just picks arbitrary bit patterns for all uninitialized and padding bytes would be implementable, that's exactly what freeze does.

Amanieu commented 5 years ago

I don't think this is implementable -- at run-time there is no way to distinguish padding bytes from initialized bytes.

I don't see what the problem is? The compiler knows the layout of type T and therefore knows where all the padding holes are.

Well, "solved". As usual with C++, it is somewhat unclear what this actually means in an operational way. I would not call this a solution.

In an operational sense this would mean setting the padding bytes to 0 on all input values to an atomic operation. This will cause padding bytes to be "ignored" by the compare_exchange hardware instruction since they will always be 0.

RalfJung commented 5 years ago

The compiler knows the layout of type T and therefore knows where all the padding holes are.

Oh, you mean statically -- well, for enums that'll require dynamic checks as well. And of course this is hopeless for unions.

In an operational sense this would mean setting the padding bytes to 0 on all input values to an atomic operation.

That would also guarantee that the value you read has 0 for all padding bytes, which I am fairly sure they do not want to guarantee. And anyway I see no way to implement this behavior even remotely efficiently.

I think a more reasonable operational version amounts to saying that you freeze both values before comparing -- that at least avoids comparing uninitialized bytes, and it makes compiling to a simple CAS correct. But it allows spurious comparison failures and makes no guarantees that your retrying CAS loop will ever succeed, because you could see a different frozen value each time around the loop.

Also this assumes the atomic operation knows the correct type, whereas AFAIK for LLVM atomic operations only work on integer types -- at which point you cannot know which bytes are padding.

ghost commented 5 years ago

But it allows spurious comparison failures and makes no guarantees that your retrying CAS loop will ever succeed, because you could see a different frozen value each time around the loop.

I believe spurious comparison failures can be fixed like so:

// Assume `T` can be transmuted into `usize`.
fn compare_and_swap(&self, current: T, new: T) -> T {
    // Freeze `current` and `new` and transmute them into `usize`s.
    let mut current: usize = freeze_and_transmute(current);
    let new: usize = freeze_and_transmute(new);

    loop {
        unsafe {
            // `previous` is already frozen because we only store frozen values into `inner`.
            let previous = self.inner.compare_and_swap(current, new, SeqCst);

            // If `previous` and `current` are byte-equal, then CAS succeeded.
            if previous == current {
                return transmute(previous);
            }

            // If `previous` and `current` are semantically equal, but differ in uninitialized bits...
            let previous_t: T = transmute(previous);
            let current_t: T = transmute(current);
            if previous_t == current_t {
                // Then try again, but substitute `current` for `previous`.
                current = previous;
                continue;
            }

            // Otherwise, CAS failed and we return `previous`.
            return transmute(previous);
        }
    }
}

Now it's still possible to have a spurious failure in the first iteration of the loop, but the second one will definitely succeed (unless the atomic was concurrently modified). In fact, this is exactly how CAS in AtomicCell<T> already works, except we don't have a way to freeze values.

RalfJung commented 5 years ago

previous is already frozen because we only store frozen values into inner.

I think this is the key invariant here -- if you can make that work, then yes I can think there is a consistent semantics here. This requires making sure the contents are initialized, and also freezing before writing in store (and any other modifying instruction).

However, notice that you have AtomicCell::get_mut, so I do not believe it is possible to maintain the invariant that inner will always be frozen.

gnzlbg commented 5 years ago

We seem to be trying to answer two different questions here that are entangled.

One is whether integers and such can be uninitialized or not.

The other one, which is most fundamental, is whether we want to support using uninitialized memory outside unions or not, in general. I think we should answer this question first, and use the answer to drive the rest of the design.

We can't think of allowing uninitialized memory on integers without also considering that the validity of integers and raw pointers is going to be alike, so we are also allowing uninitialized memory on raw pointers. Raw pointers can point to DSTs, so we also need to be thinking whether we want to allow uninitialized memory in pointers to DSTs (for the whole pointer, some part of it, etc.).

I agree with @nikomatsakis that a lot of code is using mem::uninitialized today and we should weight the impact on that. I don't share their reluctance (yet), because I think deprecating mem::uninitialized doesn't break the world (the compiler and tools are going to be the same as the day before the deprecation). People code will continue working "as is", and they will have time to upgrade. I don't know what the cost of the upgrade will be, but reasoning about mem::uninitialized is hard, and the rule "no uninitialized memory outside unions" turns something that is hard into something that's relatively easy. It also potentially simplifies our "rules" for all other types (e.g. raw pointers, integers, etc.), and there is value in that.

The issue that some algorithms require uninitialized memory has shown up. The question that hasn't been answered yet AFAICT is whether MaybeUninit can be used to implement those algorithms or not. If it can't, then IMO that might hint at a design flaw in MaybeUninit (or maybe we need more helper types, e.g., AtomicMaybeUninit or similar), but that doesn't necessarily imply that banning uninitialized memory outside unions is a bad choice.

ghost commented 5 years ago

However, notice that you have AtomicCell::get_mut, so I do not believe it is possible to maintain the invariant that inner will always be frozen.

Just to clarify for everyone else following this thread: @RalfJung and I discussed this and the conclusion was that we'll simply remove get_mut() if that is necessary to make AtomicCell sound.

RalfJung commented 5 years ago

About whether uninitialized data is okay in integers at all: we discussed this at the all-hands.

The general consensus seems to be that we should permit uninitialized data in integers and raw pointers. There is just too much existing code doing stuff like

let x: [u8; 256] = mem::uninitialized();
// go on

or

let x: SomeFfiStruct = mem::uninitialized();
// go on

Both of these patterns would be insta-UB if we disallow uninitialized integers. That doesn't seem worth the benefit of better error-checking with Miri.

Incidentally, that matches what Miri already currently implements, mostly for pragmatic reasons (libstd already violated the rules about uninitialized integers when I wrote the checks -- I think I have since moved it to MaybeUninit).

gnzlbg commented 5 years ago

Both of these patterns would be insta-UB if we disallow uninitialized integers. That doesn't seem worth the benefit of better error-checking with Miri.

Was it also discussed whether this was worth the benefit of a simpler and more teachable correctness model for unsafe code ? (independently of whether this model can be better checked with miri or not?).

think I have since moved it to MaybeUninit).

I think so too.

RalfJung commented 5 years ago

Is "integers must be initialized" really that much simpler than "integers are allowed to not be initialized"?

gnzlbg commented 5 years ago

Is "integers must be initialized" really that much simpler than "integers are allowed to not be initialized"?

No, but I do think that "uninitialized memory is only allowed inside unions" is much much simpler and teachable than all other alternatives that are currently being discussed.

CAD97 commented 5 years ago

Another (drastic?) option to consider that allows let x: u32 = mem::uninitialized();:

Disallow uninitialized (poison) bits in integers (and pointers?). Make mem::uninitialized return freeze(poison). This makes mem::uninitialized behave as most initially intuit (a constant nondeterministic bit string) and gains us the "(true) uninitialized memory only inside unions (or padding)" property.

This may degrade some performance around uses of mem::uninitialized, but the intent is to move those over to mem::MaybeUninit anyway, right? And as previously mentioned, mem::uninitialized is a very dangerous tool to start with.

Also it could degrade talking about "uninitialized memory", because it then introduces both the "frozen uninitialized" behind mem::uninitialized and "poison uninitialized" behind mem::MaybeUninit and padding as being called "uninitialized" in the language. This could be just documented on the deprecated big-fat-warning-label mem::uninitialized for most of the points back, though. Example:

mem::uninitialized

Deprecated, do not use. Use mem::MaybeUninit for its clearer semantics instead.

For safety reasons, this function now returns "frozen" memory -- memory set to nondeterministic but consistent bits -- rather than truly uninitialized memory. Otherwise, merely storing the returned value in a variable would immediately break the validity invariants of any type except mem::MaybeUninit.

RalfJung commented 5 years ago

@CAD97 Good idea! Basically, once we have https://github.com/rust-lang/rust/pull/58363, we could reimplement mem::uninitialized in terms of MaybeUninit::initialized and then call ptr::freeze. That would make all the existing code that creates uninitialized integer arrays or FFI structs valid even if we decide that integers must be properly initialized. (I guess we could say we require them to be "frozen".)

@Amanieu suggested we should allow uninitialized values in integers only for "backwards compatibility reasons". This would have a similar effect. It might, however, incur a performance cost on such existing code. Porting code to MaybeUninit enables it use "unfrozen memory" and gain back the performance.

However, I suspect people will still want to keep memory unfrozen when e.g. calling a known Read::read implementation. That would mean that if we disallow uninitialized integers, we have to make the validity invariant of references shallow. (I am in favor of that anyway, but this is a contentious point.)

gnzlbg commented 5 years ago

Good idea! Basically, once we have rust-lang/rust#58363, we could reimplement mem::uninitialized in terms of MaybeUninit::initialized and then call ptr::freeze.

We should definitely do this.

That would mean that if we disallow uninitialized integers, we have to make the validity invariant of references shallow. (I am in favor of that anyway, but this is a contentious point.)

+1.

There is just too much existing code doing stuff like

let x: [u8; 256] = mem::uninitialized();

The freeze fix makes this "work" for Rust 2015 and Rust 2018, but unless I missed something the plan is still is to deprecate mem::uninitialized. If that's the case, we could prevent Rust 2021 crates from accessing it (even though the std lib still needs to ship it for inter-edition compatibility).

So I find the argument that "integers and pointers shall support uninitialized bit-patterns because there is too much code using mem::uninitialized()" weak. In Rust < 2021 uninitialized would be changed to freeze, and in Rust >= 2021 there could be no mem::uninitialized.

CAD97 commented 5 years ago

What about let x: [u8; mem::size_of::<T>()] = mem::transmute(t);? That's the other way to easily accidentally put padding bytes in an integer.

Do we care about that use pattern? (What's the sound equivalent if we don't allow this? [MaybeUninit<u8>; mem::size_of::<T>()]? The sooner MaybeUninit stabilizes the better.)

RalfJung commented 5 years ago

As another example for a use-case for moving around (references to) uninitialized integers: https://github.com/tbu-/buffer

The longer this goes on the more I get convinced that if we can avoid making things UB, we should. And if we can make something a requirement only "when the data actually gets used", we should (see https://github.com/rust-lang/unsafe-code-guidelines/issues/133 for a similar problem in a totally different context). Both of these point towards not making uninitialized bits in an integer insta-UB.

RalfJung commented 5 years ago

In particular, when a disappointed unsafe code author asks "but why oh WHY did you make this UB", I think it is good to be able to point to something. And for ruling out integers with uninitialized bits in them, we can't point at a lot.

From what I remember, we have "The formal semantics are nicer" and "Some sanitizers get more effective". I expect many people will not be impressed... worst-case, they'll just say "well but it works in practice so I will just ignore your UB" (and to my knowledge, it will work in practice, even if we rule it out).

gnzlbg commented 5 years ago

I think at this point I'd like to see the wording for the "integers with uninitialized bits are not UB" case.

The main advantage I see of making integers with uninitialized bits UB, is that miri and similar tools will point users at the source of the UB: where did that integer become uninitialized.

Without this, we have to define what are users allowed to do with an uninitialized integer and what they are not allowed to do. A tool will trivially report UB when the operations that are not allowed happen, but tracking and pin pointing to the point in the users source code that made that integer UB can be tricky.

they'll just say "well but it works in practice so I will just ignore your UB" (and to my knowledge, it will work in practice, even if we rule it out).

A huge downside of making uninitialized integers UB is that there is no way for us to, e.g., diagnose that in debug builds. We can't insert debug_assert!s or similar to check for that.

gnzlbg commented 5 years ago

Wait, I think this is wrong:

The main advantage I see of making integers with uninitialized bits UB, is that miri and similar tools will point users at the source of the UB: where did that integer become uninitialized.

How would miri diagnose, e.g.,

let mut v = Vec::<u8>::with_capacity(100);
unsafe { v.set_len(100) } // UB

If uninitialized integers are UB, then the Vec::set_len invokes UB because it creates 100 uninitialized integers, but it does so by changing the Vec's len field, so how would miri diagnose that ? That appears impossibly hard.

hanna-kruppe commented 5 years ago

My understanding is that miri wouldn't check validity of the Vec's elements int that code snippet (regardless of whether in that specific case that includes checking for uninitialized-ness). It might when coercing the Vec to a slice (depending on how we define validity for references), but certainly not just from changing the len field.

Lokathor commented 5 years ago

That's a bit of an interesting side question: Vec::with_capacity is calling RawVec::with_capacity which calls alloc on the generator that the RawVec is using. And the return pointer from alloc "may or may not have its contents initialized.", so isn't with_capacity at fault there? If there's any fault to be had at all, that it.

gnzlbg commented 5 years ago

@rkruppe

My understanding is that miri wouldn't check validity of the Vec's elements int that code snippet (regardless of whether in that specific case that includes checking for uninitialized-ness). It might when coercing the Vec to a slice (depending on how we define validity for references), but certainly not just from changing the len field.

So IIUC, set_len doesn't create any values. When the *mut T inside the vector is dereferenced is when the values are created (e.g. that would happen if one accesses the vector by index: v[0]). So if uninitialized integers are not ok, dereferencing the pointer would trigger the UB. If uninitialized integers are ok, dereferencing the pointer would not trigger the UB, but then taking a reference to the value would.

RalfJung commented 5 years ago

I think at this point I'd like to see the wording for the "integers with uninitialized bits are not UB" case.

One of my points is that the burden should generally be on whoever wants more UB to justify their desire. ;) The default should be to err on the side of less UB, all else being equal.

That said, my case is basically: the benefits are small, but there is a real cost.

On the benefit side, we don't get any new optimizations. We get slightly better "blame tracking" for when an incorrect use of an uninitialized integer does cause UB, but only with a very expensive analysis (valgrind/Miri). We also get slightly simpler formal rules. Don't get me wrong, I love simpler formal rules, but they don't outweigh any possible other consideration. ;)

On the cost side, we have the problem that many programmers have the intuition, for better or worse, that "any bit pattern is allowed in an integer", and hence you can e.g. transmute any 8-byte-sized datum to u64 for purpose of atomic accesses or serialization or whatever. Where possible, there is a lot of benefit in aligning our rules with programmer's intuition. Moreover, it's also useful to be able to use integers for small fixed-size chunks of arbitrary data: the alternatives are either unergonomic (MaybeUninit<[u8; 8]> or so) or non-existent (for atomic memory accesses). And there is also the cost of people basically giving up on UB and considering it a "purely theoretical concept" if we tell them that yes, their code has UB, but no, there's no way that will break their code (because no optimizations actually exploit it).

Also see Niko's case here.


Without this, we have to define what are users allowed to do with an uninitialized integer and what they are not allowed to do.

Literally anything except for a memcpy. Which in terms of MIR means: any cast, as well as applying any unary or binary operand to it, and using it as an index in a slice/array access.

A tool will trivially report UB when the operations that are not allowed happen, but tracking and pin pointing to the point in the users source code that made that integer UB can be tricky.

You mean "made the integer undef", I assume? So basically, you are saying that "blaming" whoever is responsible for creating the "bad" integer is harder when we only disallow it once an actual operation happened? I agree. That's what I meant above when I referred to "more effective sanitizers".

If uninitialized integers are UB, then the Vec::set_len invokes UB because it creates 100 uninitialized integers, but it does so by changing the Vec's len field, so how would miri diagnose that ? That appears impossibly hard.

Uh... I have no idea how you got there. Our memory is untyped. Only operations are typed. When you change the length of the vector, that's the only thing that happens. When you actually access the vector, that's when you "create an integer" (you are loading a range of bytes from memory at integer type).

The following code does not have UB:

let v: Vec<!> = Vec::new();
v.set_len(100);
mem::forget(v);

So IIUC, set_len doesn't create any values. When the *mut T inside the vector is dereferenced is when the values are created (e.g. that would happen if one accesses the vector by index: v[0]). So if uninitialized integers are not ok, dereferencing the pointer would trigger the UB. If uninitialized integers are ok, dereferencing the pointer would not trigger the UB, but then taking a reference to the value would.

Exactly. :) Except for the "taking a reference" part. If unintiailized integers are valid, then certainly are references to uninitialized integers.

But then adding 2 to that integer is UB.

gnzlbg commented 5 years ago

Exactly. :) Except for the "taking a reference" part. If unintiailized integers are valid, then certainly are references to uninitialized integers.

Yes, that's wrong, @rkruppe raised it on Discord. I was thinking "reference to uninitialized is invalid", but that's not correct if uninitialized is valid.

RalfJung commented 5 years ago

Looks like the widely-used bytes crate also deals with references to uninitialized integers, which would be made legal by deciding uninitialized integers are legal (taking some load off the reference discussion). [Source]

However, another thing I realized is that if we allow uninitialized integers, then calling this the "initialization invariant" becomes... weird.^^

RalfJung commented 5 years ago

Given that on Itanium, "uninitialized" (NaT, not-a-thing) registers can cause traps and whatnot, I wonder if saying that integers have to be initialized would be necessary and/or sufficient to make Rust compatible with hardware that can trap when using uninitialized registers the wrong way.

MaybeUninit<u*> can still hold uninitialized data, so it is not clear that disallowing this only for integers helps. And conversely, I suppose the compiler could make sure that the program does not ever see a "NaT" on Itanium, though I am not sure.

hanna-kruppe commented 5 years ago

If we ever have to care about an Itanium-like design (I really hope we won't), I don't think the resolution of this question will have any impact on that. As you note, "uninitialized data" is definitely allowed from other sources (and LLVM declares it legal to store undef/poison to memory, too) and so "uninitalized data" must not be NaT in any case. From the other direction, if necessary it should be possible to generate code that initialized possibly-NaT registers where necessary, and hopefully the costs of this or other workarounds will persuade architects that this is a misfeature.

gnzlbg commented 5 years ago

Since there appears to be consensus that uninitialized integer and floating-point values should be valid, I'm marking this as writeup-needed.

Ideally, the first step of the write up would be the rationale section. There has been a lot of discussion here and on internals, so we need to go through all of it, collecting pros / cons for allowing / not allowing valid uninitialized integers, and summarize those up.

Ideally we'll post such a summary here first so that everybody can see it and make sure that we do have strong consensus about uninitialized integers being valid.

Afterwards we can write down the value representation relation for these types.

RalfJung commented 5 years ago

One thing I am not sure we talked about: if we allow uninitialized bytes in integers, do we specify them to preserve which bytes are uninitialized? Like, when reading a u32 of which 3 bytes are initialized but 1 is not, and then storing it elsewhere, are the same 3 bytes initialized now or are all bytes uninitialized?

Some proposals for poison in LLVM involve making the entire integer poison if any input byte is poison; that would force us to say that the entire integer is uninitialized if any input byte is.

(EDIT: stupid GH decided to submit my post)

gnzlbg commented 5 years ago

Good question. I don't know, but one consequence of either is that, if a single uninitialized byte makes the whole integer uninitialized, the values that integers can take are [0..int::MAX, uninitialized]. If it doesn't, what would happen? For a given fully-initialized integer, would the different combinations of uninitialized bytes of it that one could construct be distinct values?

hanna-kruppe commented 5 years ago

Is there any reason not to go with "if any of the bytes in memory is uninit, the whole integer value read out is uninit" for now? It seems strictly more conservative (in particular re: LLVM poison semantics), and since programs can't test whether something is uninit (other than by causing UB if it is), I expect it would be backwards compatible to move to partially-uninit-integers later.

RalfJung commented 5 years ago

For a given fully-initialized integer, would the different combinations of uninitialized bytes of it that one could construct be distinct values?

They would have to to preserve that information, yes.

Is there any reason not to go with "if any of the bytes in memory is uninit, the whole integer value read out is uninit" for now? It seems strictly more conservative (in particular re: LLVM poison semantics), and since programs can't test whether something is uninit (other than by causing UB if it is), I expect it would be backwards compatible to move to partially-uninit-integers later.

Agreed. I just wanted to bring it up.

sdroege commented 4 years ago

Would this mean that a) creating a &mut [u8] from uninitialized memory would be fine, and b) even reading data from there would be ok? That would seem rather unfortunate as that would mean that safe Rust code would not always be valgrind-clean anymore.

I'm thinking of the case of passing an uninitialized buffer to Read::read() or AsyncRead for optimization purposes, where the implementer of the traits could do anything it wants with the passed in buffer, including to read from it.

RalfJung commented 4 years ago

Would this mean that a) creating a &mut [u8] from uninitialized memory would be fine, and b) even reading data from there would be ok? That would seem rather unfortunate as that would mean that safe Rust code would not always be valgrind-clean anymore.

Since "reading" means different things to different people, please give an example. :)

But notice that this is possible in safe Rust already, and might be considered "reading uninit memory" (I don't know what valgrind has to say about this):

let array = [MaybeUninit::<u8>::uninit(); 128];
let elem = array[42];
RalfJung commented 4 years ago

I'm thinking of the case of passing an uninitialized buffer to Read::read() or AsyncRead for optimization purposes, where the implementer of the traits could do anything it wants with the passed in buffer, including to read from it.

That is unsound and will be unsound under any option discussed here. To make this sound, we need something like "freeze". See https://github.com/rust-lang/rust/pull/58363.

sdroege commented 4 years ago

Since "reading" means different things to different people, please give an example. :)

Actually reading the data behind the uninitialized memory :)

let array = [MaybeUninit::<u8>::uninit(); 128];
let elem = &*(array[42].as_ptr());
println!("{}", elem);

As long as it's passed around as a MaybeUninit<u8> I assume it's all fine, but creating a &u8 from it (or a &[u8] from the whole array) is not fine anymore unless it's pointing at initialized data? Or is that also still fine as long as no dereferencing happens (I hope not)?

But notice that this is possible in safe Rust already, and might be considered "reading uninit memory" (I don't know what valgrind has to say about this):

valgrind is fine with that until we do something like ptr::read() or what my example above does.

Amanieu commented 4 years ago

(I don't know what valgrind has to say about this)

Valgrind only complains if the program performs a conditional branch which depends on uninitialized data.

RalfJung commented 4 years ago

Actually reading the data behind the uninitialized memory :)

Okay so you mean actually performing arithmetic/logic operations on it, not just reading it, stuffing it into a u8 and never using that again. I mean, memcpy is arguably reading, right...?

As long as it's passed around as a MaybeUninit I assume it's all fine, but creating a &u8 from it (or a &[u8] from the whole array) is not fine anymore unless it's pointing at initialized data? Or is that also still fine as long as no dereferencing happens (I hope not)?

Right now, conservatively, creating a &[u8] to uninit memory is UB.

The result of the discussion in this issue was that most likely this should not be UB.

However, either way, your code is UB. There are no plans to allow code to do anything with uninit integers besides loading and storing them from/in memory. This is where freeze comes in -- but that's a whole separate discussion, not fitting into this issue. (But yes, with "freeze" one could likely write safe Rust code that valgrind complains about. That is one of its many drawbacks. It also has some big advantages though.)