rust-lang / unsafe-code-guidelines

Forum for discussion about what unsafe code can and can't do
https://rust-lang.github.io/unsafe-code-guidelines
Apache License 2.0

Meaning of Undefined and Justification for UB #253

Closed chorman0773 closed 1 year ago

chorman0773 commented 3 years ago

From various responses, I am confused about the meaning of Undefined Behaviour in rust. Coming from a C++ background, and having done extensive personal research on undefined behaviour, I understand the term to be literal: behaviour which is not defined. In C++ and C it is explicitly specified as "behavior for which this International Standard imposes no requirements". In a number of specifications I have written, I have adopted similar wording. As far as I can tell, rust does not explicitly define the term, so I assumed it has the same meaning (and it seems to). In particular, this definition permits an implementation that assigns some meaning to undefined behaviour while still conforming to the standard/specification (as an example, see clang and gcc with union type-punning in C++). However, a comment on #84 leads me to believe this would not be valid in rust. If so, would it be reasonable to provide an explicit definition for the term, and is there a particular reason why a restricted interpretation of the term is beneficial to rust?

One point: I've noticed that UB has to be justified by the optimizations it enables. I would add that undefined behaviour was never intended to be a key to optimizations; it just happens that, as a result of its definition, the conformance clauses of the mentioned standards permit optimizations that assume UB does not occur. Rather, the original intent, at least as far as I can determine, was to provide an escape hatch for portions of the standard that either cannot be specified, or that the committee does not want to specify, because some reasonable implementation would not be able to provide a particular behaviour. If this is in fact the case in UCG, would it be reasonable to extend this justification to include reasonable implementations, not just optimizations, that are enabled as a result of the undefined behaviour?
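
To make the optimization angle concrete, here is a minimal sketch (illustrative, not from this thread) of the kind of transformation the "UB cannot happen" reading licenses in Rust:

fn read_first(p: *const u8) -> u8 {
    // UB if `p` is null or dangling, so from this point on a compiler
    // may assume `p` is non-null...
    let v = unsafe { *p };
    if p.is_null() {
        return 0; // ...and delete this branch as dead code.
    }
    v
}

An implementation that instead assigned a concrete meaning to the null dereference (a trap, say) would have to keep the branch; that trade-off is what the rest of this thread debates.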

RalfJung commented 3 years ago

intentionally

Yes, because at the time there was no better way (the original code predates even clarifying the validity invariant). But the fact that there is a "FIXME" indicates quite clearly that this is considered a hack, not a proper solution.

I have not looked at the RFC but given the choice is to change the language, not the implementation

The RFC is in fact a libs-only change.

comex commented 3 years ago

Standard libraries frequently do things that require explicit compiler support to be efficient, not necessarily because it would be impossible otherwise (though, as mentioned, one reason for something being in the standard library is that it's impossible to implement in the language itself). For example, while not UB, clang and gcc define an intrinsic to implement std::make_integer_sequence<T, N> in fewer than O(N) template instantiations (I think it's O(log N); lccc does it internally in O(1)).

They do, and Rust uses lang items for that purpose. However:

chorman0773 commented 3 years ago

Rust uses lang items for that purpose

That and intrinsics, indeed. Though this still raises the question of what the difference is between an unstable lang item/intrinsic/language feature and undefined behaviour explicitly provided by an extension. As mentioned, lccc has distinct lang items from rustc (though it shares some when they relate to the same feature; for example, #[lang = "sized"] is still used to define the Sized trait), and it also has different intrinsics (again sharing common ones, though some are renamed). It's all simply privileged code, doing privileged things that you can't do portably (and maybe not at all, depending on whether it's documented).

anything that requires compiler magic should be in libcore

That's fair on the requirement side (from a cursory look through the documentation, it looks like Box may be the only thing outstanding). However, compiler magic should always be available to optimize code, particularly when you, as the standard library writer, are aware that the compiler itself may not optimize something as well. Sometimes this involves exploiting something that is actually UB but that the compiler does not treat as such (in the example, that a reference to uninit isn't allowed, or at least probably wasn't at the time). This isn't even technically limited to standard libraries. For example, in the lccc code itself, written in C++, I conditionally insert a call to a builtin used to optimize slice::from_raw_parts{,_mut} to make those same optimizations. In fact, this is one of the reasons I think the compiler should be discoverable by rust code, so that, when available, programs can selectively use non-portable constructs as optimizations and fall back on other similar constructs, or on portable ones, as sketched below.
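
A sketch of that discoverability idea, assuming a hypothetical cfg(lccc) predicate and a hypothetical builtin (neither exists today; both names are invented):

// Hypothetical: `cfg(lccc)` and `__builtin_slice_hint` are invented names
// standing in for compiler detection and a non-portable optimization hook.
#[cfg(lccc)]
unsafe fn make_slice<'a, T>(data: *const T, len: usize) -> &'a [T] {
    ::__lccc::builtins::__builtin_slice_hint(data, len)
}

// Portable fallback used on every other implementation.
#[cfg(not(lccc))]
unsafe fn make_slice<'a, T>(data: *const T, len: usize) -> &'a [T] {
    core::slice::from_raw_parts(data, len)
}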

RalfJung commented 3 years ago

Though this still raises the question of what the difference is between an unstable lang item/intrinsic/language feature and undefined behaviour explicitly provided by an extension.

I am not sure why you cannot accept the fact that the rustc devs and the people specifying Rust consider it an antipattern to explicitly define any part of UB via any means.^^ Adding an unstable intrinsic is a tiny extension, whereas saying that integers may be undefined is a global change that fundamentally alters the Abstract Machine of the language that rustc actually implements (and that rustc optimizations have to be checked against).

chorman0773 commented 3 years ago

Adding an unstable intrinsic is a tiny extension

For something like reachability of pointers, I would consider that similarly small. In particular, an intrinsic can be created to get a pointer to an enclosing repr(C) structure from a pointer to its first element (in fact, lccc does have such an intrinsic, ::__lccc::builtins::cxx::__builtin_launder). lccc simply considers a pointer cast to do the same, primarily because it's a fundamental side effect of the implementation (as applied to pointers becomes the IR operation convert reinterpret, which also implements reinterpret_cast from C++, so as a side effect the semantics become the union of the valid operations). The intermediate representation itself is to have a specification, which is used by frontends that implement various languages, and by optimizers and code generators, to ensure that the latter two can be independent of the source language without being incorrect, and that the frontend can be independent of the requested optimizations and code generation. Saying this definition applies is a rather trivial extension, considering that the implementation is already enjoined from making optimizations that contradict it. Portable code cannot rely on this, but code that knows this applies can exploit it, much the same as code that knows a particular intrinsic exists can exploit it.
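
Concretely, the cast in question looks something like this (type names are illustrative); under the lccc extension it is defined because a repr(C) struct and its first field are pointer-interconvertible, while portable Rust makes no such promise:

#[repr(C)]
struct Wide {
    first: u32, // offset 0: same address as the whole struct
    rest: u64,
}

// Recover a pointer to the enclosing struct from a pointer to its
// first field. Defined under the extension described above; the
// provenance rules of portable Rust do not bless this.
unsafe fn enclosing(first: *mut u32) -> *mut Wide {
    first as *mut Wide
}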

As another example, miri is an implementation that provides definitions for certain kinds of undefined behaviour; it happens that the definition is that miri traps in those cases. This is a valid definition, though, and one employed by all forms of dynamic analysis tools. This is why I prefer the C++ definition ([intro.abstract] clauses 4 and 5), that the implementation is not limited in how it chooses to evaluate a program that has undefined behaviour, rather than saying implementations can assume it cannot happen. The former allows a much broader interpretation, as the latter can be (and is) a derived consequence of the former, but the former also concedes that implementations can assign meaning, including by trapping on detection.

bjorn3 commented 3 years ago

For something like reachability of pointers, I would consider that similarly small. In particular, an intrinsic can be created to get a pointer to an enclosing repr(C) structure from a pointer to its first element (in fact, lccc does have such an intrinsic, ::__lccc::builtins::cxx::__builtin_launder).

For such an intrinsic to have the semantics you want, you would need to modify the abstract machine a lot more than you seem to think. For example, adding the try intrinsic is not a simple modification. It requires adding the concept of unwinding to the rust abstract machine. Another example is the atomic intrinsics; those require adding support for a weak memory model. Sure, many intrinsics are small incremental additions that don't influence the rest of the abstract machine, such as transmute (except for the size check, implementable using only unions) and checked math (implementable by first checking for overflow before performing the operation), but some intrinsics require fundamental changes to the rust abstract machine.
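
For example, the checked math case can be sketched without any intrinsic, by testing for overflow before performing the operation (illustrative; the real implementation uses an intrinsic so the backend can emit an overflow-flag check instead):

// Checked signed addition with no compiler magic: reject any inputs
// that would overflow before doing the (now provably safe) addition.
fn checked_add_i32(a: i32, b: i32) -> Option<i32> {
    if (b > 0 && a > i32::MAX - b) || (b < 0 && a < i32::MIN - b) {
        None
    } else {
        Some(a + b)
    }
}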

The former allows a much broader interpretation, as the latter can be (and is) a derived consequence of the former, but the former also concedes that implementations can assign meaning, including by trapping on detection.

If the implementation can assume that UB doesn't happen, it is free to do anything it wants for things that would be UB, including inserting traps, as those things "wouldn't happen" anyway.

RalfJung commented 3 years ago

In fact, Miri is an excellent example for how an implementation can provide explicit guarantees for what happens on UB (raise an error) while still conforming with the formal spec which says "the implementation may assume that UB does not occur". Miri is just rather defensive about that assumption and double-checks instead of just trusting the programmer.

The wording for UB given in the UCG glossary totally allows for this possibility. In my view "the implementation is not limited in how it chooses to evaluate a program that has undefined behaviour" is a consequence of "the implementation may assume UB does not happen", not vice versa. My proposed wording is a logical implication ("if the program does not exhibit UB, then the implementation will realize program behavior"), and as usual with implications, they impose no restrictions on what happens when the antecedent is false -- in this concrete case, this means imposing no restrictions on what happens when the program has UB. But I think viewing UB as a proof obligation explains much better why it is so useful, and how programmers can work with it (by checking the proof obligations everywhere). This works particularly well in Rust, where we expect each unsafe operation to state the requirements imposed on the caller -- those are the exact same kind of proof obligation!

chorman0773 commented 3 years ago

For such an intrinsic to have the semantics you want, you would need to modify the abstract machine a lot more than you seem to think

Fair. However, in the presence of such an intrinsic, making pointer casts semantically have the same effect is not a fundamental change; it's a choice in how to implement the latter. The intrinsic mentioned is not a question of how it affects the abstract machine of rust: it exists, and it is impossible for it not to exist (because of how name resolution of builtins is defined). The builtin exists to satisfy a requirement of the C++ abstract machine, std::launder, and because it needs to affect optimizations, the IR needs to be aware of it. As a result and side effect, it can be named from rust code that uses the lccc_intrinsic_crate feature (and even if it couldn't, the same feature could be used to call the ::__lccc::xir! macro, used in libcore to emit explicit IR).

It requires adding the concept of unwinding to the rust abstract machine

Didn't the rust abstract machine have that model before rustc added that intrinsic? Also, I presume we can ignore in this argument any and all intrinsics whose existence is implied by the standard library itself. The standard library itself is part of the abstract machine, after all. The fact that core::intrinsics::transmute exists has no bearing on the semantics of the abstract machine (beyond being nameable and an alternative spelling of the latter), because core::mem::transmute is defined.

it is free to do anything it wants for things that would be UB, including inserting traps

... including to assign a particular, well-defined meaning to it? My point isn't that it can break some optimization performed by the compiler, but that such an optimization, in the case of lccc, would be valid anyway.

In my view "the implementation is not limited in how it chooses to evaluate a program that has undefined behaviour" is a consequence of "the implementation may assume UB does not happen", not vice versa

Similarly fair. It can be viewed either way, and the former has definitely come to imply the latter. The fact that UB never happens is one of the best truths a compiler writer has at their disposal. However, just as math can be built upon by removing restrictions, so too can programming languages. It's still a rust implementation because it fits within the behaviour prescribed by the rust abstract machine (provided that behaviour can be worked out).

bjorn3 commented 3 years ago

It requires adding the concept of unwinding to the rust abstract machine

Didn't the rust abstract machine have that model before rustc added that intrinsic?

The intrinsic has existed ever since https://github.com/rust-lang/rust/commit/c35b2bd226736925961ca6853b2ef29e8094cd90. Before that, it directly used a function written in LLVM ir, which you could also consider a kind of intrinsic.

Also, I presume we can ignore in this argument any and all intrinsics whose existence is implied by the standard library itself.

Almost all intrinsics are implied by the standard library itself.

The standard library itself is part of the abstract machine, after all.

No, it is not. The rust abstract machine defines how MIR is executed. The standard library merely exposes some parts of the rust abstract machine that can't directly be accessed in a stable way using the user-facing language that lowers to MIR. Saying that the standard library is part of the rust abstract machine is like saying that all existing unstable code that compiles with the current version of rustc is part of the abstract machine because it can access intrinsics. It is like saying that all C/C++ code is part of the C/C++ abstract machine because it can call intrinsics.

chorman0773 commented 3 years ago

This works particularly well in Rust, where we expect each unsafe operation to state the requirements imposed on the caller -- those are the exact same kind of proof obligation!

This is also an expectation, at least in how I document anything. However, if I accept a raw pointer that must be valid, I don't say that I may assume it is; I write something like

Preconditions: ptr shall point to an object or be a null pointer

or

Preconditions: ptr shall point to an object, point past the end of an object, or be a null pointer.

(with the last part optional). This is sufficient to express that the result is undefined behaviour, according to the library in question, if ptr does not satisfy either.
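
In Rust convention, the same precondition is normally written as a # Safety section on the unsafe function. A minimal sketch (the function itself is illustrative, not from any particular library):

/// Reads the value behind `ptr`.
///
/// # Safety
///
/// `ptr` shall point to a live, properly aligned object of type `T`;
/// in particular it shall not be null or dangling.
unsafe fn read_object<T: Copy>(ptr: *const T) -> T {
    *ptr
}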

chorman0773 commented 3 years ago

It is like saying that all C/C++ code is part of the C/C++ abstract machine because it can call intrinsics.

The C++ standard library is part of the C++ abstract machine. It is defined as part of the document that says "The semantic descriptions in this document define a parameterized nondeterministic abstract machine" ([intro.abstract] clause 1). The standard library isn't just another user-defined library; it's a fundamental portion of how rust is specified. core::mem::zeroed is what introduces the ability to initialize values with an all-zero bit pattern, not the core::intrinsics::init function that implements it, because the latter isn't part of the rust language; it's part of the implementation in rustc. Similarly, the function __builtin_launder in gcc and clang, or its fully qualified version from lccc (::__lccc::builtins::cxx::__builtin_launder), is not part of the C++ abstract machine; std::launder is, and the former implements the latter. The definition of ::core::mem::drop is not

fn drop<T>(x: T){}

It is

Disposes of a value. This does so by calling the argument's implementation of Drop.

How drop(x) drops the owned x is entirely the business of the implementation, not of the standard library definition, but the semantics are that x is moved or copied into the function call, then dropped. I can implement drop similarly as

fn drop<T>(x: T) {
    // Run the destructor through a raw pointer, then suppress the
    // automatic drop so it doesn't run a second time.
    unsafe { core::ptr::drop_in_place(&x as *const T as *mut T) }
    core::mem::forget(x);
}

This would be a stupid, but perfectly valid implementation for core::mem::drop.

bjorn3 commented 3 years ago

core::mem::zeroed is what introduces the ability to initialize values with an all-zero bit pattern, not the core::intrinsics::init function that implements it, because the latter isn't part of the rust language; it's part of the implementation in rustc.

core::mem::zeroed is defined to be equivalent to MaybeUninit::zeroed().assume_init() and in fact is implemented this way. MaybeUninit::zeroed() is implemented without intrinsics as:

let mut u = MaybeUninit::<T>::uninit();
// SAFETY: `u.as_mut_ptr()` points to allocated memory.
unsafe {
    u.as_mut_ptr().write_bytes(0u8, 1);
}
u

where MaybeUninit::uninit is implemented as MaybeUninit { uninit: () }, so this is a bad example as it can be implemented as regular code.

In my opinion, the fact that intrinsics are an implementation detail used to implement certain standard library functions doesn't mean that the rust abstract machine includes the standard library. The rust abstract machine is, in my opinion, solely implemented by the rust compiler. IMO it includes stable parts defining how stable code works, and it includes unstable parts used to implement the standard library. The standard library simply depends on certain unstable parts of the rust abstract machine, the same way it depends on certain unstable language features like specialization. These unstable parts can differ from compiler version to compiler version, or even from compiler to compiler.

The specific LLVM version is not a part of the rust abstract machine, not even an unstable part, as the same rustc version can be compiled against a wide variety of LLVM versions. This means that it is not ok for the standard library to depend on UB that just so happens to not cause a miscompilation on a specific LLVM version. Thanks to rustc_codegen_cranelift (cg_clif) (author here) even the existence of LLVM itself is not part of the rust abstract machine. Not even an unstable part.

The only part of the standard library that could be considered part of the rust abstract machine is stdarch (core::arch). This contains a lot of platform intrinsics for simd that directly use llvm intrinsics. Every other bit of the standard library is completely agnostic to the codegen backend. Because of this combined with the fact that all functions in stdarch are marked as #[inline] and thus only codegened when used, cg_clif is able to compile the standard library without changing it in any way despite using Cranelift instead of LLVM.

Besides, not giving the standard library a privileged position makes it easier to understand how it works and makes it safer to just copy snippets from it into your own code.

chorman0773 commented 3 years ago

core::mem::zeroed is defined to be equivalent to MaybeUninit::zeroed().assume_init()

Exactly, defined. It isn't necessarily implemented in terms of it (in fact, in lccc, it's the other way around: MaybeUninit::zeroed() is implemented in terms of core::mem::zeroed).

The abstract machine is the sum of the behaviour specified by the specification. If the specification includes the standard library, then the standard library is part of that abstract machine. The standard library isn't part of a program; it's something that exists because of the specification and has its semantics defined by the specification, so its semantics fall definitively under the abstract machine, even if those semantics can be perfectly replicated in user-written code. Were it not part of the specification, it wouldn't be a violation of the as-if rule not to provide the standard library at all. The implementation of the standard library or compiler has absolutely no bearing on the abstract machine.

Saying that the compiler defines the abstract machine is a great way to reduce the possibility of a competing implementation, something I am very much against. The compiler should have an argumentative position, saying what can and cannot be done, but should not have the position of defining the behaviour. Intrinsics and lang items would fall under a specification like: "The implementation may provide any number of unspecified unstable features, with unspecified semantics, when enabled by a crate-level attribute declaring the feature. If the implementation does not support the feature, or does not support enabling features, the program is ill-formed."

Besides, not giving the standard library a privileged position makes it easier to understand how it works and makes it safer to just copy snippets from it into your own code.

Inherently, it has privilege, because it can access private implementation details such as intrinsics and lang items. I certainly wouldn't want to just abuse extensions without saying anything; maybe I'd include something that mentions the extension and its non-portability. But people already cannot simply copy just anything from the standard library, because it may be feature-gated. For example, code that uses the pointer-interconvertibility rule would have this:

// SAFETY: This is sound because lccc permits casts between pointers to *pointer-interconvertible* objects,
// And we know we have a borrow-locked field of `WideType` *pointer-interconvertible* with *narrow. 
// This is an extension and not portable to other implementations.
// stdlib can do this because it expects only to be running on lccc. 
// Note: See C++ Standard [expr.static.cast], clause 13, as well as xlang ir specification, [expr.convert.strong] and [expr.derive.reachability] for details on the validity of this cast.
let wide = unsafe { &mut *(narrow as *mut WideType) };

bjorn3 commented 3 years ago

Exactly, defined. It isn't necessarily implemented in terms of it (in fact, in lccc, it's the other way around: MaybeUninit::zeroed() is implemented in terms of core::mem::zeroed).

In a regular crate, you may also have one function defined to be equivalent to another. Whether it is or not is simply an implementation detail, not a part of the rust abstract machine.

The standard library isn't part of a program; it's something that exists because of the specification and has its semantics defined by the specification, so its semantics fall definitively under the abstract machine, even if those semantics can be perfectly replicated in user-written code.

The standard library is not an intrinsic part of the rust language. You can very well use the language without it, albeit only on nightly rustc versions. In fact, I know of at least one past alternative standard library, called lrs-lang. While it uses internal compiler interfaces, cg_llvm, cg_clif, and miri wouldn't have to be changed to make it run; it would only need to be updated for the latest version of all the unstable interfaces it uses.

Saying that the compiler defines the abstract machine is a great way to reduce the possibility of a competing implementation, something I am very much against.

What I say is that the abstract machine is kind of split in two parts: a stable part that all rust code can use, and an unstable part that is used to implement the standard library. Both parts need to have well-defined semantics, but the semantics of the unstable part may change between rustc versions. lccc may decide to have a different unstable part and would thus need to change the standard library. That doesn't mean that the standard library is allowed to circumvent the rust abstract machine; it can only use well-defined interfaces like intrinsics.

Inherently, it has privilege, because it can access private implementation details such as intrinsics and lang items.

Yes, it has extra privileges, as it can use unstable interfaces. It just shouldn't do things that normal code isn't allowed to do, except through those unstable interfaces. The data structure implementations don't need intrinsics; they can just use stable functions. This means that copy-paste should just work (after removing stability attributes). If those data structures were to use knowledge about the specific codegen backend to, for example, violate the aliasing rules in such a way that the codegen backend doesn't miscompile them, then copy-paste would cause problems for users later down the line. As such, the standard library shouldn't violate those rules, even if it technically can. This is what I mean when I say it is not ok for the standard library to depend on UB. It is completely fine to use implementation-defined intrinsics, but don't ever cause UB.

chorman0773 commented 3 years ago

Whether it is or not is simply an implementation detail, not a part of the rust abstract machine.

If the crate defines as part of its API that the functions are equivalent, then that definition is not an implementation detail. It's not part of the rust abstract machine either, but it's part of the public API specification for that crate, just as the standard library is, and should be, a part of the rust specification. Of course, the specification does not bind a particular implementation, and it is up to the particular implementation whether to write one in terms of the other, or both in terms of the same thing, or as completely independent implementations. And if one is in terms of the other, it's also up to the particular implementation which way around that is done.

The standard library is not an intrinsic part of the rust language. You can very well use the language without it, albeit only on nightly rustc versions

By saying this, it could be deduced that the existence of libcore, liballoc, and libstd, or the content thereof, is optional to provide. In fact, the opposite is true: they are only unavailable upon explicit request. Unless you specifically have #![no_std] or likewise #![no_core] in a crate root, all three must be provided and in the extern prelude; this is part of the specification (core must always be provided, absent the use of #![no_core]). Further, the content and semantics of these crates are defined by the same specification. As stated, with C++, we can deduce that standard library semantics are part of the abstract machine, because they are not excluded from the definition in [intro.abstract] clause 1 ("The semantic descriptions in this document define a parameterized nondeterministic abstract machine"; it does not single out particular semantic descriptions or qualify the sections, so the semantic descriptions in [library] through [thread], which make up "the standard library", fall under the same). If rust chooses the same, which implies that the as-if rule extends to that content, then the same reasoning applies.
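
For reference, the opt-out being described is a one-line attribute; a minimal sketch of a library crate that declines libstd while core stays in the extern prelude (dropping core as well requires the unstable #![no_core]):

#![no_std]

// `core` is still available without any `extern crate` declaration.
pub fn clamp_to_byte(x: u32) -> u8 {
    core::cmp::min(x, 255) as u8
}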

It is completely fine to use implementation-defined intrinsics, but don't ever cause UB.

An implementation can introduce undefined behaviour to a program, provided it still behaves as if it did not. llvm does this when it hoists certain operations that may be undefined out of loops, and gives limited meaning (by producing poison instead of immediate undefined behaviour) to the operations. Doing the optimization manually would be UB. Here I'm not even talking about cases where the codegen just happens not to miscompile; I'm talking about honestly well-defined extensions, a promise that the same will never start to miscompile on that compiler. Copy-paste from the standard library is already non-portable, because the same unstable features may not be available, or may be implemented differently, so copy-pasting code that clearly contains an extension (written as such) will not alter this behaviour. Part of this is done to satisfy zero-cost abstractions: a manual implementation could not possibly be more efficient than the provided one. If we admit that the compiler can change its implemented abstract machine by introducing unstable features, then the ability to assign meaning to undefined behaviour seems to be a natural extension of that.

bjorn3 commented 3 years ago

An implementation can introduce undefined behaviour to a program, provided it still behaves as if it did not.

Then it is not UB.

llvm does this when it hoists certain operations that may be undefined out of loops, and gives limited meaning (by producing poison instead of immediate undefined behaviour) to the operations. Doing the optimization manually would be UB.

It is only UB to use poison in a way that can't propagate the poison further (through, for example, a math operation). The mere existence of a poison value is not UB.

If we admit that the compiler can change its implemented abstract machine by introducing unstable features, then the ability to assign meaning to undefined behaviour seems to be a natural extension of that.

Unstable features require an explicit opt-in and only work on nightly. Using undefined behaviour that has been assigned a meaning by a specific implementation doesn't, however, require an opt-in. If you copy-paste code that uses an unstable feature, it will simply not compile when the unstable feature is not available. If you copy-paste code that relies on a meaning a specific implementation has assigned to certain UB, it will still compile on implementations that assign no such meaning; instead it may or may not behave in strange ways at runtime. This is much, much worse than not compiling.

chorman0773 commented 3 years ago

Then it is not UB.

Yes, yes it is. The compiler doesn't decide what behaviour is undefined; it only gets to decide what to do about it (though it has effectively unlimited choice in that). A crucial point of [intro.abstract] clause 1 is that the implementation does not have to implement or even emulate the abstract machine, only that it must emulate its observable behaviour. This is the as-if rule, not the first sentence, which merely says the abstract machine exists.

It is only UB to use poison in a way that can't propagate the poison further through for example a math operation. The mere existence of a poison value is not UB.

If the particular operation caused, for example, signed overflow in C, then the programmer could not write the same optimization by hand, even though llvm performed it, because the resulting transformed program behaved, with respect to its observable behaviour, as if it were evaluated strictly.

If you copy-paste code that uses an unstable feature, it will simply not compile when the unstable feature is not available

Features can change meaning without changing names, or even changing syntax. Possibly because of an evolution of the feature, or because two features were written independently (which is why lccc qualifies its feature names not explicitly taken from rustc, so as to reduce the chance of someone else writing the same feature, unless they are implementing the same one from lccc). If you used the lang_items feature, you could certainly run into strange results, especially if the compiler isn't written to reject unknown lang items (or if the lang item name is common but has distinct meanings). For example, the rustc definition of core::ptr::drop_in_place, which is

#[lang = "drop_in_place"]
pub unsafe fn drop_in_place<T: ?Sized>(ptr: *mut T) {}

has an entirely different meaning on lccc, because on lccc the lang item simply designates the function itself; it doesn't result in special behaviour.

A possibly reasonable thing would be to warn on these casts, similarly to gcc's -pedantic. So #![allow(lccc::pedantic)] and inversely #![deny(lccc::pedantic)] could be used to control this (in fact, I intend something similar, though the details are far from worked out, and it's one of the least important things on the road map). In that case, copying the code would give you (absent an explicit allow or deny) something like the following, on the standard CLI, which uses GNU-style warnings: "This code has undefined behaviour in the rust language, and is provided only as an extension. It is unlikely to be portable to other compilers (note: from -Wextend-provenance, which is enabled by default. This may be disabled with #[allow(lccc::extend_provenance)])". (Likewise, with deny/forbid it would instead mention -Werror=extend-provenance.)

bjorn3 commented 3 years ago

If the particular operation caused, for example, signed overflow in C, then the programmer could not write the same optimization by hand, even though llvm performed it, because the resulting transformed program behaved, with respect to its observable behaviour, as if it were evaluated strictly.

LLVM has a special flag to forbid signed overflow. If the optimization would cause signed overflow, then it has to remove this flag. That clang doesn't expose a way to remove this flag doesn't mean that removing it in an optimization is UB.

Features can change meaning without changing names, or even changing syntax.

True. In that case the feature is technically still available; it just has different behaviour. What I am mainly concerned about is stable code that doesn't have access to unstable features.

A possibly reasonable thing would be to warn on these casts, similarly to gcc's -pedantic.

The cast itself is completely valid. It is just when you dereference it that there is a problem. This dereference could happen arbitrarily far away from the cast. A lint for this without false positives would need to use a theorem prover. If you also want to avoid false negatives, you will have to solve the halting problem.

chorman0773 commented 3 years ago

LLVM has a special flag to forbid signed overflow.

add nsw. The result is poison if the addition would wrap into the sign bit. The nsw isn't removed; the result is simply demoted from immediate UB to poison. The optimization does not have to remove the flag, because if it occurs and would introduce new UB, the resulting poison is unused.

This dereference could happen arbitrarily far away from the cast.

The particular example would be if you dereference the pointer resulting from this cast. Also, dereferencing a miscasted pointer can already happen very far away. miri could be run to detect that, as it does not have the same behaviour (however, it's run on the rustc libstd, as miri would not accept a significant portion of the standard library used by lccc, primarily because there is no one-to-one correspondence between MIR and XIR). I didn't say it's necessarily a perfect idea, but I do agree it's better than code just devolving from it. I also rarely copy stuff from standard libraries, because I know they do some things that you probably shouldn't ever attempt in well-behaved code. I would presume the big, obvious SAFETY comment that says "This is an extension and not portable" would be sufficient to inform people of this fact, and the warning from copying the code verbatim would reinforce it. As mentioned, the extension exists even if I don't flat out say it does, because black-box code could perform the xir convert reinterpret op, which performs the exact operation I describe (because there isn't a point to implementing it differently; the optimizations are still enjoined).

DemiMarie commented 3 years ago

I haven’t read the discussion, but I personally prefer Ada’s term for UB: “erroneous execution”.

mohtasham9 commented 3 years ago

I think it should be a valid option for an implementation to say that an instance of implementation-defined behavior is in fact undefined behavior. Overflow seems like a good candidate for that. You could perhaps set flags so that overflow traps (with an attendant definition of what this entails for the machine state), or wraps, or is UB.

Of course, if you are writing portable code, then as long as any implementation defines it as UB, you the programmer can't trust it to be anything more than that; but you could use #[cfg] flags and such to do the right thing on multiple platforms or implementations.
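
For the overflow example specifically, Rust today already lets code pick its semantics explicitly instead of relying on what an implementation defines; a small sketch:

fn overflow_menu(a: i32, b: i32) {
    let _wrap = a.wrapping_add(b);   // defined: two's-complement wrapping
    let _check = a.checked_add(b);   // defined: None on overflow
    let _trap = a.checked_add(b).expect("overflow"); // defined: panic
    // UB on overflow; the caller must guarantee it cannot happen.
    let _unchecked = unsafe { a.unchecked_add(b) };
}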

chorman0773 commented 3 years ago

Technically, unbounded implementation-defined behaviour can be equivalent to documented UB. However, in almost all cases, implementation-defined behaviour (likewise unspecified behaviour) is constrained, either explicitly or implicitly. For example, C says "the size and alignment requirement of the type int are implementation-defined", but the answer to the question "what is sizeof(int)?" cannot be "undefined behaviour"; it has to be an actual value (generally 2 or 4, depending on the platform).


MikailBag commented 3 years ago

I think it should be a valid option for an implementation to say that an instance of implementation defined behavior is in fact undefined behavior.

But what is the difference between "undefined behavior" and "implementation-defined behavior which can in fact be undefined behavior"? In both cases the behavior can be undefined, and undefined behavior can be replaced with any other behavior, so the two terms seem equivalent.

Additionally, it means that all operations with implementation-defined semantics would have to be unsafe, including the integer arithmetic operators, which would be a massive breaking change.

chorman0773 commented 1 year ago

I'm closing this ahead of triage.

IMO, the question it poses is answered, and it's also a bit garbage in the comments (sorry about that).