
Meaning of Undefined and Justification for UB #253

Closed chorman0773 closed 1 year ago

chorman0773 commented 3 years ago

From various responses, I am confused about the meaning of Undefined Behaviour in Rust. Coming from a C++ background, and having done extensive personal research on undefined behaviour, I understand the term to be literal: behaviour which is not defined. In C++ and C it is explicitly specified as "behaviour for which this International Standard imposes no requirements". In a number of specifications I have written, I have adopted similar wording. As far as I can tell, Rust does not explicitly define the term, so I assumed it has the same meaning (and it seems to have that same meaning). In particular this definition permits an implementation which assigns some meaning to undefined behaviour, while still conforming to the standard/specification (as an example, see clang and gcc with union type-punning in C++). However, a comment on #84 leads me to believe this would not be valid in Rust. If so, would it be reasonable to provide an explicit definition for the term, and is there a particular reason why a restricted interpretation of the term is beneficial to Rust?

One point, I've noticed that UB has to be justified by the optimizations it enables. I would add that undefined behaviour was never intended to be a key to optimizations, it just happens that as a result of its definition, the conformance clauses of the mentioned standards permit optimizations that assume UB doesn't occur. Rather, the original intent, at least from what I can determine, was to provide an escape hatch for portions of the standard that either cannot be specified or that the authors did not want to specify, because some reasonable implementation would not be able to provide a particular behaviour. If this is in fact the case in UCG, would it be reasonable to extend this justification to include reasonable implementations, not just optimizations, that are enabled as a result of the undefined behaviour?

Diggsey commented 3 years ago

UB is the same in Rust as it is in C++.

In particular this definition permits an implementation which assigns some meaning to undefined behaviour

A compiler implementation could specify what happens for some subset of programs which have UB according to the Rust language. However, this is out of scope when it comes to specifying Rust itself, and it does not mean that the program itself becomes valid Rust.

One point, I've noticed that UB has to be justified by the optimizations it enables. I would add that undefined behaviour was never intended to be a key to optimizations, it just happens that as a result of its definition

Even from the beginning, the C/C++ standards left things undefined specifically to allow compilers to translate code to more efficient machine code. For example, this is why int can be different sizes, and it infects almost every part of those standards. You're right that they were not originally designed to specify an "abstract machine" - that became a necessity later as compilers started being more aggressive - but optimization, and generating more efficient code, drove a large part of the decision-making from the very start.

There's no reason to leave something as UB unless it allows for some optimization, because UB has a significant cost. Instead, if no optimizations can be enabled, it would be better to specify the behaviour, or define a range of reasonable behaviours, but leave the exact choice up to the implementation.

RalfJung commented 3 years ago

Thank you for moving this discussion to a separate thread!

As far as I can tell, Rust does not explicitly define the term

There is a definition of UB in our glossary. This definition coincides with how modern C/C++ compilers interpret UB in their respective languages. (I should add though that the UCG glossary represents UCG consensus, not Rust-wide consensus.)

There is also an excellent blog post by Raph Levien that goes a bit into the history of UB. According to that post, UB in C/C++ used to be more about "we do not want to restrict what hardware does" than about enabling optimizations, but this meaning has shifted over time. In my opinion, UB is a terrible word for how the term is used today, I think something like "language contract" or so is much clearer, but I'm afraid we are probably stuck with it. The concept itself however is great: it is a way for the programmer to convey information to the compiler that the compiler would have no way to infer itself. However, problems arise when the programmer does not realize what information they are conveying. This happens a lot in C/C++ (when a programmer writes + they might not want to convey "I carefully checked that this will not overflow"), and also happens in Rust with some of the more subtle UB, in particular around validity and aliasing.

RalfJung commented 3 years ago

One point, I've noticed that UB has to be justified by the optimizations it enables. I would add that undefined behaviour was never intended to be a key to optimizations, it just happens that as a result of its definition

Historically, UB might not have started as being primarily for optimizations, but over the last few decades that is certainly the case. To give one example, strict aliasing is UB in C and that UB has only one purpose: more optimizations. (Specifically, the story I was told is that C compilers needed to be able to compete with Fortran compilers. Fortran has very strong aliasing guarantees, and the only way they saw to make C competitive is to also have some aliasing guarantees in C.)

In Rust, without the historical baggage of C/C++, we use UB only for optimizations. There are better ways to handle platform and implementation differences, as @Diggsey mentioned. For example, we have little-endian and big-endian platforms, and this is handled by having an explicit parameter in the Rust Abstract Machine defining endianness. So it is not UB to do something byte-level with multi-byte integer types, but results differ per platform. Such differences should obviously be kept to a minimum to make code maximally portable, but there can be good reasons to introduce them. Likewise, integer overflow is defined to either raise a panic or produce two's-complement wrapping results (and in practice this is controlled by compiler flags). In such cases it is important to precisely specify what all the possible cases are, so that programmers can make their code correct with respect to all Rust implementations. This is what sets such platform/implementation differences apart from UB.
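For instance, both of the following are fully defined; the results just depend on an Abstract Machine parameter (endianness) or an implementation choice (overflow handling). A minimal sketch:

fn main() {
    // Endianness is an Abstract Machine parameter: defined behaviour,
    // platform-dependent result.
    let bytes = 0x1234u16.to_ne_bytes();
    println!("{:?}", bytes); // [0x34, 0x12] on little-endian, [0x12, 0x34] on big-endian

    // Overflow is defined to either panic or wrap (two's complement);
    // it is never UB. Asking for wrapping explicitly makes the result
    // independent of compiler flags.
    let x = i32::MAX.wrapping_add(1);
    println!("{}", x); // always -2147483648
}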

In particular this definition permits an implementation which assigns some meaning to undefined behaviour

A compiler implementation could specify what happens for some subset of programs which have UB according to the Rust language. However, this is out of scope when it comes to specifying Rust itself, and it does not mean that the program itself becomes valid Rust.

To add to this, the purpose of the UCG (unsafe-code-guidelines WG) is to specify Rust, not to specify a particular Rust implementation. Basically, the long-term goal of the UCG is to produce something akin to (but better than ;) the C/C++ spec. As far as the spec and therefore the UCG is concerned, programs with UB are just wrong, period. This is the same as in C/C++: the spec does not discuss any such implementation-specific guarantees.

Some members of the lang team have also expressed a preference in the past for not making any extra promises in rustc [the implementation] for things that are UB in Rust [the language]. They want to avoid fragmenting the language into dialects that only work with some implementations. Worse, since there is currently only one implementation, there is a huge risk of any such implementation-specific promise becoming a de-facto guarantee that the entire ecosystem relies on.

Therefore, as far as the rust-lang organization is concerned, programs with UB are beyond salvaging. They are not subject to stability guarantees (or any guarantees really) and they need to be fixed. Implementations could assign meaning to UB programs, but rustc [the implementation] does not. In fact it would be healthier for the ecosystem if alternative implementations (once they exist) do not do so, either, since any such guarantee is an ecosystem split -- programs that run fine in one implementation do not run fine in another. Effectively, if an implementation makes such a promise, then it implements a different language, with a different Abstract Machine. That's why I talked about "dialects".

In practice, rustc [the implementation] will do what it can to help programmers even if their programs have UB, as long as it does not compromise UB-free programs. Usually the goal here is to make the programmer aware of the problem so that they can fix their code. Sometimes we even temporarily take back changes that "break" UB programs until UB-free ways to do things are possible; this happened around unwinding for extern functions. (I put "break" in quotes because, with my spec hat on, UB programs are already broken; the compiler did not do anything wrong.) We do not just ignore the needs of users that have UB in their code, but the goal of that conversation is always to find a non-UB way for them to do what they need to do. None of this really changes the fundamental stance on UB, in particular as far as the spec and the UCG are concerned.

digama0 commented 3 years ago

Does Rust have a category corresponding to C/C++'s "implementation defined" then? It sounds like we would want to avoid it, and as long as "Rust = rustc" it's a bit difficult to distinguish implementation defined from plain old defined behavior.

RalfJung commented 3 years ago

Does Rust have a category corresponding to C/C++'s "implementation defined" then? It sounds like we would want to avoid it, and as long as "Rust = rustc" it's a bit difficult to distinguish implementation defined from plain old defined behavior.

Not yet, mostly for the reason you mentioned. I think such questions will come up way later in the process.

The IMO more interesting other "kind of behavior" to talk about is unspecified behavior, which is closely related. There was an attempt to define it that failed to reach consensus. (That PR should likely be closed and a new one started.) The only real difference between "unspecified" and "implementation-defined" is that for the latter, implementations need to document which choice they make -- so once we have nailed down what "unspecified behavior" means, we have pretty much also covered "implementation-defined"; we just need to decide on a case-by-case basis whether implementations ought to document a choice (and guarantee that choice for future versions of the implementation) or not.

chorman0773 commented 3 years ago

There's no reason to leave something as UB unless it allows for some optimization

Well, aside from permitting implementations with behaviour that diverges from possible specifications. This was a primary reason why signed integer overflow was undefined in C/C++: there were too many possible behaviours, depending on the machine architecture and the signed integer format (which was unspecified until C++20, and I believe C2x does the same). Trapping is strictly not a part of the C or C++ standard, so whenever a potential behaviour is to trap, the behaviour is necessarily undefined.

I would still say it's a good idea to define the term somewhere, so as to avoid issues with interpretation.

However, this is out of scope when it comes to specifying Rust itself, and it does not mean that the program itself becomes valid Rust.

I do agree. However, it is always a good idea to acknowledge that a particular implementation may promise to respond to particular UB in a particular way, and that many implementations may all agree on this meaning (returning to my type-punning unions example), so it is possible to exploit that known extension (one of my personal rules of UB says that Known Compiler Extensions are fine). From the response I got, it seems like it's illegal to "define" undefined behaviour in Rust, even though it may even be necessary to implement the specification itself. In several places in my implementation of libcore, I transmute between &[T], which is technically repr(Rust), and a repr(Rust) type RawSlice<T>; this is strictly UB in Rust, but RawSlice<T> is lang-itemed to define the layout of &[T], so by that knowledge, it's fine. (Compiler-support and standard libraries get to basically have free rein, because if something isn't defined by the language, they can just add an extension that defines it.) One example is that, as a companion to the strict definition, I normally include a note giving a non-exhaustive list of resulting behaviours, and mention in it that an implementation may assign meaning to the undefined behaviour. In my C++ API, in the section defining the term, I have the following note:

Note - Valid responses to undefined behaviour include (but are not limited to) assigning meaning to it, discarding the construct (and potentially surrounding code that leads to it), ignoring the behaviour (potentially causing further issues with well-defined constructs), and causing or reporting an error. - End Note

It may be reasonable to include such a note, or some acknowledgement that a valid implementation may assign actual meaning to the behaviour. In general, yes, you shouldn't invoke undefined behaviour, but sometimes (especially when writing libraries to support the specification/standard) it can become unavoidable.

Worse, since there is only one implementation, there is a huge risk of any such promise becoming a de-facto guarantee that the entire ecosystem relies on.

This is one of the reasons I am working on lccc, so that rustc's implementation does not become the de-facto standard.

it would be healthier for the ecosystem if alternative implementations do not do so, either, since any such guarantee is an ecosystem split.

Sometimes yes, there are some instances where this becomes necessary. For example, low-level code may need further guarantees that the language does not provide, and that is why such extensions exist. Many of lccc's extensions are inherited from the fact that it's designed to be a compiler for Rust, C, and C++ and to successfully handle code that would work with gcc or clang (in particular, support of libstdc++ and libc++ is a goal because of binary compatibility), so many of the rules implemented for Rust are lessened where C or C++ has weaker requirements, and vice-versa.

In Rust, without the historical baggage of C/C++, we use UB only for optimizations. There are better ways to handle platform and implementation differences

As I mentioned above, sometimes it is infeasible to do so, as it requires adding additional requirements to the spec that may end up incredibly vague (see the signed overflow example). If you talk about "trapping", how do you define it so as to cover all the possible ways traps can occur and be handled? Further, strict-aliasing exists because the meaning of values (and pointers, see the fact that reinterpret_cast<T*>(&u) where u has a different type U may not be a bitwise identity) may change arbitrarily depending on types. I have noted that as a result of strict-aliasing, it is possible to compile C and C++ code to JVM bytecode, with a (substantial) support library, without having to emulate memory. Without it, while still maintaining "typed memory", it becomes substantially harder. The union rules exist for the same reason. Both are examples of implementation differences that are not possible to effect within the bounds of defined-but-unspecified behaviour (again, primarily because it would be incredibly stupid to talk about what the heck trapping is, does, and means). Arguably it's better to say something is UB that you have to avoid like the plague, than to add 5 new sections discussing something novel, only for the result to be incredibly vague and completely unreasonable to work with.

digama0 commented 3 years ago

@RalfJung

The IMO more interesting other "kind of behavior" to talk about is unspecified behavior, which is closely related. There was an attempt to define it that failed to reach consensus. (That PR should likely be closed and a new one started.) The only real difference between "unspecified" and "implementation-defined" is that for the latter, implementations need to document which choice they make -- so once we have nailed down what "unspecified behavior" means, we have pretty much also covered "implementation-defined"; we just need to decide on a case-by-case basis whether implementations ought to document a choice (and guarantee that choice for future versions of the implementation) or not.

I don't see why undocumented things would ever be a good idea. Unstable things might be, and really I think most of rustc is currently in that category: all the nightly features are clearly not UB but also not among the (very few!) actually stable and defined behaviors that are in the UCG document.

@chorman0773

Well, aside from permitting implementations with behaviour that diverges from possible specifications. This was a primary reason why signed integer overflow was undefined in C/C++: there were too many possible behaviours, depending on the machine architecture and the signed integer format (which was unspecified until C++20, and I believe C2x does the same). Trapping is strictly not a part of the C or C++ standard, so whenever a potential behaviour is to trap, the behaviour is necessarily undefined.

I agree with Ralf that this was a huge misstep on the part of the C/C++ committees. This really should have been implementation defined behavior (or platform-specific behavior), not undefined behavior. Making things like signed overflow UB makes things much more hazardous for the programmer, and when you couple it with the newly re-imagined UB as license for optimization you have a recipe for disaster.

I do agree. However, it is always a good idea to acknowledge that a particular implementation may promise to respond to particular UB in a particular way, and that many implementations may all agree on this meaning (returning to my type-punning unions example), so it is possible to exploit that known extension (one of my personal rules of UB says that Known Compiler Extensions are fine).

My interpretation is that it is allowed for compilers to extend the language, but it is discouraged, because we would much rather incorporate those extensions into the language itself or come up with some suitable alternative that doesn't require creating language dialects. In particular, if you do fewer optimizations than rustc, or are doing something that matches better with C semantics, and as a result can (and are willing to) make more guarantees about behavior that would normally be undefined, I don't think that would be a problem. But programmers won't really be able to rely on it unless they write only for your compiler.

From the response I got, it seems like it's illegal to "define" undefined behaviour in Rust, even though it may even be necessary to implement the specification itself. In several places in my implementation of libcore, I transmute between &[T], which is technically repr(Rust), and a repr(Rust) type RawSlice<T>; this is strictly UB in Rust, but RawSlice<T> is lang-itemed to define the layout of &[T], so by that knowledge, it's fine. (Compiler-support and standard libraries get to basically have free rein, because if something isn't defined by the language, they can just add an extension that defines it.)

This is not UB, this is dependence on unspecified behavior. All types in Rust have a layout, and if you write code for the layout that actually occurs then that is not UB. So as long as you are willing to live with the lack of stability, you can depend on the layout of repr(Rust) things, as long as you don't guess the type layouts incorrectly (possibly because you are the standard library and thus have control over such things).
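A minimal sketch of that distinction, with a hypothetical RawSlice standing in for the lang-itemed type:

use core::mem;

// A guess at how an implementation might lay out &[u8]. Nothing in
// Rust guarantees this; only an implementation that controls the real
// definition can rely on it.
#[repr(C)]
struct RawSlice {
    ptr: *const u8,
    len: usize,
}

fn decompose(s: &[u8]) -> RawSlice {
    // Unstable but not UB *if* the layout guess is right; UB if it is wrong.
    unsafe { mem::transmute::<&[u8], RawSlice>(s) }
}

// The stable way to get the same information, with no layout guessing:
fn decompose_portable(s: &[u8]) -> (*const u8, usize) {
    (s.as_ptr(), s.len())
}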

This is one of the reasons I am working on lccc, so that rustc's implementation does not become the de-facto standard.

I for one am glad you are doing so. It is easy to get into a mindset that is aligned to the single implementation, and accidentally equate rustc behaviors to Rust behaviors, and I hope that a re-implementation will shake things up.

As I mentioned above, sometimes it is infeasible to do so, as it requires adding additional requirements to the spec that may end up incredibly vague (see the signed overflow example). If you talk about "trapping", how do you define it so as to cover all the possible ways traps can occur and be handled?

I think implementation defined behavior or platform-specific behavior handles this well; on a particular platform or with implementation context, you can say more about what exactly a "trap" entails, for example, and the main spec doesn't have to touch it.

Further, strict-aliasing exists because the meaning of values (and pointers, see the fact that reinterpret_cast<T*>(&u) where u has a different type U may not be a bitwise identity) may change arbitrarily depending on types.

I don't think the meaning can change arbitrarily, at least in Rust. Also AFAIK transmute is always a bitwise identity, although it may still be undecided what it does to shadow state (SB treats it as a no-op right now).

Arguably it's better to say something is UB that you have to avoid like the plague, than to add 5 new sections discussing something novel, only for the result to be incredibly vague and completely unreasonable to work with.

One additional desideratum that rust has for its UB is that it should be dynamically checkable, using Miri. I'm not totally sold on this being an iron rule, but it is definitely a major improvement on the C/C++ situation where there are mines around every corner and no way to know that you have stepped on one until it is far too late. So that is an additional reason why we might not want to throw everything into the UB-bucket, if it involves a property that is not (easily) decidable.

chorman0773 commented 3 years ago

This is not UB, this is dependence on unspecified behavior.

I don't know how correct it is; however, the current Rustonomicon explicitly mentioned that transmuting between non-repr(C) types (which I have taken to specifically mean repr(Rust), because transmute is one of the exact purposes of repr(transparent)) is UB. Apparently this has changed and is no longer the case, but it did previously include that.

This really should have been implementation defined behavior (or platform-specific behavior), not undefined behavior.

My question is how you define trapping. As I mentioned, trapping is outside the bounds of the C++ standard, so it wouldn't be an acceptable choice for unspecified behaviour. On ARM, signed integer overflow causes a hardware trap, so if we simply exclude trapping behaviour, that either requires extra code to support ARM, or ARM is no longer a valid target.

and the main spec doesn't have to touch it.

It would come down to being defined, since a trap would explicitly interact with observable behaviour and whether or not such observable behaviour occurs. The requirement would have to be extraordinarily vague, which is worse than the current status quo; at least we know that overflow is something not to touch.

I don't think the meaning can change arbitrarily, at least in Rust.

The C and C++ standards acknowledge implementations where that is the case. Right now, Rust is impossible (or at least unreasonably difficult) to implement on such an implementation. This isn't necessarily an issue for practical implementations, as many of the affected implementations are theoretical, or things done "for fun" (see the JVM implementation I mentioned).

it is definitely a major improvement on the C/C++ situation where there are mines around every corner and no way to know that you have stepped on one until it is far too late.

My question then becomes: would you rather something be specified as UB, or something be so vaguely specified because the actual specification is unreasonable or impossible, so that it's possible to derive a valid interpretation where it is, in fact, UB (which means the point itself is UB, because compilers love the "best" possible interpretation, as we have established)? Going to the trapping example: if it's left to the platform to decide what "trapping" is, what if the decision is that a result that traps is UB? How would you define "trapping" so that the behaviour of implementations that do trap, but handle traps in particular ways, or may not even have the ability to handle traps, or where whether a trap is possible to handle depends on arbitrary state, etc., is covered such that there isn't a valid interpretation where the result is UB, or effectively UB? I did bring this up in #84 (though I will concede it was off-topic there), where the layout rules of enums are unspecified, but with niche optimizations it was possible to divine an interpretation where unspecified behaviour (relying on the unspecified layout of repr(Rust) enums) could be elevated to undefined behaviour. Specifying something as unspecified behaviour that can result in undefined behaviour is the same as calling it undefined, except now it's hidden behind interpreting the specification with a language-lawyer hat, which is less fun for regular programmers, I'm sure.

Diggsey commented 3 years ago

would you rather something be specified as UB, or something be so vaguely specified because the actual specification is unreasonable or impossible, so that it's possible to derive a valid interpretation where it is, in fact, UB

UB is by definition the vaguest possible specification. Unless there is a reason for the UB to exist, a less vague specification is always better IMO, even if it is still quite vague.

digama0 commented 3 years ago

@chorman0773

I don't think the meaning can change arbitrarily, at least in Rust.

The C and C++ standards acknowledge implementations where that is the case. Right now, Rust is impossible (or at least unreasonably difficult) to implement on such an implementation. This isn't necessarily an issue for practical implementations, as many of the affected implementations are theoretical, or things done "for fun" (see the JVM implementation I mentioned).

Well, Rust doesn't have reinterpret_cast, so it doesn't really matter what that does. Rust has transmute, and this does a bitwise reinterpretation of the value. I'm not exactly sure what about this makes it impossible to implement, unless you are talking about typed memory (in which case, yes, you are probably limited by the ease of doing a bitcast in the typed memory).

Going to the trapping example: if it's left to the platform to decide what "trapping" is, what if the decision is that a result that traps is UB? How would you define "trapping" so that the behaviour of implementations that do trap, but handle traps in particular ways, or may not even have the ability to handle traps, or where whether a trap is possible to handle depends on arbitrary state, etc., is covered such that there isn't a valid interpretation where the result is UB, or effectively UB?

I think it should be a valid option for an implementation to say that an instance of implementation defined behavior is in fact undefined behavior. Overflow seems like a good candidate for that. You can perhaps set flags so that overflow traps (with attendant definition of what this entails on the machine state), or wraps, or is UB.

Of course, if you are writing portable code, then as long as any implementation defines it as UB you the programmer can't trust it to be anything more than that, but you could use #[cfg] flags and such to do the right thing on multiple platforms or implementations.

I did bring this up in #84 (though I will concede it was off-topic there), where the layout rules of enums are unspecified, but with niche optimizations it was possible to divine an interpretation where unspecified behaviour (relying on the unspecified layout of repr(Rust) enums) could be elevated to undefined behaviour. Specifying something as unspecified behaviour that can result in undefined behaviour is the same as calling it undefined, except now it's hidden behind interpreting the specification with a language-lawyer hat, which is less fun for regular programmers, I'm sure.

Right now, writing code for repr(Rust) is very hazardous, for exactly this reason. It's not literally UB if you get it right, but it may as well be for the programmer because very little about it is stably guaranteed. Instead, there are things like the layout API that allow you to access this information in a more portable way, and ideally this would be good enough that there is no reason to make risky guesses about the layout because the safe and stable alternatives are in place.
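For instance (a sketch; by "the layout API" I mean stable entry points like mem::size_of, mem::align_of, and alloc::Layout):

use std::alloc::Layout;
use std::mem;

fn main() {
    // Ask the implementation what it actually chose, instead of guessing
    // what repr(Rust) did.
    println!("{}", mem::size_of::<Option<&u8>>()); // 8 on 64-bit: niche-optimized
    let l = Layout::new::<(u8, u32)>();
    println!("size = {}, align = {}", l.size(), l.align());
}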

@Diggsey

UB is by definition the vaguest possible specification. Unless there is a reason for the UB to exist, a less vague specification is always better IMO, even if it is still quite vague.

I disagree. UB (or the "language contract" as Ralf says) is a contract between the programmer and the compiler. A vague definition helps neither party, and may in fact lead to a miscommunication, which is a failure of the spec. A clear UB is at least informative for the programmer (and may additionally simplify their mental model, so it's not necessarily a negative), and it enables more optimizations for the compiler (and a simpler model is also good for the compiler writer to avoid bugs).

RalfJung commented 3 years ago

Without having time right now to respond to all the points:

In general, yes, you shouldn't invoke undefined behaviour, but sometimes (especially when writing libraries to support the specification/standard) it can become unavoidable.

If the Rust standard library ever invokes UB, that is a critical bug -- please report it if you find such a case. It is certainly avoidable to do so, and it is a huge problem if we do so for all the usual reasons that UB is bad. (There are some known cases of this, but we do consider those bugs that we want to resolve, and efforts/discussions are underway to achieve that.) I think this approach is necessary for building a reliable foundation of the ecosystem. (We could of course do things that are UB in Rust [the language] but not UB in rustc [the compiler]. For the reasons mentioned above, there are no such things.)

It is true that some of our docs are or have been imprecise about the distinction between UB and unspecified behavior, and also sometimes about the distinction between library-level UB and language-level UB. I am trying to fix such cases as I see them.

I disagree. UB (or the "language contract" as Ralf says) is a contract between the programmer and the compiler. A vague definition helps neither party, and may in fact lead to a miscommunication, which is a failure of the spec. A clear UB is at least informative for the programmer (and may additionally simplify their mental model, so it's not necessarily a negative), and it enables more optimizations for the compiler (and a simpler model is also good for the compiler writer to avoid bugs).

I think what @Diggsey meant is not that we should be vague about something being UB or not, but that saying "X is UB" is vague about what happens when X occurs. More vague than any other thing we could say.

chorman0773 commented 3 years ago

I'm not exactly sure what about this makes it impossible to implement, unless you are talking about typed memory

The JVM implementation relies on typed memory, and strict-aliasing to avoid having to emulate memory.

If the Rust standard library ever invokes UB, that is a critical bug

Inherently, the careful use of UB is inevitable in a standard library; but as mentioned, the fact that it's the standard library means it can use UB if it wants, it just needs to get the compiler to do what it needs. It is typically impossible to fully implement the standard library in the language itself: sometimes this means the use of compiler intrinsics, sometimes the use of things strictly specified as UB.

We could of course do things that are UB in Rust [the language] but not UB in rustc [the compiler]. For the reasons mentioned above, there are no such things.

This is the UB I am referring to here. UB in the language, but which an extension of the particular compiler permits.

UB is by definition the vaguest possible specification. Unless there is a reason for the UB to exist, a less vague specification is always better IMO, even if it is still quite vague.

At least a specification of UB is not vague about the fact that it is UB, which is what I was referring to. It's worse if it's not outright said "don't do X, X is UB", and instead you have "the behaviour of X is unspecified", constrained in the vaguest possible way such that a valid interpretation of the constraints allows X to have UB; that means X is UB, it just isn't outright said. This is worse, as I say, because it's harder to realise that it is UB.

I would add that the signed integer overflow UB actually has had real performance benefits for actual code in the field. In a CppCon talk, which I could probably look up if people wanted it, there was some rather hot code that was using unsigned as a loop counter, which had a performance regression when it was compiled on x86_64, and the regression was fixed by changing unsigned to int. This is one of the reasons why the "Signed Integers are Two's Complement" proposal that was approved for C++20 explicitly elected not to define signed integer overflow, when it had the chance to.

also sometimes about the distinction between library-level UB and language-level UB

I'm sure I've made my position on this clear, but for completeness: I really hate the distinction, because it makes it easier to reason about UB (violating rule number 1 in my rules for UB: "Do not reason about UB"). The biggest footgun in C++ is not when people don't know about some arbitrary piece of UB, it's when people think they are smarter than the compiler and try to justify a particular kind of UB (I would know, I've tried this before. It didn't end well, hence my rules of UB).

scottmcm commented 3 years ago

One point, I've noticed that UB has to be justified by the optimizations it enables. I would add that undefined behaviour was never intended to be a key to optimizations

I'm not convinced by that. Certainly for some things it was more about portability, but I think optimizations have been core from the beginning.

My go-to example: One of the very first things that people wanted compilers to do was register allocation for local variables. Without that optimization things would have to be loaded and stored to the stack all over the place, which would be terrible for runtime performance. But doing that requires making certain things undefined behaviour -- int a, b; can't go in registers if (&a)[1] = 2; is defined to update b.
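A rough Rust analogue, as a sketch (the out-of-bounds write is UB; if it were instead defined to update b, then b could never live in a register):

fn main() {
    let mut a: i32 = 1;
    let b: i32 = 2;
    let p = &mut a as *mut i32;
    // UB: writes one i32 past the end of `a`. Precisely because this is
    // UB, the compiler stays free to keep `b` in a register.
    unsafe { *p.add(1) = 42; }
    println!("{}", b);
}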

digama0 commented 3 years ago

My go-to example: One of the very first things that people wanted compilers to do was register allocation for local variables. Without that optimization things would have to be loaded and stored to the stack all over the place, which would be terrible for runtime performance. But doing that requires making certain things undefined behaviour -- int a, b; can't go in registers if (&a)[1] = 2; is defined to update b.

But couldn't this be handled the same way rust does layout optimization? That is, if you are lucky and guess the compiler's playbook then you can safely update b this way but if you miss and hit the wrong thing then it's UB. (And if b is in a register then of course you will miss.)

Lokathor commented 3 years ago

if you are lucky and guess

A lot of things can happen if you're lucky and guess.

Specifically, the outcome of UB might be what you expect. It's always possible that the UB doesn't come back to bite you when using a particular compiler, on a particular set of flags, on a particular ... and so on. But what exactly happens is always up in the air, which is why as a user of the language/compiler you need to avoid UB if you want reliable compilations.

But in that particular example with a and b, I can't imagine much good happening. You can't really reason about b (e.g., eliminating a duplicate load, or holding off on a store) if you're also allowed to access it via a.

comex commented 3 years ago

But couldn't this be handled the same way rust does layout optimization? That is, if you are lucky and guess the compiler's playbook then you can safely update b this way but if you miss and hit the wrong thing then it's UB. (And if b is in a register then of course you will miss.)

In early C compilers? Yeah, it probably could be handled that way.

You might already realize this, but it couldn't be handled that way in modern compilers without needlessly sacrificing optimization potential. As a simple example:

// `a` and `b` are locals; `b` never has its address taken.
if b >= 0 {
    do_something_with(&mut a);
    if b < 0 {
        // Dead code, unless `do_something_with` may reach `b` through `a`.
        do_something_else();
    }
}

Assuming b did not have its address taken, the compiler would like to delete the second if as dead code, since the condition can never pass. (This kind of useless code often shows up after inlining other functions.) But under the "guessing the compiler's playbook is OK" rule, if b happens to get spilled to the stack, do_something_with would then be allowed to reach out and touch b by indexing from a, making the optimization illegal.

comex commented 3 years ago

@chorman0773

Trapping is strictly not a part of the C or C++ standard, so whenever a potential behaviour is to trap, the behaviour is necessarily undefined.

It didn't have to be; it could have been implementation-defined. For example, while the C standard makes most kinds of overflow either undefined or well-defined, there is one exception. If you cast from a larger integer type to a smaller one and the value can't fit into the smaller type, the standard says: "either the result is implementation-defined or an implementation-defined signal is raised." (C11 6.3.1.3.3) This gives the implementation an extraordinary amount of flexibility, while not going all the way to "undefined behavior".
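For contrast, Rust pins the same kind of conversion down completely rather than leaving it implementation-defined. A sketch:

fn main() {
    let x: i32 = 1000;
    // `as` between integer types truncates to the low bits (two's
    // complement); never a trap, never a signal, never UB.
    let y = x as u8;
    assert_eq!(y, 232); // 1000 mod 256
}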

On ARM, signed integer overflow causes a hardware trap

It does not.

chorman0773 commented 3 years ago

That is, if you are lucky and guess the compiler's playbook

Yeah, that's a brilliant idea. It works fairly well in debug, so what can go wrong?

Sarcasm aside, the only time it's OK to use UB is if you are in a situation with a particular compiler, or a particular set of compilers, and you know that the compiler assigns a particular meaning to the particular undefined behaviour, either because you are very closely tied to the compiler (standard library or compiler-support library) or because you have a documented extension (again, see my union example). "Guessing" what the compiler does falls under reasoning about UB.

But doing that requires making certain things undefined behaviour -- int a, b; can't go in registers if (&a)[1] = 2; is defined to update b.

I will concede that one is likely for that UB, but a decent amount of UB in C and C++ has justification beyond that.

either the result is implementation-defined or an implementation-defined signal is raised." (C11 6.3.1.3.3)

A signal wouldn't be the same as a trap; trapping doesn't need to result in a signal.

It does not.

Huh, I thought it was one of the examples where signed integer overflow was trapped at a hardware level (I do know such processors exist).

RalfJung commented 3 years ago

@chorman0773

I would still say it's a good idea to define the term somewhere, so as to avoid issues with interpretation.

I am always in favor of defining terms. :) As mentioned before, UB is defined in our glossary; if you have suggestions for improving that definition, please let us know!

It may be reasonable to include such a note, or some acknowledgement that a valid implementation may assign actual meaning to the behaviour. In general, yes, you shouldn't invoke undefined behaviour, but sometimes (especially when writing libraries to support the specification/standard) it can become unavoidable.

As noted above, we do not want to encourage implementations to actually do that. Also I strongly disagree with it being unavoidable. In Rust, we are avoiding relying on "UB in the spec but the compiler guarantees a specific behavior" (modulo bugs), so we have constructive evidence that it is possible to build a language that way. And this is the way I (and I think I am not alone in the UCG and the lang team to think so) would prefer other Rust implementations to go as well. Certainly I see no reason that we should explicitly cater to another approach. (We shouldn't explicitly forbid it, either, but nobody has been suggesting that.)

I do not think it is the role of a spec to point out that one could derive other, adjusted specifications for it. That is just obviously true for every document. These derived specifications are separate dialects of Rust. The purpose of the UCG is to spec out the "main language", not to figure out the design space for dialects. At least, I personally have little interest in developing such dialects, and I think the UCG has enough on its plate without that additional mandate. And finally, discussion of such dialects should, even when it occurs, be kept strictly separate from the "main language". We should not mix up what ends up in a Rust spec and what ends up in the spec of some derived language dialects that some hypothetical future implementations might choose to implement instead.

Sometimes yes, there are some instances where this becomes necessary. For example, low-level code may need further guarantees that the language does not provide, and that is why such extensions exist.

Again I disagree that this is necessary. So far the approach of the lang team and UCG has always been to instead work with the people writing that low-level code, figure out their needs, and see how we can accommodate them without creating language dialects. I firmly believe that this is the better strategy, and I see no reason to think that it would not work. Both sides (language designers and low-level programmers) gain a lot when we can avoid splitting off a "low-level dialect" of Rust.

Further, strict-aliasing exists because the meaning of values (and pointers, see the fact that reinterpret_cast<T*>(&u) where u has a different type U may not be a bitwise identity) may change arbitrarily depending on types.

There were versions of the C spec before strict aliasing. So no, that is not the reason. C could just specify that when the types of stores and loads do not match, the bit-pattern is interpreted at the other type. C provides ways to do this, e.g. through memcpy, so the spec anyway has to allow that possibility and decide how much it wants to say about it (which is not a lot, but that's fine).

Literally the only reason C has strict aliasing rules is to enable more optimizations. If they removed strict aliasing from the spec, there wouldn't be any gaps or open questions created by that. (There'd be a lot of open questions removed actually.^^) This is also demonstrated by the fact that all major compilers have a flag like -fno-strict-aliasing that opts into "C without strict aliasing", where they guarantee that type-punning loads just re-interpret the bits at the new type (which might be UB or not, subject to the usual rules for type punning that the language needs to have anyway).
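For comparison, here is what the type-punning load looks like in Rust, which has no type-based aliasing restrictions; only validity and alignment matter. A sketch:

fn main() {
    let x: u32 = 0x3f80_0000;
    // No strict aliasing in Rust: reading these four bytes at type f32
    // is a defined re-interpretation of the bits.
    let f = unsafe { *(&x as *const u32 as *const f32) };
    assert_eq!(f, 1.0);
    // The safe spelling of the same operation:
    assert_eq!(f32::from_bits(x), 1.0);
}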

My question is how you define trapping

There could be all sorts of things you can say without saying it is UB -- things like aborting program execution, or signal handlers (which the standard does talk about). A program that traps will not arbitrarily jump into otherwise dead regions of code, which UB might well do; I don't think it would be too hard to come up with a reasonable list of possible behaviors here.

Inherently, the careful use of UB is inevitable in a standard library

You keep saying that, but it is just not true.^^ Rust proves otherwise (modulo bugs).

I would add that the signed integer overflow UB actually has had real performance benefits for actual code in the field. In a CppCon talk, which I could probably look up if people wanted it, there was some rather hot code that was using unsigned as a loop counter, which had a performance regression when it was compiled on x86_64, and the regression was fixed by changing unsigned to int. This is one of the reasons why the "Signed Integers are Two's Complement" proposal that was approved for C++20 explicitly elected not to define signed integer overflow, when it had the chance to.

FWIW, in Rust this particular example does not carry over -- the reason signed integer overflow UB helped here is that people use int for array index arithmetic on a 64-bit machine. In Rust, nobody would do that; you'd use usize, and then there is no longer any performance benefit from making overflow UB.

@digama0

I don't see why undocumented things would ever be a good idea. Unstable things might be, and really I think most of rustc is currently in that category: all the nightly features are clearly not UB but also not among the (very few!) actually stable and defined behaviors that are in the UCG document.

(This was about unspecified vs implementation-defined behavior.) I think there is little point to precisely documenting how rustc currently happens to lay out its structs and enums... but if someone wants to do that work, and if others find it useful, sure. :)

I think it should be a valid option for an implementation to say that an instance of implementation defined behavior is in fact undefined behavior.

I don't think that would be a good idea. At that point programmers have to basically treat this as UB. So this is effectively equivalent to saying it is UB but some implementations making stronger guarantees about it, which is not a good idea for all the reasons mentioned before.

In fact, if you take it as a given that implementations may guarantee specific behavior for UB, then if you allow implementation-defined behavior to be "implemented as UB" you just made it equivalent to UB. So no, I strongly disagree, "UB" should not be on the list of things an implementation may choose from for unspecified or implementation-defined behavior.


I should add that I think there are examples of UB that are not motivated by optimizations but by "there's literally nothing better we can say". For example, taking a random integer, casting that to a function pointer, and calling the function. There is no reasonable way, in the Abstract Machine, to bound the behavior of such a program. But those cases are by far in the minority for UB, both in C/C++ and in Rust. It would be trivial to precisely describe Rust, C, and C++, and to have a Miri-like checker for them, if this was the only kind of UB that we had.

digama0 commented 3 years ago

I don't think that would be a good idea. At that point programmers have to basically treat this as UB. So this is effectively equivalent to saying it is UB but some implementations making stronger guarantees about it, which is not a good idea for all the reasons mentioned before.

I agree that for programmers aiming for full portability to any conforming implementation, this may as well be UB. However it differs from UB in that you can use it selectively if you happen to know more about the particular implementation, e.g. using #[cfg] to use an operation on a platform/implementation for which it makes sense, and avoiding it for "generic" implementations where it is UB without loss of generality.

In fact, if you take it as a given that implementations may guarantee specific behavior for UB, then if you allow implementation-defined behavior to be "implemented as UB" you just made it equivalent to UB. So no, I strongly disagree, "UB" should not be on the list of things an implementation may choose from for unspecified or implementation-defined behavior.

I take your point. Although it makes me wonder what the role of #[cfg] is in the language then: if an operation makes sense for some configurations but not others, and is #[cfg]'d in only on platforms or implementations where it is defined, then is that a valid Rust program? It seems that by your reasoning such an operation would have to be defined as UB in the abstract machine, meaning that the full program, including the #[cfg]s, can't be considered a valid Rust program, even though it only exercises valid implementation-defined behavior (which we have agreed can't be called such because some implementations make it UB).

RalfJung commented 3 years ago

I take your point. Although it makes me wonder what the role of #[cfg] is in the language then: if an operation makes sense for some configurations but not others, and is #[cfg]'d in only on platforms or implementations where it is defined, then is that a valid Rust program? It seems that by your reasoning such an operation would have to be defined as UB in the abstract machine, meaning that the full program, including the #[cfg]s, can't be considered a valid Rust program, even though it only exercises valid implementation-defined behavior (which we have agreed can't be called such because some implementations make it UB).

The Abstract Machine has parameters for things like pointer size and endianness. cfg lets a program query those parameters, so that a single program can run for multiple different choices of those parameters.

Also, I'd say UB is a dynamic, run-time concept, and as such always refers to a program "after cfg expansion"; in that sense cfg is similar to macros -- by the time we think of a program running on the Abstract Machine, both have already been expanded away. For a pre-expansion program we can only ask "for which choices of cfg flags is this program UB?". There we can exploit that some cfg flags correlate with Abstract Machine parameters (see above).
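A sketch of what that querying looks like in practice:

// After cfg expansion, exactly one definition remains; a single source
// program can thus be correct for several choices of Abstract Machine
// parameters.
#[cfg(target_pointer_width = "64")]
fn word_bits() -> u32 { 64 }

#[cfg(target_pointer_width = "32")]
fn word_bits() -> u32 { 32 }

fn main() {
    println!("usize has {} bits", word_bits());
}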

digama0 commented 3 years ago

I agree with all of the above. My question is what if there is an operation which is UB for some configurations and not others (for example, calling a CPU intrinsic). Does the abstract machine need to know about all these configurations, in order to specify them? I was hoping that this could be classed under "implementation-defined behavior" or "platform-dependent behavior" so that the abstract language doesn't need to contain the union of all quirks from platforms it has ever compiled to.

comex commented 3 years ago

There were versions of the C spec before strict aliasing.

Nitpick: No there weren’t. Strict aliasing was already in the first standardized version of C, C89, though most people didn’t know about it until GCC started enforcing it in 2001.

Edit: But it is true that it exists solely to enable compiler optimizations.

chorman0773 commented 3 years ago

if you have suggestions for improving that definition, please let us know!

For the more informative UCG, I think it's good as it is. When Rust does get around to writing a proper specification, something more akin to what C and C++ have, i.e. something like:

Behaviour for which this specification imposes no limitations.

As noted above, we do not want to encourage implementations to actually do that.

It's not necessarily encouraging implementations to do that; it's giving an example of what can happen. I also equally mention that the construct can be ignored/evaluated as-is, potentially interfering with other well-defined constructs. C and C++ both say in a note that a valid response to UB is to assign it some arbitrary meaning. It equally means: if you really want to use this construct, you shouldn't; but seek out your compiler's documentation first, as it may say you can.

And this is the way I (and I think I am not alone in the UCG and the lang team to think so) would prefer other Rust implementations to go as well.

In lccc, we inherit some things that aren't UB from C and/or C++, usually because we don't care enough about the particular optimizations to add further tags saying when certain things are UB and when they are well-defined (conversely, there are some things in C and C++ that are well-defined under lccc because Rust says they are and I don't want to duplicate across them). This isn't horribly new; gcc has a bunch of extensions to both C and C++ that exist in one primarily because the other allows the same (gcc lets you type-pun with unions in C++, and clang does as well, primarily because gcc does). For example:

#[repr(C)]
struct StandardLayout {
    num: i32,
    other_field: f32,
}

fn do_thing(v: &mut StandardLayout) {
    let x = &mut v.num;
    // Cast the pointer to the first field back to the containing struct,
    // then project to the second field.
    let y = unsafe { &mut (*(x as *mut i32 as *mut StandardLayout)).other_field };
}

In SB, that is UB, because you have exceeded the provenance of x. In lccc, however, it's well-defined because of pointer-interconvertibility and reachability rules. Specifically, you can reach v.other_field from v.num because you can reach *v from v.num (as it's pointer-interconvertible). I actually intend to exploit this in my implementation of RefCell, so I can reduce Ref and RefMut to being equivalent to a single pointer (knowing that what is effectively the outer RefCell<T> can be reached from a pointer to the inner T, and the ref-count can be reached from that; the actual RefCell<T> is a repr(Rust) wrapper around the inner value, where I promise the same amount as C++ does for things that aren't Standard Layout Types, which is absolutely nothing). For things like these, the optimization benefit from making them UB is limited, so documenting the extensions makes more sense than hiding them (especially since you could derive them from the IR specification).

You keep saying that, but it is just not true.^^ Rust proves otherwise (modulo bugs).

Fair, I will concede that point. However, standard libraries do sometimes use UB in the language proper, either because they have to, or to be efficient/clever; libc++ and libstdc++ are definite examples (I can't remember exactly where, but I remember seeing some). As mentioned, the Rust standard library implementation for lccc will make use of pointer-interconvertibility for manual layout optimizations. More to the point, standard libraries are in a privileged position where they can make things not UB because they want to do something that is. Same with compiler-support libraries, which are even less likely to be able to avoid UB; that is why they exist. Neither libunwind nor libgcc_s is particularly well-defined when it comes to unwinding (I can't really think of a way to implement stack unwinding at all absent some undefined behaviour, aside from using pure assembly; certainly not for Itanium). This is why I consider standard and compiler-support libraries some of the only exceptions to my otherwise absolute rules of UB, which include good ideas such as "Do not reason about UB" and "Do not rationalize UB".

FWIW, in Rust this particular example does not carry over -- the reason signed integer overflow UB helped here is that people use int for array index arithmetic on a 64-bit machine.

Indeed, though the exact case was that it was using unsigned, which is what caused the performance regression (replacing it with int fixed it). usize is kind of an annoying type in Rust. It works well on well-behaved platforms, like x86-64 and x86. However, saying that the size type (and index type) is the same size as a pointer is kind of wasteful on older platforms like the 65816 (the maximum object size is 65535 bytes, but pointers are 24 bits), and leaves some holes open, since Rust effectively says all distinct object representations of integer types represent a distinct, valid value.

Edit: But it is true that it exists solely to enable compiler optimizations.

reinterpret_cast in C++ notes that pointers of different types can be represented differently (particularly, std::bit_cast<T*>(u) and reinterpret_cast<T*>(u) are not required to have the same value, or even the same object representation). If there is a difference in representation between an int* and a float*, how would you suggest implementing an access of type float to an object of type int? The example I use for this is the JVM implementation, where strict-aliasing allows me to avoid emulating memory. The ability to enable differing implementations is lesser for strict-aliasing than for, say, signed overflow, but it is present.

Also, I'd say UB is a dynamic, run-time concept

It likely is in Rust, though it may depend (especially if Rust introduces implementation-reserved identifiers, which would be nice, since I want to have a synthetic crate in lccc filled with implementation details). In general, it's not. Examples of this include the prohibition in C++ against instantiating standard library templates with incomplete types (yes, the C++ compiler is allowed to format your hard drive while translating the program, which is arguably hilarious). As I say, UB is literal, and it doesn't particularly matter when the UB happens. I also use it in my API as an "escape hatch" from the conformance clause (which states that a conforming implementation must issue a diagnostic for ill-formed programs, but I want my implementation details, and C++ isn't that brilliant when it comes to that).

chorman0773 commented 3 years ago

Adding to this

I think there is little point to precisely documenting how rustc currently happens to lay out its structs and enums

I agree, this is a poor idea

I don't think that would be a good idea. At that point programmers have to basically treat this as UB

There is a term for this: conditionally-supported behaviour. It requires implementations to document when they do not support the behaviour. There are some cases where it can actually be useful; for example, I would like to look into making volatile access to address 0 conditionally-supported with implementation-defined results, as it can be an asset to embedded devs. Unspecified behaviour should certainly not permit undefined behaviour, though. I do agree that most programmers should treat conditionally-supported behaviour as something not to touch (whether being unsupported means UB or a compile error), and it doesn't give a huge benefit over just being UB and letting the compiler decide if it wants to give you whatever behaviour.
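
A sketch of what that embedded use case looks like (the register at address 0 is made up; under current Rust this volatile read of a null pointer is UB, and the proposal above would make it conditionally-supported with implementation-defined results on targets that document support):

use core::ptr;

// Hypothetical memory-mapped register that happens to live at address 0
// on some microcontroller.
fn read_reg_at_zero() -> u32 {
    unsafe { ptr::read_volatile(0usize as *const u32) }
}

fn main() {
    // Deliberately not called: executing this on a hosted target is UB today.
    let _ = read_reg_at_zero as fn() -> u32;
}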

digama0 commented 3 years ago

You have given a bunch of horror-story examples of terrible uses of UB in C/C++, and I don't find them particularly compelling for adoption in Rust. UB at translation time is just really obvious compiler-developer-pandering. We already try very hard to be able to find all uses of UB at runtime, so if there were compile time UB I would expect nothing less than to have a "dynamic checker" for that too; but dynamic checking at compile time is just compile time checking so it just ends up as part of the compiler's workings, and so it's not UB after all.

I don't think that the stock C/C++ wording

Behaviour for which this specification imposes no limitations.

is very good either, because it does not at all elucidate the way in which UB is used, as a dynamical concept of "stuck state" in the abstract machine. In fact, I would be happy with just such a description:

We say that a program has undefined behavior when there is a possible execution trace that ends in a "stuck state", which is where the Rust Abstract Machine has no valid execution step to take (either regular steps or I/O). Compilers are only required to preserve the behavior of programs which do not exercise undefined behavior, and must not introduce undefined behavior, but specific implementations may choose to assign meaning to programs with undefined behavior (with or without attendant documentation).

digama0 commented 3 years ago

in SB, that is UB, because you have exceeded the provenance of x. In lccc however, it's well-defined because of pointer-interconvertibility and reachability rules.

To head off Ralf's exasperated comment: This is fine and your prerogative as the designer of lccc, but not the business of the UCG.

usize is kind of an annoying type in Rust. It works well on well-behaved platforms like x86-64 and x86. However, saying that the size type (and index type) is the same size as a pointer is kind of wasteful on older platforms like the 65816 (the maximum object size is 65535 bytes, but pointers are 24 bits), and it leaves some holes open, since Rust effectively says all distinct object representations of integer types represent distinct, valid values.

I think it is a deliberate choice of Rust to not attempt to accommodate older architectures that differ considerably from modern hardware. We've all seen that C suffers greatly from the baggage that it carries from that era, and no one wants to keep carrying that forward if the processors are no longer in use.

reinterpret_cast in C++ notes that pointers of different types can be represented differently (particularly, std::bit_cast<T*>(u) and reinterpret_cast<T*>(u) are not required to have the same value, or even the same object representation). If there is a difference in representation between an int* and a float*, how would you suggest implementing an access of type float to an object of type int? The example I use for this is the JVM implementation, where strict aliasing allows me to avoid emulating memory. The ability to enable differing implementations is lesser for strict aliasing than for, say, signed overflow, but it is present.

To emulate Rust in the JVM, you almost certainly have to emulate memory. You might be able to do various kinds of program analysis to hoist values out of memory but that's all subject to the as-if rule, and the R-AM works on flat, untyped memory. (Personally, I think that C++'s casting mechanisms are far too complicated. Rust has a simple and intuitive model of memory, even if it makes it harder to concretely represent the memory in other ways.)

Also, I'd say UB is a dynamic, run-time concept

It likely is in Rust, though it may depend (especially if Rust introduces implementation-reserved identifiers, which would be nice, since I want to have a synthetic crate in lccc filled with implementation details).

Why don't you just reserve a crate on crates.io? The standard library and rustc are all stuffed in the std crate, so you could do something similar.

I also use it in my API as an "escape hatch" from the conformance clause (which states a conforming implementation must issue a diagnostic for ill-formed programs, but I want my implementation details, and C++ isn't that brilliant when it comes to that).

This is more interesting. That conformance clause doesn't currently exist in Rust AFAIK, and it does seem odd to me that we should require that you give a diagnostic for use of lccc extensions of Rust. But this is probably best suited for its own issue.

RalfJung commented 3 years ago

@comex

Nitpick: No there weren’t. Strict aliasing was already in the first standardized version of C, C89, though most people didn’t know about it until GCC started enforcing it in 2001.

I stand corrected; thanks for pointing that out.

@chorman0773

When Rust does get around to writing a proper specification, something more akin to what C and C++ have

I honestly don't think "Behavior for which this specification imposes no limitations" is very informative, given how often it misleads people, and it is more useful to talk explicitly about the Abstract Machine and that the implementation expects the programmer to uphold its side of the contract. That is, in my opinion, a better framing and phrasing of UB than what C and C++ do.

We can clarify that as a consequence, there are no limitations to the behavior of a program that violates said contract. In fact I think we already say that:

If it turns out the program does have undefined behavior, the contract is void, and the program produced by the compiler is essentially garbage (in particular, it is not bound by any specification; the program does not even have to be well-formed executable code).

But if you think it is helpful to explicitly say "no limitations" and not just "garbage", that is fine for me, too.

But anyway that is a separate bikeshed.^^

I also equally mention that the construct can be ignored/evaluated as-is, potentially interfering with other well-defined constructs. C and C++ both say, in a note, that a valid response to UB is to assign it some arbitrary meaning. It equally means: if you really want to use this construct, you shouldn't, but you should seek out your compiler's documentation first, as it may say you can.

I would say that this is a case of C/C++ encouraging implementations to assign some arbitrary meaning, which I think we should not do for Rust. But this is getting extremely subjective and we clearly have different positions here, so I doubt we will resolve the dispute by repeating our positions. ;) We'll probably have to agree to disagree, and when it comes to wording the final standard, there'll be more people involved and we can see what they think.

In lccc, we inherit some things that aren't UB from C and/or C++, usually because I don't care enough about the particular optimizations to add further tags saying when certain things are UB and when they are well-defined (conversely, there are some things in C and C++ that are well-defined under lccc because Rust says they are and I don't want to duplicate across them).

I obviously cannot stop you from doing whatever you want with your own project. I think I stated my point for why providing such guarantees to Rust code on some implementations risks an ecosystem split. On the other hand, having a unified semantics with C/C++ does require some very different trade-offs.

What I do not understand is how you think this should affect the UCG. Doing better than C/C++ is explicitly one of my goals, so I'd be quite opposed to any attempt to unify UB with those languages.

However, standard libraries do sometimes use UB in the language proper, either because they have to, or to be efficient/clever. libc++ and libstdc++ are definite examples (I can't remember exactly where, but I remember seeing some).

Yes, this definitely sometimes happens, it's just something we'd like to avoid in Rust proper. Again I cannot tell you how to build your own compiler, so if you think this is a good strategy, I will respectfully disagree and we can go our separate ways. ;)

Looks like you are set on defining a Rust dialect that makes some extra guarantees. I am not terribly happy about that but respect your decision. Again I am not sure how this should impact UCG work -- as long as we don't want to define any behavior that you need to be UB, you should be good, right?

This first came up around validity of references, but given that you must support C-style pointers that point to garbage (I don't think it is UB in C to have a bool* that points to 0x11), your IR clearly has to be able to support "pointers that point to invalid data". In fact, you must certainly support this for Rust raw pointers. Therefore Rust permitting this possibility for references should not impose any new constraints on your language design. Am I missing something?

More to the point, standard libraries are in a privileged position where they can make things not UB because they want to do something that is.

I'd phrase this more carefully... the compiler has to have a uniform notion of UB across all code (otherwise things like inlining are broken). So what standard libraries can do is exploit the knowledge that something is not really UB in the actual language implemented by this compiler, even though the language spec documents it as UB. This is very similar to exploiting knowledge about unspecified implementation details. Code outside the standard library could in principle do the same, but then it would be tied to only work with a particular version of the compiler.

IOW, the privilege of the standard library comes solely from being compatible with exactly one version of the compiler, and being able to rely on undocumented aspects of the compiler because it is maintained by the same people. In contrast, user code has to be compatible with a wide range of compiler versions.

If there is a difference in representation between an int* and a float*, how would you suggest implementing an access of type float to an object of type int?

At least in C, it is legal to do union-based type punning under some circumstances. So the answer is "the same as that".

In C++, the answer is "the same as a reinterpret_cast from int to float".
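
For reference, the Rust-terminology analogue of that union-based punning looks like this (this specific u32/f32 pun is already valid Rust; the values are illustrative):

// The moral equivalent of C's union-based type punning, and of transmute:
union Pun {
    i: u32,
    f: f32,
}

fn main() {
    let f = unsafe { Pun { i: 0x3f80_0000 }.f };
    assert_eq!(f, 1.0); // 0x3f800000 is the IEEE 754 bit pattern of 1.0f32
}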

It likely is in Rust, though it may depend (especially if Rust introduces implementation-reserved identifiers, which would be nice, since I want to have a synthetic crate in lccc filled with implementation details). In general, it's not. Examples of this include the prohibition in C++ against instantiating standard library templates with incomplete types (yes, the C++ compiler is allowed to format your hard drive when translating the program, which is arguably hilarious). As I say, UB is literal, and it doesn't particularly matter when the UB happens. I also use it in my API as an "escape hatch" from the conformance clause (which states a conforming implementation must issue a diagnostic for ill-formed programs, but I want my implementation details, and C++ isn't that brilliant when it comes to that).

I think that's just C++ being silly.^^ There is also some UB in the C preprocessor if I recall. But that is, on a formal/technical level, very different from the kind of UB that is used for optimizations, so they should really not use the same term.

@digama0

To head off Ralf's exasperated comment:

Sorry for that. I tried to tone it down, but clearly not enough. Maybe I should take a break from this thread; I have stated my case.

chorman0773 commented 3 years ago

I don't think that the stock C/C++ wording

You'd be right; it has "international standard" instead of "specification", and I believe it's "poses", not "imposes". Nitpicks aside, absent specific word differences, that's straight from one of the C1x committee drafts of ISO 9899.

Why don't you just reserve a crate on crates.io

That is the plan, though people can still create crates with names not acceptable or reserved on crates.io (and the impact is a lot harder to measure).

I honestly don't think "Behavior for which this specification imposes no limitations" is very informative, given how often it misleads people

It's intended to be the normative wording, defining the term to have the broadest requirements. I'd argue it is never a specification's place to say "you can optimize in this way" unless the optimization affects the observable behaviour of a program subject to it (such as C++'s copy elision or the guaranteed niche-variant optimization applied to Option), or falls under the broader as-if rule (which is actually where most UB-related optimizations originate). The UCG wording can be kept, and the future spec can include an informative note in a similar (though probably reduced) form (both C and C++ do this as well). Talking about a contract between a compiler and a program is generally redundant in a spec/standard, because the document is that contract.

So what standard libraries can do is exploit the knowledge that something is not really UB in the actual language implemented by this compiler

Here I'm referring to something documented by the language as UB: it's still UB; the compiler just happens to have assigned meaning to it, either silently or in a documented way. It's still the same language, as it's something the language permits; defining UB doesn't alter any observable behaviour of a correct program.

Again I am not sure how this should impact UCG work

Primarily it doesn't. It's primarily in seeking clarification that I can do this, and using expository arguments as to why one may seek to do something. There is always the possibility that some UB might get defined in a later version, and that's something compiler devs who write extensions have to live with (though the way Rust currently treats pointers leads me to believe it doesn't necessarily matter that I extend the reachability of a pointer; Rust probably won't interfere). However, in C and C++'s case, particularly the C++ case, when determining whether or not to remove UB, part of the question to compiler devs is presumably "will this affect optimizations" or "are you already doing this in a particular way". Sometimes the latter point can be beneficial. For example, I believe it would be a lesser burden if someone wanted to allow type punning in unions in C++, because it's what at least 2 of the 3 major compilers already do (I don't know if MSVC does). Unifying UB concepts with C++ isn't necessarily horrible when most of that is simply stating that things aren't UB (I can't just make something UB), and, as I say, it's primarily done when the optimizations it enables don't justify extending the IR further (if it were a strict intersection of undefined behaviour, it'd be notably worse than any optimizing compiler for any of the 3, simply because I'd lose both strict-aliasing and reference-aliasing rules).

I don't think it is UB in C to have a bool* that points to 0x11

I believe _Bool is required to be 0 or 1 (represented in an unspecified way); any other object representation may be a trap representation (the equivalent of an invalid value). You can have a pointer to it, but you cannot read it without UB. Raw pointers are different from references, however. I will concede it's probably a lesser concern (I believe there is nothing prohibiting a C++ reference to an indeterminate value), though it only exists in the presence of niches, which really don't exist in C++ references. Resolving the other concern, raised in #77 and continued in #84, however, would be advantageous, as the lack of a resolution actually disables a very nice optimization for niche-optimized types (which is of course reachability analysis for potentially-elided discriminants).

the same as a reinterpret_cast from int to float

So a compile-time error (casts to a floating-point type aren't permitted by reinterpret_cast). Jokes aside, my exact point is that reinterpret_cast is not required to be a bitwise no-op (it's more akin to a subset of as casts than to transmute; the latter is the C++20 std::bit_cast). The exact case is when there is something grossly different about the representation of float* and the representation of int*: a float* produced by reinterpret_cast may not even look like a valid value of either, so how would you even know what int value you point to? How would you even know you actually point to an int value? Strict aliasing is the final piece of the puzzle that allows non-unified storage for object types. Rust doesn't have it, which is reasonable, but it means a non-unified implementation can't really exist.

But that is, on a formal/technical level, very different from the kind of UB that is used for optimizations, so they should really not use the same term

On a formal/technical level, UB is the definition I provided above, behaviour not limited in any way. It is very much intended to use that term. Though, imagining what kind of havoc you could wreak by #define int float, the preprocessor UB is very much justified. There is also quite a bit you can cause when you start querying properties of incomplete class types while completing them (maybe whether or not I have a particular constructor depends on whether the compiler thinks so, and it's defined if it doesn't and SFINAEs out if it does).

no one wants to keep carrying that forward if the processors are no longer in use

Older/different processors are used widely in specialized tasks. I'm working on a platform to write homebrew for the Super Nintendo Entertainment System using modern programming techniques. One goal is to be able to use the safety of rust to avoid some of the more interesting bugs of games written in that era. Another, more abstract, goal is to learn more about the particular processor and how it interacts with high-level programming concepts.

chorman0773 commented 3 years ago

By the way, I would ask how there is any considerable difference between defining a narrow branch of UB and documenting it, and making unstable features available and documenting them. Inherently there is already an ecosystem split, where some people use stable and some people use nightly and opt themselves into future-compat issues; that's the point of unstable features, so people can try them out on nightly and give feedback. And beyond that, implementation details are covered by unstable. For example, the rustc unstable book documents the lang_item feature, even though that's intended never to be stabilized as part of the Rust language. And rustc has a number of less well-documented features (such as the stability_attributes feature, which is not in the unstable book, and rustc_private, which simply links a tracking issue). I'd argue that defining UB is similar to defining features. You generally shouldn't use them, but you can if you wish or need to, under the full knowledge that you potentially lose portability and future compatibility.

digama0 commented 3 years ago

On a formal/technical level, UB is the definition I provided above, behaviour not limited in any way. It is very much intended to use that term.

This is not correct though, as it does not explain the acausal nature of UB based optimizations. Undefined behavior in C/C++ is not "behavior" at all, it's an implied precondition that is used, among other things, for optimization.

I think that the thrust of Rust UB being dynamic is that it is actually a behavior: there is an execution semantics, and from that semantics the concept of "stuck state" is well defined; a program has UB if it reaches a stuck state under some choice of the available nondeterminism. I don't see any other way this can be formally cashed out, and I would want a spec to go along roughly those lines. In particular, as mentioned earlier, this implies that the compiler does not ever have the freedom to miscompile a program with too much X in it, or one that tries to break the parser, or violates the ODR, or anything like that. I'm not sure how close we are to that ideal, but I think that would be a huge improvement over the C/C++ status quo.

no one wants to keep carrying that forward if the processors are no longer in use

Older/different processors are used widely in specialized tasks. I'm working on a platform to write homebrew for the Super Nintendo Entertainment System using modern programming techniques. One goal is to be able to use the safety of rust to avoid some of the more interesting bugs of games written in that era. Another, more abstract, goal is to learn more about the particular processor and how it interacts with high-level programming concepts.

Of course there are downsides to this kind of abandonment of older and more esoteric processors, especially in the embedded space where processor designs vary a lot more. Certainly that isn't a decision that UCG can make though, as it basically comes down to the tier-1, tier-2 etc platforms supported by the language and compiler. It's a bit premature to talk about language spec evolution, but I can imagine that the spec can be changed to accommodate future platforms if compiler support improves enough to make it a reasonable prospect (and the cost of the spec change is considered better than the alternatives).

RalfJung commented 3 years ago

It's intended to be the normative wording, defining the term to have the broadest requirements.

The requirements are just as broad when phrasing it as a precondition that the compiler assumes to be satisfied by all program executions. The difference is that the Abstract Machine definition gives some insight into what (most) UB actually is and how it can be made formally precise when giving a mathematical spec (as opposed to an informal spec in English such as the C/C++ standards). That's why I think it is better than the C/C++ wording.

I'd argue it is never a specification's place to say "you can optimize in this way"

We agree on that. The specification says "here's an Abstract Machine that defines how exactly a program is permitted to behave"; the job of the compiler is to ensure that the compiled program only exhibits behaviors that are permitted by the abstract machine.

The weird thing about UB in C/C++ is that it lives separate from the abstract machine, when really it can be made fully precise by letting the abstract machine say when a program has UB. That is what researchers do when they try to make C/C++ mathematically precise. It is the tried-and-proven way to actually do formal verification of C/C++ programs. For Rust I propose we do this work ourselves instead of expecting others to do it. We should explicitly spell out that Abstract Machine, its state space, its transitions, and when it considers a program state to be invalid / UB / "a stuck state".

If C/C++ did that, a lot of ambiguity would be resolved. The C/C++ spec is a wishlist for a lot of properties that an Abstract Machine can have (and it is not even clear that a single Abstract Machine can have all these properties at the same time) -- this is called an axiomatic specification. For Rust I think we should instead just say what the Abstract Machine is -- this is called an operational specification.

So a compile-time error (casts to a floating-point type aren't permitted by reinterpret_cast). Jokes aside, my exact point is that reinterpret_cast is not required to be a bitwise no-op (it's more akin to a subset of as casts than to transmute; the latter is the C++20 std::bit_cast). The exact case is when there is something grossly different about the representation of float* and the representation of int*: a float* produced by reinterpret_cast may not even look like a valid value of either, so how would you even know what int value you point to? How would you even know you actually point to an int value? Strict aliasing is the final piece of the puzzle that allows non-unified storage for object types. Rust doesn't have it, which is reasonable, but it means a non-unified implementation can't really exist.

Okay, I clearly do not know enough about C++ here. I thought the entire point of reinterpret_cast was to be a NOP, like transmute in Rust. The C++ standard these days is so extremely complicated, I have long given up trying to follow its details. Also I am confused why you are talking about reinterpreting float* and int* when I was talking about float and int. To be clear, what I am saying is that without strict aliasing, this code:

int load_as_int(float *x) { return *(int*)x; }

should have the same behavior as

int load_as_int(float* x) { return reinterpret_cast<int>(*x); }

The latter avoids strict aliasing violations by doing the memory access at type float.

But for C, the standard permits type punning through unions under some conditions. Without strict aliasing, the same rules could be applied for all type-punning loads. You say that some C++ compilers also permit union-based type punning; whatever the spec for that is could be used for all sorts of type punning (or "transmute" in Rust terminology).

Here's an easy way to develop a spec for (a variant of) C: try writing an interpreter for it. Always do the "obvious" thing. When there is no obvious thing to do (such as invoking a function pointer that points to data), make the interpreter raise an error. Sometimes we might want to give a finite list of options instead, e.g. for overflow. Sometimes you might want to add platform parameters, such as the size and representation of integer types. Once the interpreter works, translate its behavior into English and translate all its error conditions into "this is UB". If you do this, you will end up with a version of C that has a bit of UB but much less than what real C has. In particular, this interpreter needs no pointer provenance and no strict aliasing. All the remaining UB in real C exists solely to enable optimizations; supporting it requires adding extra state and extra checks to the interpreter until it makes enough programs UB that optimizations do not affect the behavior of the remaining UB-free programs. Making this kind of UB precise in the Abstract Machine is hard, but it is exactly what one needs to do when one wants to actually give a good specification of UB.
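
A toy illustration of that methodology (all names invented): the interpreter's explicit error states are exactly the UB of the resulting language.

#[derive(Debug)]
enum Value {
    Int(i32),
    FnPtr(usize),
}

#[derive(Debug)]
enum Outcome {
    Ok(Value),
    Ub(&'static str), // "no obvious thing to do" becomes UB in the spec
}

fn call(v: &Value) -> Outcome {
    match v {
        // Stand-in for looking up and running the function:
        Value::FnPtr(f) => Outcome::Ok(Value::Int(*f as i32)),
        // Invoking data as a function: no rule applies, so this is UB.
        Value::Int(_) => Outcome::Ub("called a non-function value"),
    }
}

fn div(a: i32, b: i32) -> Outcome {
    if b == 0 {
        // Alternatively: pick from a finite list of allowed options here.
        Outcome::Ub("division by zero")
    } else {
        Outcome::Ok(Value::Int(a / b))
    }
}

fn main() {
    println!("{:?}", div(6, 2));            // Ok(Int(3))
    println!("{:?}", call(&Value::Int(3))); // Ub("called a non-function value")
}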

I call a specification "good" if it is useful for reasoning, i.e., if I can use it to conclusively demonstrate ("prove") that a given program satisfies all the requirements imposed by the specification. That is ultimately the main purpose of a specification: to serve programmers, so that once they have verified their code against the spec, they can be sure that every conforming implementation will run their program correctly. The C/C++ specifications are not good since they fail this test -- the requirements are not laid out precisely enough to be useful for such an argument. That's why people can discuss forever whether a certain program has UB or not. That's one reason why doing precise formal analysis of C/C++ code is extremely hard to impossible -- every new paper and tool basically needs to make up its own new Abstract Machine, and they are all incorrect in some ways. This situation can be avoided by making a spec sufficiently mathematically formal that we do not need discussions any more; we can just carry out proofs. An excellent example of how to do this is the specification for WebAssembly. They even have a reference interpreter! And this is also what we are striving for in Rust, or at least I am. It is the only way that we know actually leads to an unambiguous specification that is suitable for programmers' needs when verifying their code.

Hence I think we should avoid vague terminology like how C/C++ define UB, and instead concretely point out how the Abstract Machine has error states, how the implementation may assume that programs never reach those error states, and how programs that do reach the error states are said to have "Undefined Behavior". This is a definition that can be turned into maths, which is the level of precision we need for what I call a "good" specification.

C/C++ are in the unfortunate situation of carrying along a lot of luggage, on many levels. I am not blaming them for getting into the situation they are in. But for a new language such as Rust, I think we can do better -- in fact I think it would be embarrassing not to do better. We (the academic PL research community) learned a lot about formal methods for language specification over the last decades; let's use that knowledge and not ignore it.

I believe _Bool is required to be 0 or 1 (represented in an unspecified way), any other object-representation may be a trap representation (equivalent of invalid value).

I was talking about bool* though (or _Bool* I guess).

You can have a pointer to it, but you cannot read it w/o UB.

That is exactly what I am proposing for &bool in the "reference validity" discussion: the pointer may point to bad data, but if you then read it, that's UB.
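
In code, that proposal looks like this (a sketch; the commented-out load is where the UB would be, not the pointer creation):

fn main() {
    let byte: u8 = 0x11;
    // Creating the pointer to bad data is fine...
    let p = &byte as *const u8 as *const bool;
    // ...reading a bool through it is the UB, so it stays commented out:
    // let b = unsafe { *p }; // UB: 0x11 is not a valid bool
    let _ = p;
}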

Raw pointers are different from references however.

Indeed. References have to be dereferencable (i.e., point to allocated memory) and have aliasing restrictions; raw pointers do not.

Resolving the other concern, raised in #77 and continued in #84, however, would be advantageous, as the lack of a resolution actually disables a very nice optimization for niche-optimized types (which is of course reachability analysis for potentially-elided discriminants).

If you have a concrete optimization that is powerful enough to justify such UB, please post it in #84. Also I don't think you mean what is usually called reachability analysis in the literature.

Though, imagining what kind of havoc you could wreak by #define int float, the cpp UB is very much justified.

No, I do not think it is. Behavior should just be the exact same as what would happen if I manually replaced int by float everywhere. That is likely UB, but maybe it isn't. Either way there is no reason at all to add a new UB clause for this.

comex commented 3 years ago

reinterpret_cast in C++ notes that pointers of different types can be represented differently (particularly, std::bit_cast<T*>(u) and reinterpret_cast<T*>(u) are not required to have the same value, or even the same object representation). If there is a difference in representation between an int* and a float*, how would you suggest implementing an access of type float to an object of type int? The example I use for this is the JVM implementation, where strict aliasing allows me to avoid emulating memory. The ability to enable differing implementations is lesser for strict aliasing than for, say, signed overflow, but it is present.

Strict aliasing does not care about the object representations of pointers (except coincidentally when the object being aliased is itself a pointer, e.g. casting int ** to float **). Rather, it determines whether reinterpret_cast<T *>(u) is legal to dereference.

For example, if u points to any valid non-zero-sized object, reinterpret_cast<char *>(u) is always legal to dereference (because character types are an exception to strict aliasing), but bit_cast<char *>(u) may or may not be (depending on the platform's choices of object representation).

Perhaps you could elaborate on your JVM implementation strategy; I am not sure what you mean by "avoid[ing] emulating memory". I suppose strict aliasing would allow you to have, say, ints and floats in totally separate address spaces – e.g. maybe int * is represented as an index into a JVM array of ints, and float * is represented as an index into a separate JVM array of floats. But since char * can access any type, each dereference of a char * would have to figure out which array to index into, which sounds really slow.

comex commented 3 years ago

I think that's just C++ being silly.^^ There is also some UB in the C preprocessor if I recall. But that is, on a formal/technical level, very different from the kind of UB that is used for optimizations, so they should really not use the same term.

On a formal/technical level, UB is the definition I provided above, behaviour not limited in any way. It is very much intended to use that term. Though, imagining what kind of havoc you could wreak by #define int float, the preprocessor UB is very much justified.

No, I do not think it is. Behavior should just be the exact same as what would happen if I manually replaced int by float everywhere. That is likely UB, but maybe it isn't. Either way there is no reason at all to add a new UB clause for this.

There are two axes here.

A. Should #define int float result in UB in the generated program?

In C++: Maybe not by itself. But if you #define int float and then #include a standard library header, that basically has to be library-UB. Any particular implementation of the standard library may or may not actually invoke language-UB if included with such a definition present, but if it does, it's not a bug in the implementation.

Similarly, even if you #include all your headers before doing #define int float, if you call any standard library functions after the #define... well, the C spec generally allows standard library functions to be implemented as macros, and the macro expansion would be affected by your #define. So this too needs to be library-UB.

Rust, of course, has less of an issue with this, since it uses a proper module system rather than textual inclusion, and even macros are somewhat hygienic. But it certainly has plenty of library-UB.

That said, C does have UB for some things, like not ending a source file in a newline, that should definitely not be UB.

B. Should #define int float result in UB at compile time?

i.e. is the compiler allowed to format your hard drive?

Perhaps not. But there are extremely pathological cases like, in Rust:

#[link_args = "-Wl,-Map=/arbitrary/path"]
extern {}

which will cause the linker to try writing a map file to /arbitrary/path, which might be your hard drive block device :)

And in theory a C compiler might define some similar construct as an extension, and including a standard library header with weird definitions could somehow cause one to be accidentally invoked.

Even in Rust, as it stands today, there could hypothetically be some __foo_macro_implementation_detail! defined by the standard library which, if invoked with the wrong arguments, could produce such a declaration. But it would be nice (if mostly inconsequential in practice) to rule that possibility out – to guarantee that no such implementation details are accessible from normal crates. I think that would require better macro hygiene.

chorman0773 commented 3 years ago

The latter avoids strict aliasing violations by doing the memory access at type float.

What I mentioned was that the particular cast does not actually compile. reinterpret_cast only allows a particular set of conversions, and float to int is not among them (the exhaustive list is pointer -> equally or more cv-qualified pointer, pointer -> integer of sufficient size, integer -> pointer, and nullptr_t -> integer or any pointer type). There is nothing in the C++ standard that implies it has to be a bitwise no-op. Conversely, std::bit_cast (in C++20) is more akin to transmute (specifically, transmute_copy, though it retains the same-size requirement and adds a requirement that the source and destination be trivially copyable), in that std::bit_cast<U>(t) produces a value of type U which is represented by the object representation of the object referred to by t (with the obvious caveats transmute has: if multiple such values exist, the value returned is unspecified; if no such value exists, the behaviour is undefined).

Behavior should just be the exact same as what happened if I manually replaced int by float everywhere

Well, specifically, the UB is if you define a reserved identifier and then include a standard library header (or use any name declared in a standard library header). I.e., where you are not privileged to redefine the functions as you would be when doing it manually. It would be the same if a Rust program used some mechanism to replace every instance of i32 with f64 in libstd (thus altering the ABI), but did not replace the actual binary code. C++ doesn't allow the standard library to just cause undefined behaviour when it feels like it, even when the user does something wrong. It's only allowed to cause the UB the standard says (and there is no such thing as "delayed" UB in C++). It may be reasonable to make it ill-formed, no diagnostic required, but that is effectively UB (in that, if the program is successfully translated and then subsequently executed, the behaviour is undefined).

Perhaps you could elaborate on your JVM implementation strategy

An incredibly complicated web of pointer types, with 6 simple primitive operations: read, write, call_site (for function pointers), reinterpret_as, offset_at, and difference. If you reinterpret_cast to something you aren't allowed to dereference, you get a MiscastedPointer (whose reads/writes throw exceptions). Beyond that, at last count I have 19 different types for representing different pointers (though one is the abstract PointerBase, which just defines the primitive operations and defaults for them). In some cases, yes, it can be stored in an array. I make absolutely no promises about the speed or efficiency of such an implementation. To explain how everything works would take considerable time, and is out of scope for the thread really. But this is how I exploit everything C++ allows, and it's the undefined behaviour which allows such an interpretation.

If you have a concrete optimization that is powerful enough to justify such UB, please post it in #84. Also I don't think you mean what is usually called reachability analysis in the literature.

I've mentioned the particular optimization there previously: I want to assume pointers to the inner field of an Option<T> or similar cannot access the discriminant field (i.e., pointers to the inner field cannot reach the discriminant, and thus cannot change the variant), even when they physically overlap in memory because of niche-variant optimization (including the mandatory cases).
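
A hedged before/after sketch of that (function names invented; the fold in after is only sound under the proposed rule that pointers to the payload cannot reach the discriminant, which Rust today does not guarantee):

unsafe fn before(x: &mut Option<&'static u32>, f: unsafe fn(*mut &'static u32)) -> bool {
    debug_assert!(x.is_some());
    // Pointer to the payload; with niche optimization it overlaps the
    // discriminant in memory.
    let payload = x as *mut Option<&'static u32> as *mut &'static u32;
    f(payload);
    x.is_some() // re-reads the niche-encoded discriminant
}

unsafe fn after(x: &mut Option<&'static u32>, f: unsafe fn(*mut &'static u32)) -> bool {
    debug_assert!(x.is_some());
    let payload = x as *mut Option<&'static u32> as *mut &'static u32;
    f(payload);
    true // under the proposed rule, f could not have changed the variant
}

unsafe fn store_ref(p: *mut &'static u32) {
    static X: u32 = 7;
    *p = &X;
}

fn main() {
    let mut v: Option<&'static u32> = Some(&0);
    unsafe {
        assert!(before(&mut v, store_ref));
        assert!(after(&mut v, store_ref));
    }
}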

RalfJung commented 3 years ago

@comex Ah sure, if we are talking about library UB here that's a different situation. A C/C++ library can hardly protect against such "preprocessor attacks" so it pretty much has to declare them UB.

@chorman0773

I've mentioned the particular optimization there previously: I want to assume pointers to the inner field of an Option<T> or similar cannot access the discriminant field (i.e., pointers to the inner field cannot reach the discriminant, and thus cannot change the variant), even when they physically overlap in memory because of niche-variant optimization (including the mandatory cases).

By "concrete optimization", I mean explicitly showing the source code before and after the transformation. (After the transformation might be in some IR as optimizations are not typically done as src-to-src transformations on surface Rust.) It's much easier to talk about specifics than trying to interpret your very general wording. But, I think I get the idea; the examples at the top of https://github.com/rust-lang/unsafe-code-guidelines/issues/84 go in that direction.

Also note that the discriminant of a niche-optimized enum is not a "field" in usual Rust speak, so you are already diverging from our usual terminology here. In fact I think that is the entire source of our disagreement here: I don't think memory should be structured beyond the level of allocations, i.e., inside an allocation, all there is is a list of (abstract) bytes (potentially uninitialized or with attached provenance). But nothing of the "type structure" of the language is reflected in the memory itself. Having to handle the discrepancy between such an "object-based" view and a byte-level view (which is accessible via char*) is one of the things that make the C/C++ memory models extremely complicated, and IMO unnecessarily so. This PhD thesis explains in great detail what it would take to make this precise; the fact that the C/C++ standards do not spell out all the details shown in that thesis shows how much work is left to precisely specify C/C++. I know that C++ went much further here than C (in terms of making things even more abstract), but as you noticed, I am not familiar with exactly how much further they went (I had no idea you cannot reinterpret_cast between floats and ints) -- and already what C does I consider going way too far. In Rust we certainly want to permit transmuting between float and int, so probably C is a better language to compare with than C++. I also am more familiar with C, so I am going to stick to C henceforth.

You did not complain about my suggestions to say that C without strict aliasing can be defined to handle loads similar to union-based type punning, so I assume that that works out fine.

The experience with C shows that no matter what the standard says, programmers will assume that they can use such low-level idioms for their programs. That is the entire reason why they are using these languages to begin with. I think we would be doing programmers a disservice by unduly diverging from that view. We need to do something "weird" for uninitialized memory, and we need to do something "weird" for provenance, otherwise we are not competitive in terms of performance. (I admit I do not have good evidence for these claims, they are based on anecdotes and "seeing what everyone else does".) But I do not see remotely enough evidence that we need to do more than that. In fact, I see evidence to the opposite: LLVM does not have such an "object-based memory model"; LLVM allocations are "flat lists of abstract bytes". And yet LLVM is able to deliver competitive performance for C. I see this as conclusive evidence that the object-based memory model is an unnecessary complication and should be discarded.

To give a concrete example: C special-cases char* to permit doing byte-wise accesses to memory, disregarding any "object structure" or so. In Rust, we have no intention of special-casing any particular type like this. Every type should be permitted to do what char* does in C, i.e., reinterpreting arbitrary parts of memory at the given type (subject only to aliasing restrictions when references are involved). People will do this anyway even if we tell them not to, and LLVM shows that we can deliver good performance without such special cases, so all evidence IMO shows that this is the better approach.

chorman0773 commented 3 years ago

Ah sure, if we are talking about library UB here that's a different situation

Yes, this is effectively what you call library UB. The clause is that if you define or undef a name that is a reserved identifier, a context-sensitive keyword (like override or final), a name in any standard library header, or the name of a standard attribute (with two exceptions: likely and unlikely may be defined as function-like macros), in the same TU as you include a standard library header, it's UB. You can, in fact, #define int float as much as you want on your own time, though you aren't doing much in that file.

You did not complain about my suggestions to say that C without strict aliasing can be defined to handle loads similar to union-based type punning

Yes, the C memory model can be divorced from strict-aliasing, and the behaviour assigned in any way really. Reinterpreting object representations is a valid such assignment, and this is (in fact), what lccc intends to use, absent the pointer attribute which enables strict-aliasing based optimizations.

Also note that the discriminant of a niche-optimized enum is not a "field" in usual Rust speak, so you are already diverging from our usual terminology here

This is how it's translated. I'm not saying it's a field Rust code can directly access, but it's a field inside a struct inside a union (actually 2 identical fields in the common initial sequence of a single union value) when the enum definition is translated. That makes it easy to deal with in generic contexts. lccc does generic instantiation as one of the last steps before codegen (with a few reduction passes afterwards, including name mangling and type replacement, which is where the actual niche optimization happens).

transmuting between float and int

C++ does allow this with std::bit_cast. As mentioned, reinterpret_cast is closer to some as conversions, with std::bit_cast being similar to transmute_copy, with transmute's size check. It comes with all the guarantees of transmute_copy as to the well-definedness of the conversion. You can do std::bit_cast<float>(0) and get out whatever (int)0 looks like as a float (which should be +0.0 on IEC 559).
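
The same round trip in Rust, for comparison (transmute, or the safe f32::from_bits wrapper):

fn main() {
    // (int)0 reinterpreted as a float: the all-zero bit pattern is +0.0
    // under IEC 559 / IEEE 754.
    let x: f32 = unsafe { core::mem::transmute::<i32, f32>(0) };
    let y = f32::from_bits(0); // the safe, dedicated API for the same thing
    assert_eq!(x, 0.0);
    assert_eq!(y, 0.0);
}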

I see this as conclusive evidence that the object-based memory model is an unnecessary complication

Possibly, though it makes it easier to reason about memory accesses, particularly atomic ones, and, in my opinion, it makes UnsafeCell less magic (it's still magic, but it's less magic). It also makes it less magic to talk about types, such as pointers, where multiple distinct values share a single object-representation.

Diggsey commented 3 years ago

This is how it's translated. I'm not saying it's a field Rust code can directly access, but it's a field inside a struct inside a union

An implementation that treated enum discriminants as a field would not be compliant with the Rust language, since Rust guarantees that Option<&T> has the same layout as &T.

comex commented 3 years ago

lccc does generic instantiation as one of the last steps before codegen (with a few reduction passes afterwards, including name mangling and type replacement, which is where the actual niche optimization happens).

(...does that mean you don’t do inlining after generic instantiation?)

chorman0773 commented 3 years ago

that mean you don’t do inlining after generic instantiation

Generic instantiation here refers to a multistep cycle including instantiation, type checking, and inlining. All inlining done by lccc itself is done during generic instantiation. It also passes off to a code generator, which is currently LLVM, and which does its own optimization passes (however, an end goal is to not be limited to using a 3-step codegen with an IR generator, after xlang is done doing whatever).

An implementation that treated enum discriminants as a field would not be compliant with the Rust language, since Rust guarantees that Option<&T> has the same layout as &T.

It ends up this way during type lowering. Option is translated with an attribute, niche_optimize, which allows (and requires) the compiler to merge the discriminant field with the payload. However, I still treat the discriminant and payload as (logically) distinct fields for the purposes of reachability checking (but see my point in #84 for possible blockers). The primary effect is that casting a pointer *mut T to a pointer *mut Option<T> does not yield a pointer to the outer Option (but you can still write an Option<T> value through it, because of how it handles mistyped accesses).

RalfJung commented 3 years ago

I forgot to respond to this earlier point, @chorman0773

Here I'm referring to something documented by the language as UB: it's still UB; the compiler just happens to have assigned meaning to it, either silently or in a documented way. It's still the same language, as it's something the language permits; defining UB doesn't alter any observable behaviour of a correct program.

It is very much a different language! A language is defined by (among other things) its Abstract Machine. If you change which programs are UB, you changed the Abstract Machine, and thus you changed the language. Programs that have meaning in one language are meaningless in the other.

The languages are related, though. We could say that the language with less UB refines the language with more UB (alluding to the standard correctness notion of (behavioral) refinement, which you might know as "the as-if rule"). If your compiler implements a language different from Rust that refines real Rust, it is also a correct implementation of Rust. But when we take a principled approach and try to, e.g., formally prove correctness of that compiler, we have to account for this difference in UB. This is easy to see: if we tried to verify your UB-exploiting standard library against Rust, we would fail, as it has UB! So to be able to verify its correctness, we have to precisely define the language that the compiler actually implements, and then verify the standard library against that. And even if you do not want to give a proper proof, you need to do most of that even to just argue for correctness -- which is the least I would expect from a compiler.

RalfJung commented 3 years ago

it makes it easier to reason about memory accesses

Easier for whom? The one whose reasoning we should make easier is the programmer. And I think their life is made considerably harder by complicating memory this much -- they now have a ton of extra state to consider!

Ease of reasoning for compiler authors is only a secondary concern to me. And even for them -- I would say a language that requires all this machinery is harder to reason about than one where memory is flat.

My impression is that this "object structure" seems like it helps intuition, but any attempt to underpin this intuition with more precise reasoning shows that the intuition is quickly misled, and the actual reasoning you have to do is way more complicated.

chorman0773 commented 3 years ago

Easier for whom

Fair, I should have mentioned this was my opinion. Reasoning about objects as a collection of bytes is fine locally, but when you get into atomic operations and multithreading, it can be nicer to reason about indivisible units which atomic accesses apply to. Same, to a lesser extent, with volatile accesses, especially since volatile operations are observable side effects, so you have to define what that side effect is. If you talk about the read of n bytes, is that n reads of 1 byte (and thus n side effects), or 1 read of n bytes (1 side effect)? The object model gives an answer: a side effect on a scalar is either a store to that scalar object, or a load or a store made through a volatile lvalue.
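
For illustration, the two readings of "a volatile access of n bytes" (a sketch; which form a 4-byte volatile access means is exactly the question the object model answers):

use core::ptr;

unsafe fn volatile_reads(addr: *const u32) {
    // 1 read of 4 bytes: one side effect on one scalar object.
    let _whole: u32 = ptr::read_volatile(addr);

    // 4 reads of 1 byte each: four distinct side effects.
    let bytes = addr as *const u8;
    for i in 0..4 {
        let _byte: u8 = ptr::read_volatile(bytes.add(i));
    }
}

fn main() {
    let x: u32 = 0xDEAD_BEEF;
    unsafe { volatile_reads(&x) } // volatile reads of ordinary memory are fine
}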

It also makes UnsafeCell<T> not horribly arcane magic (only regular magic; someone could implement the same in their own non-portable crate using #[__lccc::xlang_field_attributes(mutable)] if they opt in to the lccc_instrinsics_crate feature; note: please don't do that).

chorman0773 commented 3 years ago

This piece of code was brought to my attention earlier today, https://github.com/rust-lang/rust/blob/master/library/std/src/io/mod.rs#L373.

While, based on the disposition of #77, this is not, in fact, undefined behaviour, I would point out that this actually favours my argument that the standard library is more likely to need to contain strict undefined behaviour that isn't because of compiler shenanigans.

This argument actually comes from the fact that the standard library is supposed to be privileged. I would assume that one just can't implement the standard library in user code, or it's not necessarily a useful standard library. Though certainly in Rust's case, it seems to favour use of unstable features over outright UB.

Diggsey commented 3 years ago

I would point out that this actually favours my argument that the standard library is more likely to need to contain strict undefined behaviour that isn't because of compiler shenanigans.

I don't think this means all that much, but if anything it indicates the exact opposite since it's not UB?

This argument actually comes from the fact that the standard library is supposed to be privileged.

#![no_std] is a thing. You might be able to apply that logic to libcore, but anything above that is supposed to be implementable without having to make assumptions about the compiler internals.

chorman0773 commented 3 years ago

I don't think this means all that much, but if anything it indicates the exact opposite since it's not UB?

The comment left indicates that when it was written it was at the very least considered undefined (whether or not it is depends on #77).

You might be able to apply that logic to libcore

libcore is part of the standard library. Also, there are some parts of both liballoc and libstd that require compiler support. In particular, Rust's alloc::boxed::Box is a lang item (but unless it looks like box_syntax or box_patterns are going to be stabilized, lccc's equivalent is not). Lang items depend on the internals of the compiler. lccc makes some things lang items (and has some internal lang items), some things intrinsics, provides implementations for some things, and pulls some things out of its synthetic crate, whether or not rustc's implementations do the same (an example of the third is that neither ::core::cell::UnsafeCell nor ::core::mem::MaybeUninit is a lang item, and MaybeUninit is implementable as-is in user space, except that it uses a const core::mem::zeroed).

bjorn3 commented 3 years ago

In particular, Rust's alloc::boxed::Box is a lang item (but unless it looks like box_syntax or box_patterns are going to be stabilized, lccc's equivalent is not).

Box is special in a stable way. You can move out of a box: let a = *Box::new(vec![]);.

chorman0773 commented 3 years ago

Box is special in a stable way

I moved and generalized that particular lang item into an unstable (but not quite: the impl on Box is stable to use via operators) DerefMove trait. I believe discussions are already in place to make that part of Rust.

RalfJung commented 3 years ago

The comment left indicates that when it was written it was at the very least considered undefined

Indeed, and the comment also indicates that this is considered a bug in the standard library, precisely because not even the standard library may cause UB. In rustc, libstd is not privileged wrt UB, and any place that acts differently is a bug. The "privileged knowledge" part here explains why this is not a P-high bug that needs fixing immediately (it argues for why this bug is currently unlikely to negatively impact users), but it does not make this any less of a bug.

This is very different from saying "it is okay for libstd to do something like this". It is not okay, and this particular bug is on track to be fixed by this RFC. Once that RFC is implemented, this FIXME will finally disappear. I have been waiting for that for a long time. :)

chorman0773 commented 3 years ago

considered a bug in the standard library

As far as I can tell, it was done intentionally, perhaps to satisfy a requirement that is impossible or grossly inefficient to meet otherwise. Even if it is considered a bug, it may be a necessary one. I have not looked at the RFC, but given that the choice is to change the language, not the implementation, I stand by what I said. Standard libraries will frequently do things that require explicit compiler support to be efficient, not necessarily because it would be impossible otherwise (though, as mentioned, a reason for something being in the standard library is that it's impossible to implement in the language itself). For example, while not UB, clang and gcc define an intrinsic to implement std::make_integer_sequence<T,N> in fewer than O(N) template instantiations (I think it's O(log N); lccc does it internally in O(1)).