llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.09k stars 11.59k forks source link

Clang misdiagnoses UB (emits unreachable IR) when incrementing unspecified value #58600

Open BenCStokes opened 1 year ago

BenCStokes commented 1 year ago

Clang, with -O1 or higher, seems to think the following C code invokes undefined behaviour. The IR output given by clang -std=c17 -O1 -emit-llvm -S has main as just unreachable.

#include <stdio.h>

int main(void) {
    unsigned char x;
    (void) &x;
    while (++x) putchar('a');
    putchar('\n');
}

My understanding is that this is not UB under C17 because the address of x is taken and unsigned char can't have trap representations, so the value of x is unspecified until the first increment. Interestingly, if you replace ++x with x+=1, Clang compiles the code correctly, even though these are equivalent according to the standard. The same bug happens with x++, --x or x-- instead of ++x.

erichkeane commented 1 year ago

Not sure about the UB rule here, but the reason x+=1 doesn't happen is I think that OPT (which notices the x 'uninit' value) is getting confused by the integer promotion.

If you instead replace that with a 'unsigned int', you get unreachable.

AaronBallman commented 1 year ago

By my reading of the C standard, this is unspecified behavior (so a different U than the usual meaning for UB). All references below are to C2x N3047.

6.2.4p6: ... The initial representation of the object is indeterminate. If an initialization is specified for the object, it is performed each time the declaration or compound literal is reached in the execution of the block; otherwise, the representation of the object becomes indeterminate each time the declaration is reached.

also 6.7.10p11: If an object that has automatic storage duration is initialized with an empty initializer, its value is the same as the initialization of a static storage duration object. Otherwise, if an object that has automatic storage duration is not initialized explicitly, its representation is indeterminate. ...

3.19.2: indeterminate representation object representation that either represents an unspecified value or is a non-value representation

3.19.3: unspecified value valid value of the relevant type where this document imposes no requirements on which value is chosen in any instance

3.19.4: non-value representation an object representation that does not represent a value of the object type

6.2.6.1p6: (does not apply because "does not have character type") Certain object representations need not represent a value of the object type. If such a representation is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined55). Such a representation is called a non-value representation.

This leaves "unspecified value: as the only possible indeterminate representation, so: 3.4.4: unspecified behavior behavior, that results from the use of an unspecified value, or other behavior upon which this document provides two or more possibilities and imposes no further requirements on which is chosen in any instance

However, it's very much worth noting that DR260 and DR451 are both involved and that's somewhat of an open area of discussion (around "wobbly values") and TS 6010 is relevant, so I would caution against making changes until it's more clear what direction the C committee is going.

shafik commented 1 year ago

As an extra note C++ makes this undefined behavior as per basic.indet p2, it carves out some exceptions for unsigned char and std::byte to produces indeterminate values in some cases but not this one.

llvmbot commented 1 year ago

@llvm/issue-subscribers-c

Clang, with -O1 or higher, seems to think the following C code invokes undefined behaviour. The IR output given by `clang -std=c17 -O1 -emit-llvm -S` has `main` as just `unreachable`. ```c #include <stdio.h> int main(void) { unsigned char x; (void) &x; while (++x) putchar('a'); putchar('\n'); } ``` My understanding is that this is _not_ UB under C17 because the address of `x` is taken and `unsigned char` can't have trap representations, so the value of `x` is unspecified until the first increment. Interestingly, if you replace `++x` with `x+=1`, Clang compiles the code correctly, even though these are equivalent according to the standard. The same bug happens with `x++`, `--x` or `x--` instead of `++x`.