Invalid codegen when comparing pointer to one past the end and then dereferencing that pointer

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

http://llvm.org

Other

29.02k stars 11.96k forks source link

Invalid codegen when comparing pointer to one past the end and then dereferencing that pointer #51912

Open GabrielRavier opened 2 years ago

GabrielRavier commented 2 years ago


Bugzilla Link	52570
Version	trunk
OS	Linux
CC	@dwblaikie,@DougGregor,@randomnetcat,@JohelEGP,@momchil-velikov,@zygoloid

Extended Description

extern int x[1], y;

int f(int *p, int *q) {
    *q = y;
    if (p == (x + 1)) {
        *p = 2;
        return y;
    }
    return 0;
}

LLVM trunk currently outputs the following code with -O3 -mtune=znver3 (without it it outputs correct code but I'm pretty sure it still makes the same wrong assumption about the value of p):

f:
        mov     eax, dword ptr [rip + y]
        mov     dword ptr [rsi], eax
        xor     eax, eax
        cmp     rdi, offset x+4
        je      .LBB0_1
        ret
.LBB0_1:
        mov     eax, dword ptr [rip + y]
        mov     dword ptr [rip + x+4], 2
        ret

Which is incorrect because p could point to y, for example if f was called as such:

int whatever;
f(&y, &whatever);

and y could happen to be located in memory right after x.

Also, although the comparison's result is unspecified, this still means only two results are possible according to the standard:

if p == (x + 1) results in false, then the result of f is 0
if p == (x + 1) results in true, then the result of f is 2 since we do *p = 2 and p points to y

LLVM's optimization makes it so the result can also be the previous value of y, which could be something else than 0 or 2.

It seems that LLVM assumes that because p == (x + 1) it can replace all occurences of p with x + 1 without any regard to provenance, and doing that change manually would indeed mean the return y; could be optimized to use the previous store (and the store to x + 1 would be UB, too...), but this isn't the case here: p could simultaneously validly point to y and be equal to x + 1.

Godbolt link: https://godbolt.org/z/v73ME48qd

GabrielRavier commented 2 years ago

There is an explicit provision if you read the standard itself:

If one pointer represents the address of a complete object, and another pointer represents the address one past the last element of a different complete object, the result of the comparison is unspecified.

[expr.eq] (https://eel.is/c++draft/expr.eq#3.1)

Also, I believe the article you are citing is incorrect about relational comparisons of pointers to different objects, the standard only indicates that "neither pointer is required to compare greater than the other" ([expr.rel] https://eel.is/c++draft/expr.rel#4.3) if they don't point to the same object, which implies an unspecified result, not undefined behavior.

This wording seems to only apply to "The result of comparing unequal pointers to objects" - pointers to one-past-the-end are not pointers to objects, so I think that's where the "undefined behavior" arises from. The spec doesn't say what should happen with such a comparison, so it's undefined.

No, [expr.eq] which I just cited explicitly says that comparison's result is unspecified, and as I've detailed in the description, this does not give LLVM license to do the transformation it does, as it has no way of ensuring &y == (x + 1) will always return false.

With regards to other relational operators, the standard also has a footnote that is linked to in the same sentence in which "The result of comparing unequal pointers to objects" is found: "A pointer past the last element of an array of n elements is considered to be equivalent to a pointer to a hypothetical array element n for this purpose."

[expr.rel] (https://eel.is/c++draft/expr.rel#footnote-71)

A note in 6.8.2 [basic.compound] also mentions "A pointer past the end of an object (7.6.6) is not considered to point to an unrelated object of the object’s type that might be located at that address."(In reply to Jason Cobb from comment #3)

I only access the other object using p, which is a pointer that actually was derived from that object (where f is called with &y). If the comparison, by itself, made it so the compiler can assume p was derived from x, then that would imply that evaluating &y == (x + 1) would make *&y invoke undefined behavior, which I would hope you agree would be stupid.

The webpage cited in Comment 1 is about C, not C++.

Fair enough - I Could totally believe this is defined in C. We might already have a bug to dup against for that case.

I'm pretty sure this code does not invoke undefined behavior in both C and C++ (although it is unspecified whether it returns 0 or 2, but that does not give license for the transformation as it can give other results).

momchil-velikov commented 2 years ago

In the LLVM IR the store of 2 is done using a pointer value, whose underlying object is x.

IMHO, that is incorrect.

p == x + 1 evaluating to true could mean p is a non-dereferencable pointer into x or p points to an unrelated object, that happens to be allocated adjacent to x, but it does not follow that p does not point into y.

define dso_local i32 @_Z1fPiS_(i32* readnone %0, i32* nocapture %1) local_unnamed_addr #0 {
  %3 = load i32, i32* @y, align 4, !tbaa !3
  store i32 %3, i32* %1, align 4, !tbaa !3
  %4 = icmp eq i32* %0, getelementptr inbounds ([1 x i32], [1 x i32]* @x, i64 1, i64 0)
  br i1 %4, label %5, label %7

5:                                                ; preds = %2
  store i32 2, i32* getelementptr inbounds ([1 x i32], [1 x i32]* @x, i64 1, i64 0), align 4, !tbaa !3
  %6 = load i32, i32* @y, align 4, !tbaa !3
  br label %7

7:                                                ; preds = %2, %5
  %8 = phi i32 [ %6, %5 ], [ 0, %2 ]
  ret i32 %8
}

dwblaikie commented 2 years ago

There is an explicit provision if you read the standard itself:

If one pointer represents the address of a complete object, and another pointer represents the address one past the last element of a different complete object, the result of the comparison is unspecified.

[expr.eq] (https://eel.is/c++draft/expr.eq#3.1)

Also, I believe the article you are citing is incorrect about relational comparisons of pointers to different objects, the standard only indicates that "neither pointer is required to compare greater than the other" ([expr.rel] https://eel.is/c++draft/expr.rel#4.3) if they don't point to the same object, which implies an unspecified result, not undefined behavior.

This wording seems to only apply to "The result of comparing unequal pointers to objects" - pointers to one-past-the-end are not pointers to objects, so I think that's where the "undefined behavior" arises from. The spec doesn't say what should happen with such a comparison, so it's undefined.

A note in 6.8.2 [basic.compound] also mentions "A pointer past the end of an object (7.6.6) is not considered to point to an unrelated object of the object’s type that might be located at that address."

The webpage cited in Comment 1 is about C, not C++.

Fair enough - I Could totally believe this is defined in C. We might already have a bug to dup against for that case.

Some other related, possibly duplicate bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61502 llvm/llvm-project#33896 #c77 llvm/llvm-project#45725

GabrielRavier commented 2 years ago

Ah, well, here's a quote from the C standard, then:

Two pointers compare equal if and only if [...], or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.

C Standard, 6.5.9. Equality operators

If you're wondering about its usage of "array" when referring to the non-past-end-of-array pointer, here's another quote that should clarify that:

For the purposes of these operators, a pointer to an object that is not an element of an array behaves the same as a pointer to the first element of an array of length one with the type of the object as its element type.

C Standard, 6.5.9. Equality operators

It does look like for all the other relational operators, it is indeed true that "the behavior is undefined" (6.5.8. Relational operators) when used on pointers to different objects, although this should not be relevant to the test case from the description as it uses == and not any of the affected operators (<, >, <= and >=)

Also, it should be noted that Clang's behavior on this does not change if the testcase is compiled as C++.

randomnetcat commented 2 years ago

The webpage cited in Comment 1 is about C, not C++.

GabrielRavier commented 2 years ago

There is an explicit provision if you read the standard itself:

If one pointer represents the address of a complete object, and another pointer represents the address one past the last element of a different complete object, the result of the comparison is unspecified.

[expr.eq] (https://eel.is/c++draft/expr.eq#3.1)

Also, I believe the article you are citing is incorrect about relational comparisons of pointers to different objects, the standard only indicates that "neither pointer is required to compare greater than the other" ([expr.rel] https://eel.is/c++draft/expr.rel#4.3) if they don't point to the same object, which implies an unspecified result, not undefined behavior.

dwblaikie commented 2 years ago

I believe this is actually /undefined/ behavior, rather than just unspecified. For instance https://en.cppreference.com/w/c/language/operator_comparison lists a series of comparisons that are allowed, and the rest are undefined. The allowed list includes: "if one pointer points to the element of an array and the other pointer points one past the end of the same array, the one-past-the-end pointer compares greater" and "if two pointers point to the same object, or both point one past the end of the same array, they compare equal"

But there's no other provision for comparison of one-past-the-end pointers.