Explicit Undefined Behavior not identified with a compile-time warning

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

http://llvm.org

Other

29.04k stars 11.98k forks source link

Explicit Undefined Behavior not identified with a compile-time warning #53347

Open wjristow opened 2 years ago

wjristow commented 2 years ago

According to the Standard ([expr.new].4). the following code:

template<typename T>
class TVec3 {
public:
  T X, Y, Z;
  // These are undefined behavior whenever 'Index' is non-zero.
  inline __attribute__((__always_inline__)) T& operator[](int Index) {
    return (&X)[Index];
  }
  inline __attribute__((__always_inline__)) T operator[](int Index) const {
    return (&X)[Index];
  }
};

is undefined behavior for any non-zero values of Index.

Yet, we don't produce any warnings for this sort of thing, even with -Weverything (and even when uses of the operator[] methods appear with explicit non-zero values). For cases with explicit non-zero values, no static analysis is needed to see the problem. And even without explicit non-zero values, a warning indicating that it's undefined behavior if Index is ever non-zero would be useful, as we'd naturally expect these operator[] methods to sometimes be called with non-zero values (in fact, rare to only be called with 0). Something like: warning: uses of non-zero values for 'Index' results in undefined behavior

would seem fine -- that is, it wouldn't seem to be a "noisy false-positive".

FTR, clang's Static Analysis also doesn't note anything when non-zero index values are used. But I think the additional complexity of static analysis shouldn't be needed for this sort of thing.

(See issue #53307, for an example where this undefined behavior resulted in DSE removing some stores, causing a run-time failure. Thanks to @fhahn and @nikic for their work in identifying the problem there.)

Related to this, my understanding has been that the following approach is a safe way to do this sort of indexing:

template<typename T>
class TVec3 {
public:
  T X, Y, Z;
  inline __attribute__((__always_inline__)) T& operator[](int Index) {
    return (static_cast<T*>(&X))[Index];
  }
  inline __attribute__((__always_inline__)) T operator[](int Index) const {
    return (static_cast<const T*>(&X))[Index];
  }
};

That is, I've been under the impression that given a sequence of data members of the same type, and a pointer (of that type) pointing to one of them, then you can increment/decrement that pointer to safely access the neighboring data members. But in researching this today, I haven't been able to find a statement in the Standard that clearly supports this. In fact, I found one discussion that contradicts it. Specifically, at: https://en.cppreference.com/w/cpp/language/pointer

I see:

struct C {
   int x, y;
} c;

int* px = &c.x;   // value of px is "pointer to c.x"
int* pxe= px + 1; // value of pxe is "pointer past the end of c.x"
int* py = &c.y;   // value of py is "pointer to c.y"

assert(pxe == py); // == tests if two pointers represent the same address
                   // may or may not fire

*pxe = 1; // undefined behavior even if the assertion does not fire

FTR, assuming that's correct (i.e., assuming the *pxe = 1; is undefined behavior), then I'd have to think that if optimizations were done based on this undefined behavior, then a massive amount of code that we compile with llvm would get run-time failures. (Also, assuming that's correct, then I don't see any "clean" way to safely write an operator[] method for cases with a sequence of same-typed data members.)

llvmbot commented 2 years ago

@llvm/issue-subscribers-clang-frontend

nikic commented 2 years ago

Yes, I believe the second example is also UB. The right way to do this is probably declare the members as T[3], or make it a union with T[3].

Alternatively, -fno-strict-aliasing should disable these assumptions.

wjristow commented 2 years ago

Yes, I believe the second example is also UB. The right way to do this is probably declare the members as T[3], or make it a union with T[3].

Alternatively, -fno-strict-aliasing should disable these assumptions.

Thanks @nikic . Making it an array, or a union with an array, sound like good approaches to avoid this UB. (As an aside, the person who reported it to me did experiment with using -fno-strict-aliasing as a possible workaround, but their goal was to leave those strict-aliasing optimizations enabled.)

We're not doing language-design here, but I'd think having this as "implementation defined" behavior rather than "undefined" behavior would have been sensible. An implementation does have to decide how it lays out these members. And for consistency, it cannot change that layout from release to release. So in the code snippet:

assert(pxe == py); // == tests if two pointers represent the same address
                   // may or may not fire

*pxe = 1; // undefined behavior even if the assertion does not fire

if the assert is true for an implementation, it would seem fine to have made that been the "implementation defined" rule of the ABI layout. But again, we're not designing the language. :) So with it being UB, we need to live with that.

aaronpuchert commented 2 years ago

That is, I've been under the impression that given a sequence of data members of the same type, and a pointer (of that type) pointing to one of them, then you can increment/decrement that pointer to safely access the neighboring data members. But in researching this today, I haven't been able to find a statement in the Standard that clearly supports this.

The subscripting operator is syntactic sugar for pointer addition and dereferencing by [expr.sub]p2 and pointer arithmetic with a non-zero offset is undefined according to [expr.add]p4 if the pointer doesn't point into an array, or the offset is out-of-bounds. So the question is, can we consider this an array? I think the relevant clause is [basic.compound]p4, which more or less explicitly excludes arrays. So &X and this should be pointer-interconvertible, but there is no way to an array of T. I'm not sure if the strict aliasing rule is an issue, because the final access is correctly typed.

Also, assuming that's correct, then I don't see any "clean" way to safely write an operator[] method for cases with a sequence of same-typed data members.

Right, you either have to use an array for the coordinates (which I would advocate for) or have some kind of switch.

Yet, we don't produce any warnings for this sort of thing, even with -Weverything (and even when uses of the operator[] methods appear with explicit non-zero values). For cases with explicit non-zero values, no static analysis is needed to see the problem. And even without explicit non-zero values, a warning indicating that it's undefined behavior if Index is ever non-zero would be useful, as we'd naturally expect these operator[] methods to sometimes be called with non-zero values (in fact, rare to only be called with 0). Something like: warning: uses of non-zero values for 'Index' results in undefined behavior

would seem fine -- that is, it wouldn't seem to be a "noisy false-positive".

I guess it's one of those things that one rarely does by accident, so it would have more of an educational value. But still, it would be nice to make C++ somewhat beginner-friendly. So what is the scope here? Do we want to diagnose only that particular expression or do we want to detect this when spread over multiple statements, i.e.

T *foo = &X;
// ...
foo[Index];

Perhaps even over multiple functions?

fn(&X);
// Somewhere else:
template<typename T>
void fn(T *t) {
    t[1];
}

We could detect pointer arithmetic on something that we know is not an array, but the question is how far to take it. The last example would require the static analyzer certainly, but it's also the most realistic example. Analysis within a function could still happen in the compiler, and the "single statement" example could be diagnosed in Sema.

We're not doing language-design here, but I'd think having this as "implementation defined" behavior rather than "undefined" behavior would have been sensible. An implementation does have to decide how it lays out these members.

Implementation-defined behavior introduces portability issues of course, but I think the real reason to have this (and many similar things) as undefined behavior is that it prevents "address guessing" and thus enables better alias analysis. So for example I might be able to say that deriving a pointer to Y from a given pointer to X is impossible using defined behavior, which means that I can move loads and stores to Y across a function call that has a pointer to X as argument. (In that particular case I think it's not true because X is the first member of the record, thus they are pointer-interconvertible.)

aaronpuchert commented 2 years ago

The most straightforward thing would probably be to extend -Warray-bounds.

wjristow commented 2 years ago

Thanks for those thorough comments, @aaronpuchert. I especially appreciate the references in the Standard.

Right, you either have to use an array for the coordinates (which I would advocate for) or have some kind of switch.

I definitely agree that using an array is the best way to do this. Unfortunately, for the customer of ours who encountered this, the original test-case is substantially more elaborate, and changing to an array isn't straightforward.

Regardless, moving on to the discussion about the warning:

I guess it's one of those things that one rarely does by accident, so it would have more of an educational value. But still, it would be nice to make C++ somewhat beginner-friendly.

So in the example test-case:

template<typename T>
class TVec3 {
public:
  T X, Y, Z;
// ...
}

the indexing definitely isn't done by accident. But I'd say given that the Standard does require members to be laid out in order, it's not too surprising that for a given implementation, a programmer would expect it to be "fair" to index into them like:

  inline __attribute__((__always_inline__)) T& operator[](int Index) {
    return (static_cast<T*>(&X))[Index];
  }

Especially for the case where the size of the struct is 3*sizeof(T), and the alignment of the struct is the alignment of T (which they can verify that it is, for a given implementation -- and which cannot legally be changed by an implementation, unless the updated implementation is breaking compatibility with previous releases).

Granted, for a case with 3 consecutive variable definitions "in the wild" (that is, not within a struct):

T X, Y, Z;

Then indexing off the address of one of them would be very rare. And I wouldn't feel a need to warn about indexing that way (although if the warning implementation is such that warning about this naturally also happens, that would be fine). FTR, I see we don't warn about that case now, either.

So what is the scope here? Do we want to diagnose only that particular expression or do we want to detect this when spread over multiple statements, ...

I was assuming only warning in the more "focused context" (so not across statements, for example). I see that in cases of actual array references, we only warn in a focused context. For example, we warn on the reference to Array[10] below, but not on the reference to Array[Index]:

T Array[10];
//  ...
int Index = 10;
T A = Array[Index];     // No warning
T B = Array[10];        // Warning
T C = Array[9];         // No warning (of course)

(FTR, if Index is declared const int rather than simply int, then we also get a warning for the reference to Array[Index].)

So analogously, I think a warning for:

(&X)[Index];

but no warning for:

T *foo = &X;
//  ...
foo[Index];

is very reasonable. Warning about problems in a wider scope would also be fine of course, but I don't think it's required to make this useful (and I'd say the wider scope aspect feels more like a static analysis job, than a compile-time warning job).

The most straightforward thing would probably be to extend -Warray-bounds.

That sounds sensible to me. It especially seems "natural", since (&X)[0] is legal, but an explicit non-zero index like (&X)[1] is not. (FTR, we don't currently give a warning on the explicit non-zero index.)

Regarding the side-discussion about undefined behavior vs implementation-defined behavior:

Implementation-defined behavior introduces portability issues of course, but I think the real reason to have this (and many similar things) as undefined behavior is that it prevents "address guessing" and thus enables better alias analysis. ...

Hmmmm.... To be specific about what I previously thought was legal, and am thinking would make sense to consider to be implementation-defined, I mean:

That is, I've been under the impression that given a sequence of data members of the same type, and a pointer (of that type) pointing to one of them, then you can increment/decrement that pointer to safely access the neighboring data members.

My sense is that allowing that, wouldn't result in a loss of alias-based optimization opportunities. But I have to admit that I haven't thought this through completely. And that I don't know the details of llvm's alias-analysis implementation, so possibly it would complicate the implementation in a way that would result in it being more conservative (which is to say, to lose optimization opportunities). So maybe allowing this to be implementation-defined would have negative consequences.

aaronpuchert commented 2 years ago

But I'd say given that the Standard does require members to be laid out in order, it's not too surprising that for a given implementation, a programmer would expect it to be "fair" to index into them [...]

These are unrelated notions. C++ is not assembly language, and pointers are not addresses. Basically a pointer is derived from an object and “remembers” that object. This isn't visible in the generated machine code which leads some to believe that this connection doesn't exist. And of course the compiler enforces very little about this. (Perhaps such a warning could help here.)

Granted, for a case with 3 consecutive variable definitions "in the wild" (that is, not within a struct) [...] Then indexing off the address of one of them would be very rare.

What's indeed different in the case of local variables is “allocation address nondeterminism”, i.e. we have no idea how these are laid out on the stack, if at all.

(FTR, if Index is declared const int rather than simply int, then we also get a warning for the reference to Array[Index].)

Compiler warnings don't bother tracking possible values of non-const variables, which might be control-flow dependent. They'll only plug in compile-time constants, i.e. warn on constant expressions out of bounds.

But that's for the index, what about the pointer? Should we track that down? (If it's a constant expression?)

My sense is that allowing that, wouldn't result in a loss of alias-based optimization opportunities. But I have to admit that I haven't thought this through completely. And that I don't know the details of llvm's alias-analysis implementation, so possibly it would complicate the implementation in a way that would result in it being more conservative (which is to say, to lose optimization opportunities). So maybe allowing this to be implementation-defined would have negative consequences.

Well, isn't the original bug proof that we'd loose optimization opportunities if we made behavior defined? After all, there was some optimization (that you didn't want, but somebody else might) that we'd probably lose. It's hard for me to estimate how important that is, but we're not talking about the empty set at least.

wjristow commented 2 years ago

Well, isn't the original bug proof that we'd loose optimization opportunities if we made behavior defined? After all, there was some optimization (that you didn't want, but somebody else might) that we'd probably lose. It's hard for me to estimate how important that is, but we're not talking about the empty set at least.

Fair enough. Certainly the customer of mine who reported the run-time failure didn't want those stores to be eliminated, but somebody else might have an example where the store could be deleted without causing a run-time failure. And I have to admit that when I reduced the test-case the customer provided so I could clearly see what was going wrong, I still didn't realize it was undefined behavior (I completely thought it was an optimization bug -- it's only in looking into this more deeply that I came to understand it).

Regarding the discussion about the 3 consecutive variable definitions "in the wild":

What's indeed different in the case of local variables is “allocation address nondeterminism”, i.e. we have no idea how these are laid out on the stack, if at all.

We're completely in agreement here. I was pointing out that behavior as an example where regardless of the "undefined" vs "implementation-defined" discussion, clearly it's unsafe for a programmer to do address-arithmetic from one variable to get to another. And FTR, I was making that point for more than local variables that are "conceptually" laid out on the stack -- that is, address-arithmetic applied to global variables "in the wild" that are defined consecutively cannot be accessed via address arithmetic from one of their "neighbors".

And regarding the int vs const int out-of-bounds array references:

Compiler warnings don't bother tracking possible values of non-const variables, which might be control-flow dependent. They'll only plug in compile-time constants, i.e. warn on constant expressions out of bounds.

Again, we're in complete agreement here. I was only clarifying that in my answer to the question of:

So what is the scope here? Do we want to diagnose only that particular expression or do we want to detect this when spread over multiple statements, ...

since for explicit array references, we don't warn when the violation appears "across multiple statements", my view is it is fine to have the warning being discussed here to also not catch violations across multiple statements. But I felt that in full disclosure, I should note that if the out-of-bounds index is a const int, then the warning is produced in a multiple statement context.

That leads to your question:

But that's for the index, what about the pointer? Should we track that down? (If it's a constant expression?)

Similar to before, I think that's more elaborate than what is needed for a compile-time warning. For an explicit array, if we create a pointer to it, and reference it via an out-of-bounds constant expression, we don't warn:

T Array[10];
  ...

T * const foo = Array;
const int Index = 10;
T A = foo[Index];       // No warning
T B = foo[10];          // No warning
T C = foo[Index];       // No warning
T D = Array[Index];     // Warning

I'd say improving that would be fine, but outside of the scope of the discussion here. So independently, if we were to improve the warning-coverage in these constant expression situations, then it would make sense to also apply that improvement to the suggested new warning being discussed here.

aaronpuchert commented 2 years ago

But that's for the index, what about the pointer? Should we track that down? (If it's a constant expression?)

Similar to before, I think that's more elaborate than what is needed for a compile-time warning. For an explicit array, if we create a pointer to it, and reference it via an out-of-bounds constant expression, we don't warn:
T Array[10];
  ...

T * const foo = Array;
const int Index = 10;
T A = foo[Index];       // No warning
T B = foo[10];          // No warning
T C = foo[Index];       // No warning
T D = Array[Index];     // Warning
I'd say improving that would be fine, but outside of the scope of the discussion here. So independently, if we were to improve the warning-coverage in these constant expression situations, then it would make sense to also apply that improvement to the suggested new warning being discussed here.

Thanks for researching that! Then I'll guess we'll stick to an address-of operator (on a non-array) followed by indexing with anything but a constant 0.

wjristow commented 2 years ago

Thanks for researching that! Then I'll guess we'll stick to an address-of operator (on a non-array) followed by indexing with anything but a constant 0.

That sounds great. Thanks!

aaronpuchert commented 2 years ago

Maybe we can hear from @AaronBallman and @zygoloid what they think about warning on array accesses into non-arrays.

I've had a brief look into the implementation of -Warray-bounds, and it seems to go to some lengths to avoid false positives. It recognizes "tail-padded member arrays", for example.

AaronBallman commented 2 years ago

It's a pretty complex landscape (the rules in C and C++ are different because they have different object models, and the rules are sometimes not always obvious). For example, here's an array access into an object that is not an array, but is perfectly acceptable to both C and C++:

int a;
int *a_start = &a[0], *a_end = &a[1];

So long as you don't dereference a_end, you're fine. https://eel.is/c++draft/basic.compound#3.sentence-12 and C 6.5.6p8. There are similar games played for members of a structure being treated as an array, but only sometimes (C 6.2.5p15 about complex types).

I believe the original example is undefined behavior for any index other than 0. X is not an array object and you're indexing into it as if it was, and potentially dereferencing what you get. For 1, that's the "past the end" pointer (https://eel.is/c++draft/basic.compound#3.2) and it's not legal to dereference that (https://eel.is/c++draft/expr.add#4.3) because it doesn't point to a valid element of the array.

In general, I think the issue isn't the formation of the array from the non-array object so much as it is the access to the object as an array element. This suggests (to me at least) that the static analyzer or UBSan is a better place for such a diagnostic to live, because that will track the data flow for the access and which index is used. It may be possible to adjust -Warray-bounds, but I think that may generate false positives in code like:

void func(int *a, size_t max_count) {
  for (size_t idx = 0; idx < max_count; ++idx) {
    int i = a[idx];
    ...
  }
}

int main(void) {
  int array[10];
  func(array, 10);

  int a = 12;
  func(&a, 1);
}

wjristow commented 2 years ago

I was thinking of a relatively narrow situation where this warning would apply, and I don't think it would suffer from many false positives (possibly virtually zero false positives in real code).

Specifically, I was considering this as only applying to operator[] methods where the address of a non-array item was used as the base of the index. In that case, the only legal index for referencing memory is 0. After reading your concerns, I can see that my original suggested wording:

warning: uses of non-zero values for 'Index' results in undefined behavior

is too strong, because an index of 1 is legal, for a situation analogous to one of the code-fragments you noted:

int a;
int *a_start = &a[0], *a_end = &a[1];

With that in mind, I'd propose this wording (or something similar):

warning: referencing memory with non-zero values for 'Index' has undefined behavior

or, when there is a constant value used as the index (like &a[1]), then the wording could be something like:

warning: referencing memory of non-array item `a` via '&a[1]' has undefined behavior

As you noted:

So long as you don't dereference a_end, you're fine.

With this new proposed wording, the above warning is technically still "correct" for the a_end item, in that it warns that a pointer has been initialized, and if it is later dereferenced, then the behavior is undefined. (I'd agree that this could lead to false positives, but if that concern pans out, then the warning could be suppressed on an index of 1. Also, in the case of an explicit array, I'm not suggesting a warning be produced, and I would think that this sort of thing virtually never happens for non-array objects.)

Although I was considering to only have this warning apply to operator[] methods (rooted in the the address of a non-array item), as I think about it, it could also be applied to [] more generally, when the address of a non-array item is involved. (In the case of an explicit array, I'm not suggesting a warning be produced. And I would think generally that code in the style below virtually never exists, but it illustrates a point.) For example:

int a;

int foo(int index) {
  int r = 0;
  r += a;            // A normal reference to 'a'.  (Equivalent to the next line.)
  r += (&a)[0];      // Strange subscripting of a scalar, but index is 0, so valid.
  r += (&a)[index];  // Only valid when index is 0.
  r += (&a)[1];      // Non-0 index, so has undefined behavior.
  return r;
}

The warning could reasonably be applied on the following two lines (or even an error on the second line):

  r += (&a)[index];  // Only valid when index is 0.
  r += (&a)[1];      // Non-0 index, so has undefined behavior.

I'm not arguing that we should catch this sort of code, I'm just saying that if the implementation of the warning for the operator[] methods is such that warnings in these cases "naturally happen", that would be fine.

I have to admit that the point about complex types:

There are similar games played for members of a structure being treated as an array, but only sometimes (C 6.2.5p15 about complex types).

does lead to false positives in the proposed warning. Suppressing the warning for complex types may not be difficult?

I'd summarize, that my view is I think this can be implemented in a way that false positives wouldn't be an issue. (Granted, the area where this analysis is done isn't an area I've worked on. But my sense is that it sounds do-able.)

Lastly, I do very much agree that both the static analyzer and UBSan are natural places for this sort of problem to be caught. But with the caveat that I haven't worked in either of those areas, they seem like they would be more difficult to implement in those places, as compared with a compile-time warning. (My sense is probably the implementation would be pretty straightforward in the static analyzer, but more difficult in UBSan.) That said, when a warning can be done at compile-time (without "oppressive" false positives), I think it's much more useful, since running static analysis and/or sanitizers isn't in the day-to-day workflow of many developers.