llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

UBSan misses check for "negative_value << 0" #32480

Open dbabokin opened 7 years ago

dbabokin commented 7 years ago
Bugzilla Link: 33133
Version: trunk
OS: All
CC: @vedantk

Extended Description

cat f.cpp
int i = -1;
int z = 0;

int main() { return (i << 0) + (i << z); }

clang++ -fsanitize=undefined f.cpp

./a.out

No error is issued.
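
For contrast, the discussion below mentions that ubsan already has a shift-of-negative-base check; here is a minimal sketch of the expected difference (g.cpp is a hypothetical file name, and the claim that the nonzero-shift case is diagnosed is an assumption based on that existing check, not verified output):

cat g.cpp
int i = -1;

int main() {
  int a = i << 1;  // negative base, nonzero shift amount: expected to be flagged by the existing check
  int b = i << 0;  // negative base, zero shift amount: slips through, which is this report
  return a + b;
}

clang++ -fsanitize=undefined g.cpp
./a.out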

vedantk commented 7 years ago

> Ouch, I had just been debugging another issue with a shift by a negative value in gcc-ubsan and hadn't mentally switched to this one, which is, of course, a shift of a negative base. Sorry about that.

No worries!

> I agree that the case of a shift of a negative base is much less compelling than a shift by a negative value. Probably the only reason for it to be UB in the standard is an attempt not to assume any specific encoding of negative integers in hardware, even though all practical implementations nowadays use two's complement encoding.

Seems likely.

> I know about the different groups of ubsan checks, but doesn't the default "-fsanitize=undefined" imply the maximum level? The point of "paranoid" checks is that they go beyond the default level so they don't affect the average user. I wouldn't even mind using a bunch of fine-grained command-line switches to enable this "paranoid" level.

The "undefined" group is meant to contain all of the checks which diagnose UB, including the existing shift-of-negative-base check (which is incomplete, as you pointed out).

My hesitation here comes from wanting to avoid "-Wpedantic"-style diagnostics (even in a separate group/flag), and a desire to remove the shift-of-negative-base check instead of extending it.
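
For reference, the shift checks can be toggled independently of the rest of the group; a sketch, assuming the clang group names shift, shift-base, and shift-exponent (check the clang UndefinedBehaviorSanitizer docs for the exact spellings):

clang++ -fsanitize=shift f.cpp                               # only the shift checks (base and exponent)
clang++ -fsanitize=undefined -fno-sanitize=shift-base f.cpp  # everything except the negative-base check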

dbabokin commented 7 years ago

Ouch, I had just been debugging another issue with a shift by a negative value in gcc-ubsan and hadn't mentally switched to this one, which is, of course, a shift of a negative base. Sorry about that.

I agree that the case of a shift of a negative base is much less compelling than a shift by a negative value. Probably the only reason for it to be UB in the standard is an attempt not to assume any specific encoding of negative integers in hardware, even though all practical implementations nowadays use two's complement encoding.

I know about the different groups of ubsan checks, but doesn't the default "-fsanitize=undefined" imply the maximum level? The point of "paranoid" checks is that they go beyond the default level so they don't affect the average user. I wouldn't even mind using a bunch of fine-grained command-line switches to enable this "paranoid" level.

vedantk commented 7 years ago

> No, I have not encountered a bug or "unexpected" behavior triggered by this kind of shift, though my use case is not quite typical. I was using creduce + clang-ubsan to reduce a test case while keeping it UB-free, and these shifts show up when I'm reducing bugs in gcc-ubsan. As a result of the reduction I get a test case with a shift by a negative value instead of the original problem. Of course, I can work around it by filtering the gcc-ubsan messages about shifts by negative values.

I'm assuming you mean a shift of a negative base.

> But I think "creduce+ubsan" is an important use case, which would benefit from stricter rules. Should we perhaps have different levels of ubsan messages? I would salute adding more and more paranoid rules to ubsan, but, as you said, it may scare away some users.

We have different groups of checks already. We don't have a "paranoid" group because we try to make every diagnostic compelling: if we find that some diagnostics need to be split off into a "paranoid" group, it'd be worth asking why those diagnostics exist. I appreciate your point about the need for strict checking. IMHO the shift of negative base check is not strict. Rather, it just isn't very compelling.

> Also, wearing the hat of an average user, my expectation of ubsan is that it guards me from all kinds of potential problems, like future compiler optimizations and portability problems.

I really appreciate this point, since this is one of the major selling points of ubsan at Apple. However, the shift-of-negative-base check doesn't appear to diagnose UB which popular compilers can (or will) take advantage of. Doing so would "break all the code" (at the very least, it would break a lot of codecs I've seen).

> And for a shift by a negative value I'm not even sure what the expected semantics are. I guess for most of the people who have it intentionally in their code, it's the x86 semantics of using just the lower bits of the rhs value. But what about portability to other HW platforms?

I agree that shifts by negative values are nonsensical. Clang's ubsan implementation diagnoses that kind of bug. Shifts of negative bases are a different matter. Users expect these to work just like shifts of unsigned values (i.e. 'shl' without nsw/nuw).

> As for potential compiler optimizations that might exploit it: vectorization, which may assume a "good" value in the rhs.

Clang does not lower '<<' into 'shl nuw/nsw', nor will it, AFAIK.
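
To make the "works just like a shift of an unsigned value" expectation concrete, here is a minimal sketch (the -16 result assumes a two's-complement target; the standard still makes the signed form itself UB, which is exactly the point under discussion):

#include <cassert>
#include <cstdint>

int main() {
  std::int32_t i = -1;
  int n = 4;
  // What users typically expect "i << n" to mean: shift the two's-complement
  // bit pattern as if it were unsigned, then reinterpret it as signed.
  std::int32_t expected = static_cast<std::int32_t>(static_cast<std::uint32_t>(i) << n);
  assert(expected == -16);  // 0xFFFFFFF0 reinterpreted as int32_t
  return 0;
}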

dbabokin commented 7 years ago

No, I have not encountered a bug or "unexpected" behavior triggered by this kind of shift, though my use case is not quite typical. I was using creduce + clang-ubsan to reduce a test case while keeping it UB-free, and these shifts show up when I'm reducing bugs in gcc-ubsan. As a result of the reduction I get a test case with a shift by a negative value instead of the original problem. Of course, I can work around it by filtering the gcc-ubsan messages about shifts by negative values.

But I think "creduce+ubsan" is an important use case, which would benefit from stricter rules. Should we perhaps have different levels of ubsan messages? I would salute adding more and more paranoid rules to ubsan, but, as you said, it may scare away some users.

Also, wearing the hat of an average user, my expectation of ubsan is that it guards me from all kinds of potential problems, like future compiler optimizations and portability problems. And for a shift by a negative value I'm not even sure what the expected semantics are. I guess for most of the people who have it intentionally in their code, it's the x86 semantics of using just the lower bits of the rhs value. But what about portability to other HW platforms?

As for potential compiler optimizations that might exploit it: vectorization, which may assume a "good" value in the rhs.
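
To illustrate what "x86 semantics of using just the lower bits of the rhs value" means in practice: 32-bit x86 shift instructions mask the count to its low five bits, and spelling that mask out explicitly is the portable way to get the same behavior (a sketch; the function name is made up for illustration):

#include <cstdint>

// Mimics what a 32-bit x86 shift does with an out-of-range count: only the
// low 5 bits of the count are used, so the result is defined on every platform.
std::int32_t shift_like_x86(std::int32_t value, std::int32_t count) {
  return static_cast<std::int32_t>(static_cast<std::uint32_t>(value) << (count & 31));
}

int main() {
  return shift_like_x86(1, 33) == 2 ? 0 : 1;  // 33 & 31 == 1, so the result is 1 << 1 == 2
}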

vedantk commented 7 years ago

Have you encountered a scenario in which you get a bug or an unexpected result due to undefined left shifts with negative bases?

I'm asking because several people on IRC, and in person, have mentioned that it wouldn't be beneficial for the compiler to take advantage of this form of UB (which, FWIW, is extremely common). Looking through llvm, I don't see any instances where we've taught the optimizer about this form of UB.

I'm very wary of adding diagnostics to ubsan that people don't want to pay attention to. Even if this type of expression technically has UB, if the diagnostic is not perceived to be related to a correctness problem, I've seen that users just turn the whole sanitizer off. It's important to avoid that outcome.

Of course if this UB has actually resulted in correctness problems, I would like to hear about it, and I could certainly help fix the diagnostic.
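
As an illustration of how common this form of UB is (and of what a stricter check would start flagging), here is the kind of mask-building idiom that left-shifts a negative base; the snippet is hypothetical, not taken from any codebase mentioned in this thread:

#include <cstdio>

int main() {
  int low_bits = 4;
  // ~0 has the value -1, so this left-shifts a negative base: technically UB,
  // but a very common way to build a mask that clears the low bits.
  int mask = ~0 << low_bits;
  std::printf("mask = %#x\n", static_cast<unsigned>(mask));  // 0xfffffff0 on two's-complement targets
  return 0;
}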

dbabokin commented 7 years ago

Could anyone have a look, please?

While it looks innocent, it causes problems when reducing test cases using creduce. Creduce introduces this pattern and clang fails to catch it. As a result, we have a reduced program with undefined behavior.