
[analyzer] integral casts are not modeled correctly #50380

steakhal opened 3 years ago

steakhal commented 3 years ago
Bugzilla Link 51036
Version 12.0
OS All
CC @devincoughlin, @ASDenysPetrov, @martong, @Kojoley, @haoNoQ, @SavchenkoValeriy

Extended Description

I've triaged a couple of false-positive reports and I observed something interesting. Check this example: https://godbolt.org/z/vb1Y5z6GE

```c
#include <stdio.h>

void clang_analyzer_printState() {}

void test_single(signed char c) {
  if ((unsigned int)c <= 200) { // 'c' supposed to be restricted to be non-negative.
    // The associated value range is not even close to the expected range.
    // 'c' is assumed to be within [-128,-56]
    // but it should be within [0,127].
    clang_analyzer_printState();

    // In fact, the execution of the program prints non-negative numbers.
    printf("%d\n", (int)c);
  }
}

#ifdef WITH_MAIN
int main() {
  // All possible signed char values are tested.
  for (int i = -128; i <= 127; ++i) {
    test_single(i);
  }
}
#endif
```

In the example, we can see that the runtime behavior of the code differs from the abstract machine that we symbolically model. Consequently, any bug report on that path that depends on the value of 'c' is inherently a false positive.
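The wrong constraint can also be observed directly with the debug.ExprInspection checker; below is a minimal sketch (the `observe` function is illustrative) where the analyzer currently answers FALSE even though the branch guarantees a non-negative value at runtime:

```c
// Run with: clang --analyze -Xclang -analyzer-checker=debug.ExprInspection
void clang_analyzer_eval(int); // debug hook of the ExprInspection checker

void observe(signed char c) {
  if ((unsigned int)c <= 200) {
    // At runtime this branch implies c >= 0, but with the current modeling
    // the analyzer keeps c within [-128,-56] and prints FALSE here.
    clang_analyzer_eval(c >= 0);
  }
}
```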

It can materialize in false positives from the ArrayBound V1 and V2 checkers, but presumably any other checker could suffer from it as well:

```c
buf[(unsigned int)c]
```

At the indexing operation, the analyzer is certain that the index is negative, hence that the access refers to an element before the first valid element of the array. Meanwhile, the user sees that the index expression has an unsigned type, so they are understandably confused.
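For concreteness, a hypothetical reproducer of that pattern could look like this (the `buf` array and the `lookup` function are illustrative names, not taken from a real report):

```c
char buf[256]; // illustrative buffer

char lookup(signed char c) {
  if ((unsigned int)c <= 200)    // at runtime this implies c is in [0,127]
    return buf[(unsigned int)c]; // analyzer: index is negative => bogus
                                 // out-of-bounds report
  return 0;
}
```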

AFAIK all released clang versions are affected by this behavior, according to Compiler Explorer.

I can understand that 'omitting' symbolic integral casts was a /good/ idea back when we had a /dumb/ constraint solver that could only look up constraints for symbolic expressions verbatim. However, this is no longer true. The equivalence classes and other witty stuff made our solver much smarter, and the /new/ non-fixpoint-iterative simplifier by Gabor Marton is making it even more so. BTW it's fascinating to see where it goes.

I think the way forward would be to emit symbolic casts whenever necessary and teach the solver to see through them. For the record: Denys proposed several changes in this area, simplifying the representation of widening and narrowing symbolic cast chains, but I think that's a slightly different topic; it wouldn't immediately arm the solver with the capability to see through the casts, AFAICT.
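To spell out what the solver would have to compute for this example when 'seeing through' the cast, here is a brute-force illustration (plain C, not analyzer code) that enumerates the exact preimage of the branch condition:

```c
#include <stdio.h>

// Enumerate every signed char value and record which ones satisfy the branch
// condition (unsigned int)c <= 200. By the C conversion rules the feasible
// set is exactly [0,127]: negative values convert to huge unsigned values.
int main(void) {
  int lo = 128, hi = -129;
  for (int i = -128; i <= 127; ++i) {
    signed char c = (signed char)i;
    if ((unsigned int)c <= 200u) {
      if (i < lo) lo = i;
      if (i > hi) hi = i;
    }
  }
  printf("feasible range of c: [%d, %d]\n", lo, hi); // prints [0, 127]
  return 0;
}
```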

ASDenysPetrov commented 3 years ago

Proposed a fix: https://reviews.llvm.org/D103096

steakhal commented 3 years ago

Good to hear, thanks for working on this topic.

ASDenysPetrov commented 3 years ago

> Here we reason about the range of c's particular representation as an unsigned int. The range of c should be [-57, 127]. That's how the C++ rules work and what my solution proposes.

Oh, I miscalculated the range of 'c'. Of course it should be [0,127]. I've just tested the example with my solution (https://reviews.llvm.org/D103096) and it produces [0,127] as expected. I'll add this example to the tests.
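A regression test for this could look roughly like the existing ExprInspection-based lit tests (the RUN line and expected warnings below are assumptions modeled on those tests, not the actual patch):

```c
// RUN: %clang_analyze_cc1 -analyzer-checker=core,debug.ExprInspection -verify %s

void clang_analyzer_eval(int);

void test_single(signed char c) {
  if ((unsigned int)c <= 200) {
    // With the fix, the branch constrains 'c' to [0,127].
    clang_analyzer_eval(c >= 0);   // expected-warning{{TRUE}}
    clang_analyzer_eval(c <= 127); // expected-warning{{TRUE}}
  }
}
```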

ASDenysPetrov commented 3 years ago

Hey, guys. I've just updated the summary of my revision. You can find a detailed description of the approach at https://reviews.llvm.org/D103096

Talking about your snippet:

```c
void test_single(signed char c) {
  if ((unsigned int)c <= 200) { // 'c' supposed to be restricted to be non-negative.
                                // but it should be within [0,127].
```

Here we reason about the range of c's particular representation as an unsigned int. The range of c should be [-57, 127]. That's how the C++ rules work and what my solution proposes.

steakhal commented 3 years ago

> In that example we can't know how many bytes the symbol binding overwrites if we don't know the type of the symbol. And this is one of the places where the type of the symbol is the only piece of type information available.

Could you be slightly more specific? I could not follow.

From my perspective, we simply bind an incorrect symbolic value to the cast expression when we evaluate it. We should not model a cast expression by simply binding the value of the castee to it, at least not in the general case. Right now we basically only look at the bit widths of the types, plus some truncation detection later:

```cpp
SVal SValBuilder::evalIntegralCast(ProgramStateRef state, SVal val,
                                   QualType castTy, QualType originalTy) {
  // No truncations if target type is big enough.
  if (getContext().getTypeSize(castTy) >= getContext().getTypeSize(originalTy))
    return evalCast(val, castTy, originalTy);
  // ...
```

However, I think we should emit a symbolic cast when we cast from signed to unsigned. In what specific way would emitting a symbolic cast there make the analysis worse than the current situation?

> A change that introduces 5-10 impossible-to-understand or outright false reports for every false positive it eliminates is not justified either.

Could you please add some context? Do you mean that the solver cannot see through symbolic casts, so we would get more false-positive bug paths depending on values derived from such symbolic casts?

haoNoQ commented 3 years ago

> I thought memory modeling and the strict-aliasing madness is an orthogonal topic to evaluating casts according to the semantics of C/C++.

Modeling casts basically means achieving type correctness. Type correctness is helpful in a lot of places. In that example we can't know how many bytes the symbol binding overwrites if we don't know the type of the symbol. And this is one of the places where the type of the symbol is the only piece of type information available.

> This doesn't justify modeling a completely infeasible execution path, like in the example in this ticket.

A change that introduces 5-10 impossible-to-understand or outright false reports for every false positive it eliminates is not justified either.

steakhal commented 3 years ago

> Constraint solver improvements may have improved the situation to the point where enabling casts is actually going to do more good than harm. If so, it'd be amazing and solve a lot of problems for us (including, but not limited to, llvm/llvm-project#43459). But I'll believe it when I see it.

I thought memory modeling and the strict-aliasing madness is an orthogonal topic to evaluating casts according to the semantics of C/C++.

> It was pretty bad last time I tried and I don't think my specific concerns were addressed yet. People are careless about integral types all the time and most of the time it's harmless.

IDK.

> Execution paths that involve overflows, truncations, or sign extensions are very difficult to explain to a person unfamiliar with all these rules and may seem like false positives unless a really good system of notes is implemented;

Yes, but that's an engineering problem. I agree it's indeed a difficult problem. This doesn't justify modeling a completely infeasible execution path, like in the example in this ticket.

> we may have to suppress such paths entirely.

IDK.

> See also my thoughts in https://reviews.llvm.org/D103096

I cannot immediately see why we could not make a BRVisitor explaining the effect of the interesting casts.

> P.S. Folks, I'm still shocked that you're trying to use array bound checker in production. It's terribly broken due to improper loop modeling. I'm fairly confident that it'll only ever work as a taint-based checker (warn about under-constrained tainted-symbol indexes) where concrete indexes introduced by loop unrolling don't intervene.

Some clients are using the ArrayBounds[V2] checkers because they see value in them. They have proven useful, finding true positives. On the other hand, they also tend to produce reports that are hard to follow. What FP/TP rate is still acceptable is a matter of personal preference.

I think taint analysis is great and many other checkers should consider using it, but let's not digress.

What I really want to conclude is that, on the path within the if block, we should not associate with c a range that doesn't even intersect the expected range [0,127]. If it were a super-range of the expected range, say c: [-128, 127], it would be a correct approximation, unlike the one we have right now.
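To make that soundness criterion concrete, here is a small illustration (the names and layout are mine, not analyzer code): an inferred range is an acceptable approximation only if it contains every feasible value, and the currently inferred range does not even intersect the feasible one.

```c
#include <stdio.h>

// An inferred range is a sound over-approximation only if it contains the
// exact feasible range; it is outright wrong if it doesn't even intersect it.
struct Range { int lo, hi; };

static int contains(struct Range outer, struct Range inner) {
  return outer.lo <= inner.lo && inner.hi <= outer.hi;
}

static int intersects(struct Range a, struct Range b) {
  int lo = a.lo > b.lo ? a.lo : b.lo;
  int hi = a.hi < b.hi ? a.hi : b.hi;
  return lo <= hi;
}

int main(void) {
  struct Range exact   = {0, 127};    /* feasible values of 'c' in the branch */
  struct Range current = {-128, -56}; /* what the analyzer currently infers   */
  struct Range sound   = {-128, 127}; /* correct, if imprecise, approximation */

  printf("current intersects exact: %d\n", intersects(current, exact)); /* 0 */
  printf("sound contains exact:     %d\n", contains(sound, exact));     /* 1 */
  return 0;
}
```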

haoNoQ commented 3 years ago

Constraint solver improvements may have improved the situation to the point where enabling casts is actually going to do more good than harm. If so, it'd be amazing and solve a lot of problems for us (including, but not limited to, llvm/llvm-project#43459). But I'll believe it when I see it.

It was pretty bad last time I tried and I don't think my specific concerns were addressed yet. People are careless about integral types all the time and most of the time it's harmless. Execution paths that involve overflows, truncations, or sign extensions, are very difficult to explain to a person unfamiliar with all these rules and may seem like false positives unless a really good system of notes is implemented; we may have to suppress such paths entirely. See also my thoughts in https://reviews.llvm.org/D103096

P.S. Folks, I'm still shocked that you're trying to use array bound checker in production. It's terribly broken due to improper loop modeling. I'm fairly confident that it'll only ever work as a taint-based checker (warn about under-constrained tainted-symbol indexes) where concrete indexes introduced by loop unrolling don't intervene.

steakhal commented 3 years ago

assigned to @ASDenysPetrov