Closed steakhal closed 6 months ago
@llvm/issue-subscribers-clang-static-analyzer
Author: Balazs Benics (steakhal)
I used this patch to find the maximum symbol complexity:
diff --git a/clang/lib/StaticAnalyzer/Checkers/Taint.cpp b/clang/lib/StaticAnalyzer/Checkers/Taint.cpp
@@ -258,6 +258,12 @@ std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,
if (!Sym)
return TaintedSymbols;
+ static unsigned maxComplexity = 0;
+ if (unsigned comp = Sym->computeComplexity(); comp > maxComplexity) {
+ maxComplexity = comp;
+ llvm::errs() << "New max complexity: " << comp << "\n";
+ }
+
// Traverse all the symbols this symbol depends on to see if any are tainted.
for (SymbolRef SubSym : Sym->symbols()) {
if (!isa<SymbolData>(SubSym))
If I skip taint checking in OOBv2, the running times improve.
For the decode_c82p
entry point, previously it took 1456 ms, but with skipping the taint checking it only takes 1706 ms.
It would be still 17% slower than before, but considering that this is more of an edge case, that would be somewhat acceptable.
I applied this for disabling taint checking:
diff --git a/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp b/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
@@ -401,7 +401,7 @@ void ArrayBoundCheckerV2::performCheck(const Expr *E, CheckerContext &C) const {
reportOOB(C, ExceedsUpperBound, OOB_Exceeds, ByteOffset, Msgs, E);
return;
}
- if (isTainted(State, ByteOffset)) {
+ if (false && isTainted(State, ByteOffset)) {
// Both cases are possible, but the offset is tainted, so report.
std::string RegName = getRegionName(Reg);
Thanks for alerting me, I'll investigate this issue. I'm not too familiar with the tainted symbol algorithm, but I roughly recall that it has bad worst-case complexity, so I'm not very surprised that it blew up.
I found some exponentially inefficient recursion in getTaintedSymbolsImpl()
, which is clearly capable of producing the symptoms that you observed, especially on the code of sheervideo.c
which is full of complex nested ElementRegion
expressions. I cannot explain why this inefficiency would be activated/triggered by the commit which changes the callbacks of ArrayBoundV2
, but I think that it's too "good fit" to be a coincidence (and it should be fixed even if your observations are caused by a different bug).
Could you check whether my PR https://github.com/llvm/llvm-project/pull/89606 resolves the performance issue that you observed?
Thanks for looking into this issue. I really appreciate that!
If we accept how we check ElementRegions
for taint, and check both Base-region and the Offset, then we could memoize the results to cut recursion.
If we say, everything works as intended, but ElementRegions
should be considered tainted only if the Offset is tainted, then your PR seems valid to me.
If we ask ourselves, why was the maximal symbol complexity of the symbols passed to isTainted
at most 4-5 prior the change, and suddenly 31-35 (hitting the max complexity), then I'm not sure if this is expected and intended. We might end up with other consequences because of these largely complex symbols in other places, and isTaint
was just an example.
Depending on this, we might cure a symptom and not the cause with the PR you propose or memoization.
I'd be interested in what is the value of having so complex symbols at times, and what is actually causing those huge symbols.
If we say, everything works as intended, but
ElementRegions
should be considered tainted only if the Offset is tainted, then your PR seems valid to me.
I think you're misunderstanding my commit -- an ElementRegion
with a tainted super-region will be considered tainted even after merging my commit (because the ElementRegion
is a SubRegion
and the branch for subregions still performs the check once).
I'd be interested in what is the value of having so complex symbols at times, and what is actually causing those huge symbols.
I agree that this would require separate investigation. My wild guess is that there is some normalization step that happens between the PostStmt
and the Location
callbacks...
I had a look at the patch and that seems solid. That should land even if we decide to continue this investigation here. Landing that would close this issue, but we can always reopen it or split a different ticket if the findings justify so.
I merged my commit and created a new issue for the symbol complexity part of this issue. I'll start working on that question, probably by examining the logic that happens between the PostStmt
and Location
callbacks.
When release testing on top of clang-18, I noticed some significant (many X) slowdown on certain projects, e.g. on FFmpeg. When investigated, the most impacted TU was
sheervideo.c
, where we had 33 seconds in the past, but now it takes around 1.23 hours to analyze. At that point, I'd count it as a hang.After bisecting, the blamed change was #72107
Switch to PostStmt callbacks in ArrayBoundV2
.I used
perf
to get a glimps on where we spend our time, and here is how it looks:The
OOBv2
checker checks array subscripts, and then it spends most of its time on traversing symbols to see if any of it is tainted.I've also checked that prior to this change, the maximum
Sym->computeComplexity()
withingetTaintedSymbolsImpl
was significantly lower than after the change. After the change this maximal complexity was more around the threshold (35), 30-33.I dumped the state when the
getTaintedSymbolsImpl
appeared 63 times in the call stack and the state dump itself, was huge. Several if not all lines individually taking up 26 Megabytes. Many of these lines encoded the history of some "hashing-like" computation.I'm not there yet to make suggestions for a fix, but I anyways wanted to let you know. @NagyDonat
Here is the preprocessed reproducer: sheervideo.zip
Here is the command to run it: