mikaelholmen opened this issue 3 months ago
I'll have a look next week. Thanks.
Node reclamation bites us (again).
Here is a smaller repro: https://compiler-explorer.com/z/KTjxP11vd
(Note that to make the repro actually reproduce, you need to set graph-trim-interval=19; without overriding that interval, one could only trigger the reclamation with really ugly reproducers. See the invocation after the snippet.)
struct wrapper { int member; };
void create_vla(const struct wrapper *w) {
  int vla[w->member * 1];
}
void top(struct wrapper wrp) {
  if (wrp.member != 0)
    return;
  create_vla(&wrp);
}
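For reference, an invocation along the lines of the one from the original report should trigger it (assuming the snippet is saved as repro.c; the interval value is the one from the note above, and the exact checker set is only illustrative):

```
clang -cc1 -analyze -analyzer-checker=core -analyzer-config graph-trim-interval=19 repro.c
```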
Basically, the VLA size checker detects the bug and creates a bug report, and also calls bugreporter::trackExpressionValue on the report to track how we ended up with a zero size for the VLA.
That will call Tracker::create(Report)->track(E, InputNode, Opts); which internally decomposes the expression (in our case the multiplication binop) and tracks its left operand.
It is done by:
const Expr *Inner = peelOffOuterExpr(E, N);
const ExplodedNode *LVNode = findNodeForExpression(N, Inner);
Handler->handle(Inner, N, LVNode, Opts);
Remember that the ExplodedNode that contains the report is protected from node reclamation (ExplodedGraph::shouldCollect), but the LVNode returned by findNodeForExpression is not guaranteed to live long. And in this case, node reclamation will indeed reuse that ExplodedNode for a different node, eventually turning this LVNode dangling.
Now, the problem is that we will construct a BugReportVisitor saving this pointer, which will be dangling by the time we get to the report construction - when the bug report visitors kick in. It will refer to a completely different ExplodedNode :D
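To make the hazard concrete, here is a minimal sketch, assuming a made-up visitor class; only ExplodedNode/ExplodedGraph and the overall shape come from the real analyzer, the rest is illustrative:

```cpp
#include "clang/StaticAnalyzer/Core/PathSensitive/ExplodedGraph.h"

using namespace clang::ento;

// Hypothetical, simplified stand-in for a bug report visitor.
class NodeCapturingVisitorSketch {
  // Captured when the report is filed. Unlike the report's error node, this
  // node is not pinned, so node reclamation
  // (ExplodedGraph::reclaimRecentlyAllocatedNodes) may recycle its storage
  // for an unrelated node before the visitor ever runs.
  const ExplodedNode *LVNode;

public:
  explicit NodeCapturingVisitorSketch(const ExplodedNode *N) : LVNode(N) {}

  // Invoked much later, during report construction, when the bug report
  // visitors kick in. By then LVNode may point at a recycled node that
  // describes a completely different program point and state.
  void visit() const {
    (void)LVNode->getLocation(); // reads through a possibly dangling pointer
  }
};
```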
I'd say we could solve this by "deferring" the tracking in trackExpressionValue and exploiting the fact that the report node is never reclaimed.
What I'd prefer is to simply disable node reclamation and that would bring the immutability assumption back.
That said, I think this happens so infrequently, and without much actual consequence (without assertions), that I don't think this bug deserves immediate action.
For users who trip on this, we can suggest setting graph-trim-interval=0 as a mitigation.
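With the regular driver, that would look something like the following (forwarding the analyzer config option through -Xclang; the file name is just a placeholder):

```
clang --analyze -Xclang -analyzer-config -Xclang graph-trim-interval=0 file.c
```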
> What I'd prefer is to simply disable node reclamation and that would bring the immutability assumption back.
I agree that node reclamation is inelegant, and I would like to see it eliminated from the sources unless that change has a prohibitive impact on performance. What prevents us from disabling it?
> What prevents us from disabling it?
We have tried to disable it and ran the analysis on a couple of hundred open-source projects of various sizes. For most of them, the effect was on par with regular CI noise, but a couple of projects experienced a noticeable slowdown; in particular, translation units with large static arrays suffer. For example, analysis of llvm/lib/Target/MSP430/MSP430ISelDAGToDAG.cpp takes twice the time if node reclamation is disabled.
Some other performance losers:
| project | translation unit | w/o reclamation | with reclamation | noise |
|---|---|---|---|---|
| alibaba-mnn | express/MathOp.cpp | 35s | 24s | 24s |
| Clang | llvm/lib/Target/AArch64/AArch64RegisterInfo.cpp | 33s | 26s | 50s |
| Clang | llvm/lib/Target/BPF/BPFISelDAGToDAG.cpp | 30s | 23s | 1s |
| Clang | llvm/lib/Target/MSP430/MSP430ISelDAGToDAG.cpp | 50s | 32s | 5s |
| CPython | Python/compile.c | 34s | 22s | 10s |
| Firefox | js/src/vm/Interpreter.cpp | 148s | 145s | 4s |
| HHVM | hphp/runtime/ext/fileinfo/libmagic/apprentice.cpp | 94s | 65s | 1s |
| Linux | drivers/media/usb/gspca/topro.c | 190s | 56s | 21s |
| PHP | ext/opcache/jit/dynasm/minilua.c | 53s | 22s | 21s |
| Ruby | enc/unicode.c | 55s | 49s | 10s |
| taskflow | unittests/algorithms.cpp | 30s | 19s | 16s |
| taskflow | unittests/control_flow.cpp | 47s | 29s | 31s |
| taskflow | unittests/pipelines.cpp | 54s | 43s | 23s |
| taskflow | unittests/scalable_pipelines.cpp | 43s | 24s | 21s |
IIUC graph-trim-interval is about saving memory, rather than saving CPU time (even though those might be a bit related). Did you perhaps measure the impact on memory usage as well, or only speed?
I did not measure memory impact as we are usually less concerned with memory footprint as long as it is reasonable, and I can only say the footprint remained reasonable.
> I did not measure memory impact as we are usually less concerned with memory footprint as long as it is reasonable, and I can only say the footprint remained reasonable.
Ok, thanks.
@steakhal Could you send me the example files where you observed the large slowdowns when you disabled the node reclamation? We are planning that somebody from my team will start working on analyzing them (probably not me, but I'll contribute if needed).
By the way, what was the average performance impact of this change? I see that there were some translation units with +100% or more runtime; but what level of slowdown did you observe on whole projects or if you take an average over all the projects that you analyzed?
> @steakhal Could you send me the example files where you observed the large slowdowns when you disabled the node reclamation? We are planning that somebody from my team will start working on analyzing them (probably not me, but I'll contribute if needed).
I'll prepare the files and try to attach them to this issue.
> By the way, what was the average performance impact of this change? I see that there were some translation units with +100% or more runtime; but what level of slowdown did you observe on whole projects or if you take an average over all the projects that you analyzed?
Across all TUs of all analyzed projects, on average there was no noticeable impact on the analysis time. The standard deviation of the impact is ~4s.
Here are several runs of the same 12 most impacted projects (b2b is the run-to-run difference in CSA times without any changes in the analyzer):
| Project | b2b | 466-642 | 616-583 | 462-634 | 561-496 |
|---|---|---|---|---|---|
| FlatBuffers | +65.33ms | +287.90ms | +471.71ms | -1069ms | -1858ms |
| bash-linaro-aarch64 | +336.80ms | +382.42ms | +1461.31ms | -107ms | -52ms |
| ccache | -170.31ms | +339.15ms | +498.57ms | +20ms | +305ms |
| fmt | -1671.79ms | +1219.31ms | +18097.66ms | -820ms | +3391ms |
| libmicrohttpd | -5.60ms | +133.42ms | +127.73ms | +138ms | +74ms |
| libuv | -39.41ms | +450.00ms | +112.68ms | -12ms | +100ms |
| nnn | -1268.00ms | +5910.00ms | +4063.00ms | -4867ms | +2144ms |
| rocksdb | +143.71ms | +391.11ms | +919.06ms | +408ms | +850ms |
| taskflow | -5027.56ms | +6565.17ms | +2718.92ms | +13398ms | +12486ms |
| tdengine | +59.11ms | +307.10ms | +373.49ms | +1840ms | -83ms |
| tesseract | +105.49ms | +4079.23ms | +115.40ms | +118ms | -744ms |
| zabbix | +65.86ms | +499.26ms | +1426.33ms | -81ms | +279ms |
As you can see, it is difficult to make reliable conclusions out of such noisy data. However, the few most impacted TUs consistently demonstrate a regression.
> @steakhal Could you send me the example files where you observed the large slowdowns when you disabled the node reclamation?
Here you go: repros.tar.gz. In the tarball you will find 3 TUs that consistently run slower with node reclamation disabled, as well as a "run-clang.sh" script that demonstrates the regression in analysis time.
Here are the analysis times on my machine with clang-18:
For me the repros.tar.gz also demonstrates the issue, on llvmorg-19.1.1:
Here is what the FlameGraph looks like without reclamation for ruby-unicode.c:
And here is what it looks like with node reclamation:
Notice that the frames containing the word EnvironmentEntry are highlighted with purple on both charts. This is a sign that the environment's immutable tree grows huge without reclamation, probably wasting a lot of time doing lookups and insertions (and tree canonicalizations).
We also spend a HUGE amount of time in frames containing BindingKey, which means we also suffer in the Store for the same reason (with and without node reclamation), so it makes sense to look at these immutable trees anyway, eventually, but that part doesn't seem to get much worse by disabling node reclamation.
Interestingly, clang spends 82-88% of its time within frames matching EnvironmentEntry|BindingKey in this example.
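For context, both the Environment and the Store are built on functional AVL trees (llvm::ImmutableMap and friends), where every added binding allocates and canonicalizes new tree nodes and every lookup is a tree walk. Here is a standalone sketch of why these trees dominate the profile once they get huge; the toy key/value types below are made up, the real maps go from EnvironmentEntry or BindingKey to SVal:

```cpp
#include "llvm/ADT/ImmutableMap.h"
#include <cstdio>

// Toy stand-in for the analyzer's environment/store maps.
using ToyMap = llvm::ImmutableMap<unsigned, unsigned>;

int main() {
  ToyMap::Factory F;            // owns and uniques the shared tree nodes
  ToyMap Map = F.getEmptyMap();

  // Each add() produces a new, canonicalized tree that shares structure with
  // the previous one; O(log n) nodes are created per binding.
  for (unsigned I = 0; I < 100000; ++I)
    Map = F.add(Map, I, 2 * I);

  // Lookups are O(log n) tree walks. With very large maps, these walks plus
  // the canonicalization above are what show up as the EnvironmentEntry /
  // BindingKey frames in the flame graphs.
  if (const unsigned *V = Map.lookup(42))
    std::printf("42 -> %u\n", *V);
  return 0;
}
```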
To get FlameGraphs, do this:
perf record -F99 -a -g -- clang -cc1 <command>...
perf script | ~/git/FlameGraph/stackcollapse-perf.pl > perf.folded
~/git/FlameGraph/flamegraph.pl perf.folded > with-reclamation.svg
firefox with-reclamation.svg
Thanks for uploading repros.tar.gz!
I tried to start experimenting with it (on an up-to-date main clang), but interestingly I ran into an unrelated crash on ruby-unicode.c: https://github.com/ericsson/llvm-project/commit/b1e4ddbfad5baa1ba243a38653aa8c39715371b0 .
Edit: I created a github issue to track this crash: https://github.com/llvm/llvm-project/issues/111147 . I briefly looked at the code of BoolAssignmentChecker.cpp and the assertion itself and I didn't spot any obvious cause.
Hi Everyone! Today I talked with @NagyDonat about this issue and I have started investigating it.
@steakhal @haoNoQ @Xazax-hun What would you think about making node reclamation off-by-default as a temporary countermeasure to avoid this crash?
I hope that with @tigbr we can find a solution that removes the reclamation and ensures good performance on all translation units, but while we're working on that, it would be good to ensure that this crash doesn't trouble the users of main clang -- temporarily stomaching a measurable, but not highly visible, slowdown is acceptable IMO in the meantime.
I can stand by disabling node reclamation given that our measurements were correct, so I'd like someone else to repeat the measurements at some scale to gain confidence that our decision is based on real datapoints. Those measurements should also include ffmpeg (as a project showcasing the slowdown), and should confirm that even in the worst case the translation unit analysis times behave as we described.
Assuming the observed numbers align with the expectation, we can turn it off by default.
Note that this is not just about performance, but about memory. If the change in memory consumption is significant, that might be the difference between whether some users can do -j 12 or only -j 8 before running out of memory, which would be a huge difference in throughput. I don't think we can make this decision without looking at the memory consumption.
I created PR #111843 for the temporary workaround that I suggested, let's continue discussion about it there. (Obviously I won't merge it until there's a consensus about it, but let's leave this ticket for discussing the permanent solution.)
Note that #111843 was closed without action since "It turns out that this commit increases the runtime by ~3% on a set of 6-8 open source projects that we used as a benchmark", so that wasn't a viable way forward.
So what is the actual consequence when we hit the assert (if we disable the assert)? I thought it would indicate that the domtree is broken in some way(?). Could it lead to crashes? Could it lead to false/missing warnings from the static analyzer? Is it indicating a secure-coding problem?
Maybe you need to implement a special version of the domtree ("FlawedGenDomTree") that doesn't implement the assert, and use that for the static analyzer, if you think it is safe to keep executing beyond that assert there?
llvm commit: 4f075086e7b8d
Reproduce with:
clang -cc1 -analyze -analyzer-checker=core bbi-98084.c
Result:
This starts happening with 5ce47a581350 and I originally reported it here https://github.com/llvm/llvm-project/pull/102782#issuecomment-2291055153 but that didn't really lead anywhere, so now I'm also writing a proper issue so that the problem isn't hidden only in that merged PR.
bbi-98084.c.gz