Closed nathanchance closed 1 year ago
@llvm/issue-subscribers-bolt
Thanks for sending the detailed repro. On the surface indeed it looks similar to the other instrumentation bug (https://github.com/llvm/llvm-project/issues/53994).
@nathanchance Can you please clarify what OS and versions of clang/lld were used here? We couldn't reproduce the assertion but that might be due to some system differences.
@aaupov I believe I would have reproduced this on Arch Linux, as that is my primary distribution, which currently has the following versions:
$ /usr/bin/clang --version
clang version 13.0.1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ /usr/bin/ld.lld --version
LLD 13.0.1 (compatible with GNU linkers)
@nathanchance Thanks! Was able to repro the bug now with Ubuntu 20.04 and clang/lld 13.
Hi @aaupov, we're trying to use BOLT for optimizing the Rust compiler and we're hitting similar issues when trying to BOLTify LLVM. I wonder if there are any ongoing activities/investigation regarding the PGO clang/llvm crash? Thanks!
Will put up the fix soon. Let’s see if it solves your problem too.
Any updates? I tried it with 15.0.0-rc1, but unfortunately the instrumented libLLVM.so still segfaults for us.
I was able to repro and find the root cause. The symbol that represents the end of a table in .rodata is being colocated with the start of a jump table from another function, and BOLT moves that jump table. This causes the symbol representing the end of the table to be moved as well. The new location is a few MB away in distance, significantly increasing the size of this table as perceived by the application. The application (clang) then crashes scanning values in the table -- because it has the wrong end-of-table address, the loop that scans this table goes out of bounds until it reaches an unmapped address in memory and then segfaults.
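To make the failure mode above concrete, here is a toy model in Python. It is only a sketch: the symbol names, addresses, and the `relocate_by_address` helper are hypothetical and do not come from BOLT or clang. It assumes a relocation step that matches symbols purely by address, so the end-of-table symbol gets dragged along when the jump table sharing its address is moved.

```python
# Toy model of the colocated-symbol problem. All names and addresses here are
# hypothetical illustrations, not taken from BOLT or clang.
ENTRY_SIZE = 8  # assumed size of one table entry, in bytes


def relocate_by_address(symbols, old_addr, new_addr):
    """Move every symbol located at old_addr to new_addr.

    Matching by address alone cannot tell the jump table apart from an
    unrelated symbol that happens to sit at the same address.
    """
    return {
        name: (new_addr if addr == old_addr else addr)
        for name, addr in symbols.items()
    }


def perceived_entries(symbols):
    """Number of entries the application believes the table has."""
    return (symbols["table_end"] - symbols["table_start"]) // ENTRY_SIZE


symbols = {
    "table_start": 0x1000,
    "table_end": 0x1020,   # one past the last entry of a 4-entry table
    "jump_table": 0x1020,  # another function's jump table starts right here
}

before = perceived_entries(symbols)

# The jump table is moved a few MB away; table_end moves with it.
moved = relocate_by_address(symbols, old_addr=0x1020, new_addr=0x400000)
after = perceived_entries(moved)

print(f"entries before: {before}, entries after: {after}")
```

In this model the table start stays put while its end marker jumps megabytes away, so a loop scanning from start to end would walk far past the real data, which matches the out-of-bounds scan and segfault described above.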
We're working on a fix.
Please backport to 15.0.x.
/cherry-pick 4f158995b9cddae392bfb5989af8c83101ae0789
/cherry-pick 4f158995b9cddae392bfb5989af8c83101ae0789
Error: Command failed due to missing milestone.
Please backport to 15.0.x.
Is there a plan for a 15.0.7 release? Otherwise, I think it's too late. 15.0.6 may have been the final 15.0.X release.
I don't understand. Will there be no clang maintenance releases until 16.0.0? That's almost a year away.
> That's almost a year away.
I do not think that is far away. The release documentation states release/16.x should be cut January 24th and the final release should be six weeks after that. Even accounting for an extra month and a half of delays for some reason, that is still just four months away.
You could always ask your LLVM distributor to cherry pick this patch if you are not building it yourself.
Ah, I assumed a yearly cycle, perhaps I confused it with gcc.
Still, it would be good to keep at least the latest release supported rather than leave it dead for 4.5 months.
Of course I can ask Fedora to backport the patch, but it's much nicer if the experts decide which patch merits backports, and the entire community benefits.
I am attempting to wire up BOLT support into our toolchain build script. However, when clang is compiled with profile guided optimization, it crashes after it has been instrumented with BOLT. I noticed this when building scripts/dtc/srcpos.c in the Linux kernel, which I reduced below. I see #53994 but the crash is different so I figured I would report it and let someone else mark it as a duplicate.
I was able to reproduce at bff8356b1969d2edd02e22c73d1c3d386f862937 with the following steps on two different x86_64 machines. If assertions are enabled (-DLLVM_ENABLE_ASSERTIONS=ON on all stages), there is no crash.
.prof file from raw profiles
clang binary with llvm-bolt
clang will crash but the original will not.
srcpos.i: