Open omern1 opened 3 months ago
It doesn't seem to me that #69143 by itself produces additional entries in @llvm.compiler.used or prevents them from being removed, but it is possible that it provokes some optimization pass to do so. Maybe you can find an isolated example to show the difference?
ConstantArray::handleOperandChangeImpl() scans all entries of the array on each operation, so the bigger the list, the longer it takes to update it. As far as I can remember, -fprofile-instr-generate creates at least one entry in this list for each performance counter, so for a large application it is expected that the array will be huge when all modules are linked together for FullLTO.
Have you considered using ThinLTO for the run that collects the profile, and only running FullLTO in the final stage?
Another option might be to somehow optimize ConstantArray to handle such huge lists smoothly.
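To illustrate why a full scan per operation matters at this scale, here is a minimal C++ sketch of the cost model. It is a toy, not LLVM code: `ToyConstantArray`, `replaceOperand` and `simulateRAUW` are hypothetical names, operands are plain integers, and the real handleOperandChangeImpl() does more than a linear walk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a large constant array such as @llvm.compiler.used.
// Operands are modelled as integer ids; this is NOT LLVM's ConstantArray.
struct ToyConstantArray {
  std::vector<int> Operands;
  std::uint64_t OperandVisits = 0; // total operands scanned so far

  // Replace every use of From with To. As in the behaviour described in
  // this thread, each call walks the whole operand list.
  void replaceOperand(int From, int To) {
    for (int &Op : Operands) {
      ++OperandVisits;
      if (Op == From)
        Op = To;
    }
  }
};

// Simulate NumReplacements RAUW-style calls against an array holding
// NumOperands entries, and return the total number of operand visits.
std::uint64_t simulateRAUW(std::size_t NumOperands,
                           std::size_t NumReplacements) {
  // Assumption of this sketch: each replaced entry is replaced once.
  assert(NumReplacements <= NumOperands);

  ToyConstantArray Used;
  for (std::size_t I = 0; I < NumOperands; ++I)
    Used.Operands.push_back(static_cast<int>(I));

  for (std::size_t I = 0; I < NumReplacements; ++I)
    Used.replaceOperand(static_cast<int>(I),
                        static_cast<int>(NumOperands + I));

  return Used.OperandVisits;
}
```

In this model the cost is NumOperands × NumReplacements visits. If every entry of a 400,000-element list (the size reported in this issue) were replaced once, that would be 1.6 × 10^11 visits, so the total work grows quadratically with the size of the list.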
Thanks for looking at this @igorkudrin.
> It doesn't seem to me that #69143 by itself produces additional entries in @llvm.compiler.used or prevents them from being removed, but, it is possible that it provokes some optimization pass to do so. Maybe you can find an isolated example to show the difference?
> Another option might be to somehow optimize ConstantArray to handle such huge lists smoothly.
To clarify: from my analysis, it appears that the patch causes many more elements to be added to the RAUW worklist in the IR linker, and a lot of them are @__profd_* globals, which are referenced by @llvm.compiler.used. Because of that, each call to RAUW ends up in llvm::ConstantArray::handleOperandChangeImpl(), which is very expensive for the @llvm.compiler.used list because of its extremely large number of elements.
I only have a cursory understanding of how the IRLinker operates, though, so I don't completely understand why @__profd_* globals that are marked available_externally need to be linked in (which is how they end up in the RAUW worklist) rather than being copied in directly. This happens in copyGlobalValueProto in the IRLinker.
Optimizing ConstantArray is probably worth it, and I'm looking into that, but I pinged you because I was wondering whether there's a way to prevent the above situation from arising in the first place.
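As a rough illustration of one general direction such an optimization could take, the sketch below applies many replacements in a single pass over the operand list instead of one pass per replacement. This is only an assumption about what "optimizing" might look like, not a description of any existing or planned LLVM change; `replaceOperandsBatched` is a hypothetical helper and operands are again modelled as plain integers.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: apply a whole set of (old -> new) replacements in
// one walk over the operand list. Returns the number of operands visited.
std::uint64_t
replaceOperandsBatched(std::vector<int> &Operands,
                       const std::unordered_map<int, int> &Replacements) {
  // Limitation of this sketch: chained replacements (a -> b and b -> c)
  // are not resolved, so require that no new value is itself replaced.
  for (const std::pair<const int, int> &KV : Replacements)
    assert(Replacements.count(KV.second) == 0);

  std::uint64_t Visits = 0;
  for (int &Op : Operands) {
    ++Visits;
    auto It = Replacements.find(Op); // average O(1) hash lookup
    if (It != Replacements.end())
      Op = It->second;
  }
  return Visits;
}
```

With this shape, the operand list is walked once regardless of how many replacements are pending, at the price of having to collect the replacements before applying them.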
I'll come up with a reproducer shortly.
> Have you considered using ThinLTO for the run that collects the profile, and only running FullLTO in the final stage?
This is certainly a good workaround, but I'm still interested in preventing the situation from arising if we can.
I've now managed to reproduce this with LLVM. I built LLVM with:
-DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=all -DLLVM_USE_LINKER=lld -DCMAKE_C_FLAGS='-fprofile-instr-generate -flto=full' -DCMAKE_CXX_FLAGS='-fprofile-instr-generate -flto=full' -DCMAKE_EXE_LINKER_FLAGS='-Wl,--reproduce=repro.tar -fuse-ld=lld'
Then I extracted the tarball and executed llvm-link on the lib directory extracted from the tarball. llvm-link has been running for the past 10 hours on my machine with b1554fe applied. Without the patch, it finishes in 9 minutes.
While building a very large, commonly used C++ game engine with -flto=full and -fprofile-instr-generate, we're seeing a very large slowdown in IR link times, causing the total LTO build time to go from 40 minutes to over 2 hours. Unfortunately, I'm unable to share the source for this codebase.
A bisection of the issue pointed to commit b1554fe. CC @igorkudrin.
The patch causes the size of the RAUWWorklist in the IRLinker to increase considerably, and a large number of the entries are descriptors for IPGO counters (@__profd_*), which are referenced by the @llvm.compiler.used global array. In the case of this program, the @llvm.compiler.used list grows to 400,000+ elements. Each time RAUW is called for one of the descriptors, we end up spending a lot of time iterating over all of the elements of the @llvm.compiler.used list to generate a key, and RAUW is called many times:
The following is a flamegraph produced by running llvm-link on the bitcode; hopefully it makes the situation clearer: