[libLTO, IPGO] Slowdown after b1554fe

omern1 commented 3 months ago

While building a very large commonly used C++ game engine with -flto=full and -fprofile-instr-generate we're seeing a very large slowdown in IR link times, causing the total LTO build time to go from 40 minutes to over 2 hours. Unfortunately I'm unable to share the source for this codebase.

A bisection of the issue pointed to commit b1554fe. CC @igorkudrin.

The patch causes the size of the RAUWWorklist in the IRLinker to increase considerably and a large number of the entries are descriptors for IPGO counters (@__profd_*) which are referenced by the @llvm.compiler.used global array, in the case of this program the @llvm.compiler.used list grows to 400,000+ elements. Each time RAUW is called for one one of the descriptors we end up spending a lot of time iterating over all of the elements of the @llvm.compiler.used list to generate a key, and RAUW is called many times:

[Inline Frame] llvmLTO.dll!llvm::ConstantAggrKeyType<llvm::ConstantArray>::ConstantAggrKeyType(const llvm::ConstantArray * C, llvm::SmallVectorImpl<llvm::Constant *> & Storage)
llvmLTO.dll!llvm::ConstantUniqueMap<llvm::ConstantArray>::MapInfo::getHashValue(const llvm::ConstantArray * CP)     
llvmLTO.dll!llvm::DenseMapBase<llvm::DenseMap<llvm::ConstantArray *,llvm::detail::DenseSetEmpty,llvm::ConstantUniqueMap<llvm::ConstantArray>::MapInfo,llvm::detail::DenseSetPair<llvm::ConstantArray *>>,llvm::ConstantArray *,llvm::detail::DenseSetEmpty,llvm::ConstantUniqueMap<llvm::ConstantArray>::MapInfo,llvm::detail::DenseSetPair<llvm::ConstantArray *>>::LookupBucketFor<const llvm::ConstantArray *>(const llvm::ConstantArray * const & Val, const llvm::detail::DenseSetPair<llvm::ConstantArray *> * & FoundBucket)     
llvmLTO.dll!llvm::DenseMapBase<llvm::DenseMap<llvm::ConstantArray *,llvm::detail::DenseSetEmpty,llvm::ConstantUniqueMap<llvm::ConstantArray>::MapInfo,llvm::detail::DenseSetPair<llvm::ConstantArray *>>,llvm::ConstantArray *,llvm::detail::DenseSetEmpty,llvm::ConstantUniqueMap<llvm::ConstantArray>::MapInfo,llvm::detail::DenseSetPair<llvm::ConstantArray *>>::LookupBucketFor<const llvm::ConstantArray *>(const llvm::ConstantArray * const & Val, llvm::detail::DenseSetPair<llvm::ConstantArray *> * & FoundBucket)   
llvmLTO.dll!llvm::DenseMapBase<llvm::DenseMap<llvm::ConstantArray *,llvm::detail::DenseSetEmpty,llvm::ConstantUniqueMap<llvm::ConstantArray>::MapInfo,llvm::detail::DenseSetPair<llvm::ConstantArray *>>,llvm::ConstantArray *,llvm::detail::DenseSetEmpty,llvm::ConstantUniqueMap<llvm::ConstantArray>::MapInfo,llvm::detail::DenseSetPair<llvm::ConstantArray *>>::find(const llvm::ConstantArray * Val)  
[Inlined Frame] llvmLTO.dll!llvm::detail::DenseSetImpl<llvm::ConstantArray *,llvm::DenseMap<llvm::ConstantArray *,llvm::detail::DenseSetEmpty,llvm::ConstantUniqueMap<llvm::ConstantArray>::MapInfo,llvm::detail::DenseSetPair<llvm::ConstantArray *>>,llvm::ConstantUniqueMap<llvm::ConstantArray>::MapInfo>::find(const llvm::ConstantArray * V)  
llvmLTO.dll!llvm::ConstantUniqueMap<llvm::ConstantArray>::remove(llvm::ConstantArray * CP)  
llvmLTO.dll!llvm::ConstantUniqueMap<llvm::ConstantArray>::replaceOperandsInPlace(llvm::ArrayRef<llvm::Constant *> Operands, llvm::ConstantArray * CP, llvm::Value * From, llvm::Constant * To, unsigned int NumUpdated, unsigned int OperandNo)     
llvmLTO.dll!llvm::ConstantArray::handleOperandChangeImpl(llvm::Value * From, llvm::Value * To)  
llvmLTO.dll!llvm::Constant::handleOperandChange(llvm::Value * From, llvm::Value * To)   
llvmLTO.dll!llvm::Value::doRAUW(llvm::Value * New, llvm::Value::ReplaceMetadataUses ReplaceMetaUses)    
[Inlined Frame] llvmLTO.dll!`anonymous namespace'::IRLinker::flushRAUWWorklist()

The following is a flamegraph produced by running llvm-link on the bitcode, hopefully it makes the situation clearer: broken

igorkudrin commented 3 months ago

It doesn't seem to me that #69143 by itself produces additional entries in @llvm.compiler.used or prevents them from being removed, but, it is possible that it provokes some optimization pass to do so. Maybe you can find an isolated example to show the difference?

ConstantArray::handleOperandChangeImpl() scans all entries of the array on each operation, so the bigger the list the longer it takes to update it. As far as I can remember, -fprofile-instr-generate creates at least one entry in this list for each performance counter, so for a large application it is expected that the array will be huge when all modules are linked together for FullLTO.

Have you considered using ThinLTO for the run that collects the profile, and only running FullLTO in the final stage?

Another option might be to somehow optimize ConstantArray to handle such huge lists smoothly.

omern1 commented 2 months ago

Thanks for looking at this @igorkudrin.

It doesn't seem to me that #69143 by itself produces additional entries in @llvm.compiler.used or prevents them from being removed, but, it is possible that it provokes some optimization pass to do so. Maybe you can find an isolated example to show the difference?

Another option might be to somehow optimize ConstantArray to handle such huge lists smoothly.

I'll clarify, from my analysis it appears that the patch causes a lot more elements to be added to the RAUW worklist in the IR linker and a lot of them are @__profd_* globals which are referenced by @llvm.compiler.used, because of that each time there's a call to RAUW we end up in llvm::ConstantArray::handleOperandChangeImpl() which for the @llvm.compiler.used list is very expensive (due to the number of elements being extremely large).

I only have a cursory understanding of how the IRLinker oprates though so I don't completely understand why @__profd_* globals which are marked available_externally need to be linked-in (which is how they end up in the RAUW worklist) rather than directly being copied in. This happens in copyGlobalValueProto in the IRLinker.

Optimizing ConstantArray is probably worth it and I'm looking into that but I pinged you because I was wondering if there's a way to prevent the above situation from arising in the first place.

I'll come up with a reproducer shortly.

Have you considered using ThinLTO for the run that collects the profile, and only running FullLTO in the final stage?

This is certainly a good workaround but I'm still interested in preventing the situation from arising if we can.

omern1 commented 2 months ago

I've now managed to reproduce this with LLVM, I built LLVM with:

-DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=all -DLLVM_USE_LINKER=lld -DCMAKE_C_FLAGS='-fprofile-instr-generate -flto=full' -DCMAKE_CXX_FLAGS='-fprofile-instr-generate -flto=full' -DCMAKE_EXE_LINKER_FLAGS='-Wl,--reproduce=repro.tar -fuse-ld=lld'

Then extracted the tarball and executed llvm-link on the lib directory extracted from the tarball. llvm-link has been running for the past 10 hours on my machine with b1554fe applied. Without the patch it finishes in 9 minutes.

llvm / llvm-project

[libLTO, IPGO] Slowdown after b1554fe #105836