[AMDGPU] Add AMDGPU-specific module splitting

Pierre-vh commented 6 months ago

(See #83128 to review first commit)

This enables the --lto-partitions option to work more consistently.

This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of them to split modules in a way that avoids compilation issue (such as resource usage being incorrectly represented).

This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking informatiton about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen.

llvmbot commented 6 months ago

@llvm/pr-subscribers-lto

Author: Pierre van Houtryve (Pierre-vh)

Changes

(See #83128 to review first commit) This enables the --lto-partitions option to work more consistently. This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of them to split modules in a way that avoids compilation issue (such as resource usage being incorrectly represented). This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking informatiton about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen. --- Patch is 59.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/89245.diff 22 Files Affected: - (modified) llvm/include/llvm/Target/TargetMachine.h (+12) - (modified) llvm/lib/LTO/LTOBackend.cpp (+9-4) - (added) llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp (+733) - (added) llvm/lib/Target/AMDGPU/AMDGPUSplitModule.h (+30) - (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+8) - (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h (+4) - (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1) - (added) llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize-with-call.ll (+46) - (added) llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize.ll (+37) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-cost-ranking.ll (+54) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependencies.ll (+50) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-duplication.ll (+41) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-external.ll (+43) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-indirect.ll (+69) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables-noexternal.ll (+42) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables.ll (+44) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-load-balancing.ll (+75) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-no-dependencies.ll (+39) - (added) llvm/test/tools/llvm-split/AMDGPU/large-kernels-merging.ll (+98) - (added) llvm/test/tools/llvm-split/AMDGPU/lit.local.cfg (+2) - (modified) llvm/tools/llvm-split/CMakeLists.txt (+7) - (modified) llvm/tools/llvm-split/llvm-split.cpp (+76-25) ``````````diff diff --git a/llvm/include/llvm/Target/TargetMachine.h b/llvm/include/llvm/Target/TargetMachine.h index ceb371bdc73480..48ea3cfe02775b 100644 --- a/llvm/include/llvm/Target/TargetMachine.h +++ b/llvm/include/llvm/Target/TargetMachine.h @@ -418,6 +418,18 @@ class TargetMachine { virtual unsigned getAddressSpaceForPseudoSourceKind(unsigned Kind) const { return 0; } + + /// Entry point for module splitting. Targets can implement custom module + /// splitting logic, mainly used by LTO for --lto-partitions. + /// + /// \returns `true` if the module was split, `false` otherwise. When `false` + /// is returned, it is assumed that \p ModuleCallback has never been called + /// and \p M has not been modified. + virtual bool splitModule( + Module &M, unsigned NumParts, + function_ref MPart)> ModuleCallback) const { + return false; + } }; /// This class describes a target machine that is implemented with the LLVM diff --git a/llvm/lib/LTO/LTOBackend.cpp b/llvm/lib/LTO/LTOBackend.cpp index 71e8849dc3cc91..d4b89ede2d7134 100644 --- a/llvm/lib/LTO/LTOBackend.cpp +++ b/llvm/lib/LTO/LTOBackend.cpp @@ -436,8 +436,7 @@ static void splitCodeGen(const Config &C, TargetMachine *TM, unsigned ThreadCount = 0; const Target *T = &TM->getTarget(); - SplitModule( - Mod, ParallelCodeGenParallelismLevel, + const auto HandleModulePartition = [&](std::unique_ptr MPart) { // We want to clone the module in a new context to multi-thread the // codegen. We do it by serializing partition modules to bitcode @@ -469,8 +468,14 @@ static void splitCodeGen(const Config &C, TargetMachine *TM, // Pass BC using std::move to ensure that it get moved rather than // copied into the thread's context. std::move(BC), ThreadCount++); - }, - false); + }; + + // Try target-specific module splitting first, then fallback to the default. + if (!TM->splitModule(Mod, ParallelCodeGenParallelismLevel, + HandleModulePartition)) { + SplitModule(Mod, ParallelCodeGenParallelismLevel, HandleModulePartition, + false); + } // Because the inner lambda (which runs in a worker thread) captures our local // variables, we need to wait for the worker threads to terminate before we diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp new file mode 100644 index 00000000000000..fa47d494f04148 --- /dev/null +++ b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp @@ -0,0 +1,733 @@ +//===- AMDGPUSplitModule.cpp ----------------------------------------------===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// +// +/// \file Implements a module splitting algorithm designed to support the +/// FullLTO --lto-partitions option for parallel codegen. This is completely +/// different from the common SplitModule pass, as this system is designed with +/// AMDGPU in mind. +/// +/// The basic idea of this module splitting implementation is the same as +/// SplitModule: load-balance the module's functions across a set of N +/// partitions to allow parallel codegen. However, it does it very +/// differently than the target-agnostic variant: +/// - Kernels are used as the module's "roots". +/// They're known entry points on AMDGPU, and everything else is often +/// internal only. +/// - Each kernel has a set of dependencies, and when a kernel and its +/// dependencies is considered "big", we try to put it in a partition where +/// most dependencies are already imported, to avoid duplicating large +/// amounts of code. +/// - There's special care for indirect calls in order to ensure +/// AMDGPUResourceUsageAnalysis can work correctly. +/// +/// This file also includes a more elaborate logging system to enable +/// users to easily generate logs that (if desired) do not include any value +/// names, in order to not leak information about the source file. +/// Such logs are very helpful to understand and fix potential issues with +/// module splitting. +// +//===----------------------------------------------------------------------===// + +#include "AMDGPUSplitModule.h" +#include "AMDGPUTargetMachine.h" +#include "Utils/AMDGPUBaseInfo.h" +#include "llvm/ADT/DenseMap.h" +#include "llvm/ADT/SmallVector.h" +#include "llvm/ADT/StringExtras.h" +#include "llvm/ADT/StringRef.h" +#include "llvm/Analysis/CallGraph.h" +#include "llvm/Analysis/TargetTransformInfo.h" +#include "llvm/IR/Function.h" +#include "llvm/IR/Instruction.h" +#include "llvm/IR/Module.h" +#include "llvm/IR/User.h" +#include "llvm/IR/Value.h" +#include "llvm/Support/Casting.h" +#include "llvm/Support/Debug.h" +#include "llvm/Support/FileSystem.h" +#include "llvm/Support/Process.h" +#include "llvm/Support/SHA256.h" +#include "llvm/Support/raw_ostream.h" +#include "llvm/Transforms/Utils/Cloning.h" +#include +#include +#include +#include +#include +#include + +using namespace llvm; + +#define DEBUG_TYPE "amdgpu-split-module" + +namespace { + +static cl::opt LargeKernelFactor( + "amdgpu-module-splitting-large-kernel-threshold", cl::init(2.0), cl::Hidden, + cl::desc( + "consider a kernel as large and needing special treatment when it " + "exceeds the average cost of a partition by this factor; e;g. 2.0 " + "means if the kernel and its dependencies is 2 times bigger than " + "an average partition; 0 disables large kernels handling entirely")); + +static cl::opt LargeKernelOverlapForMerge( + "amdgpu-module-splitting-large-kernel-merge-overlap", cl::init(0.8), + cl::Hidden, + cl::desc("defines how much overlap between two large kernel's dependencies " + "is needed to put them in the same partition")); + +static cl::opt NoExternalizeGlobals( + "amdgpu-module-splitting-no-externalize-globals", cl::Hidden, + cl::desc("disables externalization of global variable with local linkage; " + "may cause globals to be duplicated which increases binary size")); + +static cl::opt + LogDirOpt("amdgpu-module-splitting-log-dir", cl::Hidden, + cl::desc("output directory for AMDGPU module splitting logs")); + +static cl::opt + LogPrivate("amdgpu-module-splitting-log-private", cl::Hidden, + cl::desc("hash value names before printing them in the AMDGPU " + "module splitting logs")); + +using CostType = InstructionCost::CostType; +using PartitionID = unsigned; + +static std::string getName(const Value &V) { + static std::optional HideNames; + if (!HideNames) { + if (LogPrivate.getNumOccurrences()) + HideNames = LogPrivate; + else { + const auto EV = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_PRIVATE"); + HideNames = (EV.value_or("0") != "0"); + } + } + + if (!*HideNames) + return V.getName().str(); + return toHex(SHA256::hash(arrayRefFromStringRef(V.getName())), + /*LowerCase*/ true); +} + +/// Main logging helper. +/// +/// Logging can be configured by the following environment variable. +/// AMD_SPLIT_MODULE_LOG_DIR= +/// If set, uses as the directory to write logfiles to +/// each time module splitting is used. +/// AMD_SPLIT_MODULE_LOG_PRIVATE +/// If set to anything other than zero, all names are hidden. +/// +/// Both environment variables have corresponding CL options which +/// takes priority over them. +/// +/// Any output printed to the log files is also printed to dbgs() when -debug is +/// used and LLVM_DEBUG is defined. +/// +/// This approach has a small disadvantage over LLVM_DEBUG though: logging logic +/// cannot be removed from the code (by building without debug). This probably +/// has a small performance cost because if some computation/formatting is +/// needed for logging purpose, it may be done everytime only to be ignored +/// by the logger. +/// +/// As this pass only runs once and is not doing anything computationally +/// expensive, this is likely a reasonable trade-off. +/// +/// If some computation should really be avoided when unused, users of the class +/// can check whether any logging will occur by using the bool operator. +/// +/// \code +/// if (SML) { +/// // Executes only if logging to a file or if -debug is available and +/// used. +/// } +/// \endcode +class SplitModuleLogger { +public: + SplitModuleLogger(const Module &M) { + std::string LogDir = LogDirOpt; + if (LogDir.empty()) + LogDir = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_DIR").value_or(""); + + // No log dir specified means we don't need to log to a file. + // We may still log to dbgs(), though. + if (LogDir.empty()) + return; + + if (!sys::fs::is_directory(LogDir)) { + report_fatal_error("invalid AMDGPU split module log directory: '" + + Twine(LogDir) + "' is not a directory", + /*CrashDiag=*/false); + } + + // If a log directory is specified, create a new file with a unique name in + // that directory. + SmallString<0> FilePath; + int Fd; + std::string LogFile = (LogDir + "/" + "Module-%%-%%-%%-%%-%%-%%-%%.txt"); + if (auto Err = sys::fs::createUniqueFile(LogFile, Fd, FilePath)) { + dbgs() << LogFile << "\n"; + std::string Msg = + "Failed to create log file at '" + LogDir + "': " + Err.message(); + report_fatal_error(StringRef(Msg), + /*CrashDiag=*/false); + } + + FileOS = std::make_unique(Fd, /*shouldClose*/ true); + } + + bool hasLogFile() const { return FileOS != nullptr; } + + raw_ostream &logfile() { + assert(FileOS && "no logfile!"); + return *FileOS; + } + + /// \returns true if this SML will log anything either to a file or dbgs(). + /// Can be used to avoid expensive computations that are ignored when logging + /// is disabled. + operator bool() const { + return hasLogFile() || (DebugFlag && isCurrentDebugType(DEBUG_TYPE)); + } + +private: + std::unique_ptr FileOS; +}; + +template +static SplitModuleLogger &operator<<(SplitModuleLogger &SML, const Ty &Val) { + static_assert( + !std::is_same_v, + "do not print values to logs directly, use handleName instead!"); + LLVM_DEBUG(dbgs() << Val); + if (SML.hasLogFile()) + SML.logfile() << Val; + return SML; +} + +/// Calculate the cost of each function in \p M +/// \param SML Log Helper +/// \param TM TargetMachine instance used to retrieve TargetTransformInfo. +/// \param M Module to analyze. +/// \param CostMap[out] Resulting Function -> Cost map. +/// \return The module's total cost. +static CostType +calculateFunctionCosts(SplitModuleLogger &SML, const AMDGPUTargetMachine &TM, + Module &M, + DenseMap &CostMap) { + CostType ModuleCost = 0; + CostType KernelCost = 0; + + for (auto &Fn : M) { + if (Fn.isDeclaration()) + continue; + + CostType FnCost = 0; + auto TTI = TM.getTargetTransformInfo(Fn); + + for (auto &BB : Fn) { + for (auto &I : BB) { + auto Cost = + TTI.getInstructionCost(&I, TargetTransformInfo::TCK_CodeSize); + assert(Cost != InstructionCost::getMax()); + // Assume expensive if we can't tell the cost of an instruction. + CostType CostVal = + Cost.getValue().value_or(TargetTransformInfo::TCC_Expensive); + assert((FnCost + CostVal) >= FnCost && "Overflow!"); + FnCost += CostVal; + } + } + + assert(FnCost != 0); + + CostMap[&Fn] = FnCost; + assert((ModuleCost + FnCost) >= ModuleCost && "Overflow!"); + ModuleCost += FnCost; + + if (AMDGPU::isKernelCC(&Fn)) + KernelCost += FnCost; + } + + CostType FnCost = (ModuleCost - KernelCost); + SML << "=> Total Module Cost: " << ModuleCost << "\n" + << " => KernelCost: " << KernelCost << " (" + << format("%0.2f", (float(KernelCost) / ModuleCost) * 100) << "%)\n" + << " => FnsCost: " << FnCost << " (" + << format("%0.2f", (float(FnCost) / ModuleCost) * 100) << "%)\n"; + + return ModuleCost; +} + +/// When a kernel or any of its callees performs an indirect call, this function +/// takes over \ref addAllDependencies and adds all potentially callable +/// functions to \p Fns so they can be counted as dependencies of the kernel. +/// +/// This is needed due to how AMDGPUResourceUsageAnalysis operates: in the +/// presence of an indirect call, the function's resource usage is the same as +/// the most expensive function in the module. +/// \param M The module. +/// \param Fns[out] Resulting list of functions. +static void addAllIndirectCallDependencies(const Module &M, + DenseSet &Fns) { + for (const auto &Fn : M) { + if (!Fn.isDeclaration() && !AMDGPU::isEntryFunctionCC(Fn.getCallingConv())) + Fns.insert(&Fn); + } +} + +/// Adds the functions that \p Fn may call to \p Fns, then recurses into each +/// callee until all reachable functions have been gathered. +/// +/// \param SML Log Helper +/// \param CG Call graph for \p Fn's module. +/// \param Fn Current function to look at. +/// \param Fns[out] Resulting list of functions. +/// \param HadIndirectCall[out] Set to true if an indirect call was seen at some +/// point, either in \p Fn or in one of the function it calls. When that +/// happens, we fall back to adding all callable functions inside \p Fn's module +/// to \p Fns. +/// \param HadExternalCall[out] Set to true if a call to an external function +/// was seen at some point, either in \p Fn or in one of the function it calls +static void addAllDependencies(SplitModuleLogger &SML, const CallGraph &CG, + const Function &Fn, + DenseSet &Fns, + bool &HadIndirectCall, bool &HadExternalCall) { + assert(!Fn.isDeclaration()); + + const Module &M = *Fn.getParent(); + SmallVector WorkList({&Fn}); + while (!WorkList.empty()) { + const auto &CurFn = *WorkList.pop_back_val(); + + // Scan for an indirect call. If such a call is found, we have to + // conservatively assume this can call all non-entrypoint functions in the + // module. + for (const auto &BB : CurFn) { + for (const auto &I : BB) { + const auto *CB = dyn_cast(&I); + if (!CB || !CB->isIndirectCall()) + continue; + + SML << "Indirect call detected in " << getName(CurFn) + << " - treating all non-entrypoint functions as " + "potential dependencies\n"; + + // TODO: Print an ORE as well ? + addAllIndirectCallDependencies(M, Fns); + HadIndirectCall = true; + return; + } + } + + for (auto &CGEntry : *CG[&CurFn]) { + auto *Callee = CGEntry.second->getFunction(); + if (!Callee) + continue; + + assert(!AMDGPU::isKernelCC(Callee)); + + if (Callee->isDeclaration()) + continue; + + if (Callee->hasExternalLinkage()) + HadExternalCall = true; + + auto [It, Inserted] = Fns.insert(Callee); + if (Inserted) + WorkList.push_back(Callee); + } + } +} + +/// Contains information about a kernel and its dependencies. +struct KernelWithDependencies { + KernelWithDependencies(SplitModuleLogger &SML, CallGraph &CG, + const DenseMap &FnCosts, + const Function *Fn) + : Fn(Fn) { + addAllDependencies(SML, CG, *Fn, Dependencies, HasIndirectCall, + HasExternalCall); + TotalCost = FnCosts.at(Fn); + for (const auto *Dep : Dependencies) + TotalCost += FnCosts.at(Dep); + } + + const Function *Fn = nullptr; + DenseSet Dependencies; + /// Whether \p Fn or any of its \ref Dependencies contains an indirect call. + bool HasIndirectCall = false; + /// Whether \p Fn or any of its \ref Dependencies contains a call to a + /// function with external linkage. + bool HasExternalCall = false; + + CostType TotalCost = 0; + + /// \returns true if this kernel and its dependencies can be considered large + /// according to \p Threshold. + bool isLarge(CostType Threshold) const { + return TotalCost > Threshold && !Dependencies.empty(); + } +}; + +/// Calculates how much overlap there is between \p A and \p B. +/// \return A number between 0.0 and 1.0, where 1.0 means A == B and 0.0 means A +/// and B have no shared elements. Kernels do not count in overlap calculation. +static float calculateOverlap(const DenseSet &A, + const DenseSet &B) { + DenseSet Total; + for (const auto *F : A) { + if (!AMDGPU::isKernelCC(F)) + Total.insert(F); + } + + if (Total.empty()) + return 0.0f; + + unsigned NumCommon = 0; + for (const auto *F : B) { + if (AMDGPU::isKernelCC(F)) + continue; + + auto [It, Inserted] = Total.insert(F); + if (!Inserted) + ++NumCommon; + } + + return float(NumCommon) / Total.size(); +} + +/// Performs all of the partitioning work on \p M. +/// \param SML Log Helper +/// \param M Module to partition. +/// \param NumParts Number of partitions to create. +/// \param ModuleCost Total cost of all functions in \p M. +/// \param FnCosts Map of Function -> Cost +/// \param WorkList Kernels and their dependencies to process in order. +/// \returns The created partitions (a vector of size \p NumParts ) +static std::vector> +doPartitioning(SplitModuleLogger &SML, Module &M, unsigned NumParts, + CostType ModuleCost, + const DenseMap &FnCosts, + const SmallVector &WorkList) { + + SML << "\n--Partitioning Starts--\n"; + + // Calculate a "large kernel threshold". When more than one kernel's total + // import cost exceeds this value, we will try to merge it with other, + // similarly large kernels. + // + // e.g. let two kernels X and Y have a import cost of ~10% of the module, we + // assign X to a partition as usual, but when we get to Y, we check if it's + // worth also putting it in Y's partition. + const CostType LargeKernelThreshold = + LargeKernelFactor ? ((ModuleCost / NumParts) * LargeKernelFactor) + : std::numeric_limits::max(); + + std::vector> Partitions; + Partitions.resize(NumParts); + + // Assign a partition to each kernel, and try to keep the partitions more or + // less balanced. We do that through a priority queue sorted in reverse, so we + // can always look at the partition with the least content. + // + // There are some cases where we will be deliberately unbalanced though. + // - Large kernels: we try to merge with existing partitions to reduce code + // duplication. + // - Kernels with indirect or external calls always go in the first partition + // (P0). + auto ComparePartitions = [](const std::pair &a, + const std::pair &b) { + // When two partitions have the same cost, assign to the one with the + // biggest ID first. This allows us to put things in P0 last, because P0 may + // have other stuff added later. + if (a.second == b.second) + return a.first < b.first; + return a.second > b.second; + }; + + // We can't use priority_queue here because we need to be able to access any + // element. This makes this a bit inefficient as we need to sort it again + // everytime we change it, but it's a very small array anyway (likely under 64 + // partitions) so it's a cheap operation. + std::vector> BalancingQueue; + for (unsigned I = 0; I < NumParts; ++I) + BalancingQueue.push_back(std::make_pair(I, 0)); + + // Helper function to... [truncated] ``````````

llvmbot commented 6 months ago

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

(See #83128 to review first commit) This enables the --lto-partitions option to work more consistently. This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of them to split modules in a way that avoids compilation issue (such as resource usage being incorrectly represented). This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking informatiton about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen. --- Patch is 59.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/89245.diff 22 Files Affected: - (modified) llvm/include/llvm/Target/TargetMachine.h (+12) - (modified) llvm/lib/LTO/LTOBackend.cpp (+9-4) - (added) llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp (+733) - (added) llvm/lib/Target/AMDGPU/AMDGPUSplitModule.h (+30) - (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+8) - (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h (+4) - (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1) - (added) llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize-with-call.ll (+46) - (added) llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize.ll (+37) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-cost-ranking.ll (+54) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependencies.ll (+50) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-duplication.ll (+41) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-external.ll (+43) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-indirect.ll (+69) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables-noexternal.ll (+42) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables.ll (+44) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-load-balancing.ll (+75) - (added) llvm/test/tools/llvm-split/AMDGPU/kernels-no-dependencies.ll (+39) - (added) llvm/test/tools/llvm-split/AMDGPU/large-kernels-merging.ll (+98) - (added) llvm/test/tools/llvm-split/AMDGPU/lit.local.cfg (+2) - (modified) llvm/tools/llvm-split/CMakeLists.txt (+7) - (modified) llvm/tools/llvm-split/llvm-split.cpp (+76-25) ``````````diff diff --git a/llvm/include/llvm/Target/TargetMachine.h b/llvm/include/llvm/Target/TargetMachine.h index ceb371bdc73480..48ea3cfe02775b 100644 --- a/llvm/include/llvm/Target/TargetMachine.h +++ b/llvm/include/llvm/Target/TargetMachine.h @@ -418,6 +418,18 @@ class TargetMachine { virtual unsigned getAddressSpaceForPseudoSourceKind(unsigned Kind) const { return 0; } + + /// Entry point for module splitting. Targets can implement custom module + /// splitting logic, mainly used by LTO for --lto-partitions. + /// + /// \returns `true` if the module was split, `false` otherwise. When `false` + /// is returned, it is assumed that \p ModuleCallback has never been called + /// and \p M has not been modified. + virtual bool splitModule( + Module &M, unsigned NumParts, + function_ref MPart)> ModuleCallback) const { + return false; + } }; /// This class describes a target machine that is implemented with the LLVM diff --git a/llvm/lib/LTO/LTOBackend.cpp b/llvm/lib/LTO/LTOBackend.cpp index 71e8849dc3cc91..d4b89ede2d7134 100644 --- a/llvm/lib/LTO/LTOBackend.cpp +++ b/llvm/lib/LTO/LTOBackend.cpp @@ -436,8 +436,7 @@ static void splitCodeGen(const Config &C, TargetMachine *TM, unsigned ThreadCount = 0; const Target *T = &TM->getTarget(); - SplitModule( - Mod, ParallelCodeGenParallelismLevel, + const auto HandleModulePartition = [&](std::unique_ptr MPart) { // We want to clone the module in a new context to multi-thread the // codegen. We do it by serializing partition modules to bitcode @@ -469,8 +468,14 @@ static void splitCodeGen(const Config &C, TargetMachine *TM, // Pass BC using std::move to ensure that it get moved rather than // copied into the thread's context. std::move(BC), ThreadCount++); - }, - false); + }; + + // Try target-specific module splitting first, then fallback to the default. + if (!TM->splitModule(Mod, ParallelCodeGenParallelismLevel, + HandleModulePartition)) { + SplitModule(Mod, ParallelCodeGenParallelismLevel, HandleModulePartition, + false); + } // Because the inner lambda (which runs in a worker thread) captures our local // variables, we need to wait for the worker threads to terminate before we diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp new file mode 100644 index 00000000000000..fa47d494f04148 --- /dev/null +++ b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp @@ -0,0 +1,733 @@ +//===- AMDGPUSplitModule.cpp ----------------------------------------------===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// +// +/// \file Implements a module splitting algorithm designed to support the +/// FullLTO --lto-partitions option for parallel codegen. This is completely +/// different from the common SplitModule pass, as this system is designed with +/// AMDGPU in mind. +/// +/// The basic idea of this module splitting implementation is the same as +/// SplitModule: load-balance the module's functions across a set of N +/// partitions to allow parallel codegen. However, it does it very +/// differently than the target-agnostic variant: +/// - Kernels are used as the module's "roots". +/// They're known entry points on AMDGPU, and everything else is often +/// internal only. +/// - Each kernel has a set of dependencies, and when a kernel and its +/// dependencies is considered "big", we try to put it in a partition where +/// most dependencies are already imported, to avoid duplicating large +/// amounts of code. +/// - There's special care for indirect calls in order to ensure +/// AMDGPUResourceUsageAnalysis can work correctly. +/// +/// This file also includes a more elaborate logging system to enable +/// users to easily generate logs that (if desired) do not include any value +/// names, in order to not leak information about the source file. +/// Such logs are very helpful to understand and fix potential issues with +/// module splitting. +// +//===----------------------------------------------------------------------===// + +#include "AMDGPUSplitModule.h" +#include "AMDGPUTargetMachine.h" +#include "Utils/AMDGPUBaseInfo.h" +#include "llvm/ADT/DenseMap.h" +#include "llvm/ADT/SmallVector.h" +#include "llvm/ADT/StringExtras.h" +#include "llvm/ADT/StringRef.h" +#include "llvm/Analysis/CallGraph.h" +#include "llvm/Analysis/TargetTransformInfo.h" +#include "llvm/IR/Function.h" +#include "llvm/IR/Instruction.h" +#include "llvm/IR/Module.h" +#include "llvm/IR/User.h" +#include "llvm/IR/Value.h" +#include "llvm/Support/Casting.h" +#include "llvm/Support/Debug.h" +#include "llvm/Support/FileSystem.h" +#include "llvm/Support/Process.h" +#include "llvm/Support/SHA256.h" +#include "llvm/Support/raw_ostream.h" +#include "llvm/Transforms/Utils/Cloning.h" +#include +#include +#include +#include +#include +#include + +using namespace llvm; + +#define DEBUG_TYPE "amdgpu-split-module" + +namespace { + +static cl::opt LargeKernelFactor( + "amdgpu-module-splitting-large-kernel-threshold", cl::init(2.0), cl::Hidden, + cl::desc( + "consider a kernel as large and needing special treatment when it " + "exceeds the average cost of a partition by this factor; e;g. 2.0 " + "means if the kernel and its dependencies is 2 times bigger than " + "an average partition; 0 disables large kernels handling entirely")); + +static cl::opt LargeKernelOverlapForMerge( + "amdgpu-module-splitting-large-kernel-merge-overlap", cl::init(0.8), + cl::Hidden, + cl::desc("defines how much overlap between two large kernel's dependencies " + "is needed to put them in the same partition")); + +static cl::opt NoExternalizeGlobals( + "amdgpu-module-splitting-no-externalize-globals", cl::Hidden, + cl::desc("disables externalization of global variable with local linkage; " + "may cause globals to be duplicated which increases binary size")); + +static cl::opt + LogDirOpt("amdgpu-module-splitting-log-dir", cl::Hidden, + cl::desc("output directory for AMDGPU module splitting logs")); + +static cl::opt + LogPrivate("amdgpu-module-splitting-log-private", cl::Hidden, + cl::desc("hash value names before printing them in the AMDGPU " + "module splitting logs")); + +using CostType = InstructionCost::CostType; +using PartitionID = unsigned; + +static std::string getName(const Value &V) { + static std::optional HideNames; + if (!HideNames) { + if (LogPrivate.getNumOccurrences()) + HideNames = LogPrivate; + else { + const auto EV = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_PRIVATE"); + HideNames = (EV.value_or("0") != "0"); + } + } + + if (!*HideNames) + return V.getName().str(); + return toHex(SHA256::hash(arrayRefFromStringRef(V.getName())), + /*LowerCase*/ true); +} + +/// Main logging helper. +/// +/// Logging can be configured by the following environment variable. +/// AMD_SPLIT_MODULE_LOG_DIR= +/// If set, uses as the directory to write logfiles to +/// each time module splitting is used. +/// AMD_SPLIT_MODULE_LOG_PRIVATE +/// If set to anything other than zero, all names are hidden. +/// +/// Both environment variables have corresponding CL options which +/// takes priority over them. +/// +/// Any output printed to the log files is also printed to dbgs() when -debug is +/// used and LLVM_DEBUG is defined. +/// +/// This approach has a small disadvantage over LLVM_DEBUG though: logging logic +/// cannot be removed from the code (by building without debug). This probably +/// has a small performance cost because if some computation/formatting is +/// needed for logging purpose, it may be done everytime only to be ignored +/// by the logger. +/// +/// As this pass only runs once and is not doing anything computationally +/// expensive, this is likely a reasonable trade-off. +/// +/// If some computation should really be avoided when unused, users of the class +/// can check whether any logging will occur by using the bool operator. +/// +/// \code +/// if (SML) { +/// // Executes only if logging to a file or if -debug is available and +/// used. +/// } +/// \endcode +class SplitModuleLogger { +public: + SplitModuleLogger(const Module &M) { + std::string LogDir = LogDirOpt; + if (LogDir.empty()) + LogDir = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_DIR").value_or(""); + + // No log dir specified means we don't need to log to a file. + // We may still log to dbgs(), though. + if (LogDir.empty()) + return; + + if (!sys::fs::is_directory(LogDir)) { + report_fatal_error("invalid AMDGPU split module log directory: '" + + Twine(LogDir) + "' is not a directory", + /*CrashDiag=*/false); + } + + // If a log directory is specified, create a new file with a unique name in + // that directory. + SmallString<0> FilePath; + int Fd; + std::string LogFile = (LogDir + "/" + "Module-%%-%%-%%-%%-%%-%%-%%.txt"); + if (auto Err = sys::fs::createUniqueFile(LogFile, Fd, FilePath)) { + dbgs() << LogFile << "\n"; + std::string Msg = + "Failed to create log file at '" + LogDir + "': " + Err.message(); + report_fatal_error(StringRef(Msg), + /*CrashDiag=*/false); + } + + FileOS = std::make_unique(Fd, /*shouldClose*/ true); + } + + bool hasLogFile() const { return FileOS != nullptr; } + + raw_ostream &logfile() { + assert(FileOS && "no logfile!"); + return *FileOS; + } + + /// \returns true if this SML will log anything either to a file or dbgs(). + /// Can be used to avoid expensive computations that are ignored when logging + /// is disabled. + operator bool() const { + return hasLogFile() || (DebugFlag && isCurrentDebugType(DEBUG_TYPE)); + } + +private: + std::unique_ptr FileOS; +}; + +template +static SplitModuleLogger &operator<<(SplitModuleLogger &SML, const Ty &Val) { + static_assert( + !std::is_same_v, + "do not print values to logs directly, use handleName instead!"); + LLVM_DEBUG(dbgs() << Val); + if (SML.hasLogFile()) + SML.logfile() << Val; + return SML; +} + +/// Calculate the cost of each function in \p M +/// \param SML Log Helper +/// \param TM TargetMachine instance used to retrieve TargetTransformInfo. +/// \param M Module to analyze. +/// \param CostMap[out] Resulting Function -> Cost map. +/// \return The module's total cost. +static CostType +calculateFunctionCosts(SplitModuleLogger &SML, const AMDGPUTargetMachine &TM, + Module &M, + DenseMap &CostMap) { + CostType ModuleCost = 0; + CostType KernelCost = 0; + + for (auto &Fn : M) { + if (Fn.isDeclaration()) + continue; + + CostType FnCost = 0; + auto TTI = TM.getTargetTransformInfo(Fn); + + for (auto &BB : Fn) { + for (auto &I : BB) { + auto Cost = + TTI.getInstructionCost(&I, TargetTransformInfo::TCK_CodeSize); + assert(Cost != InstructionCost::getMax()); + // Assume expensive if we can't tell the cost of an instruction. + CostType CostVal = + Cost.getValue().value_or(TargetTransformInfo::TCC_Expensive); + assert((FnCost + CostVal) >= FnCost && "Overflow!"); + FnCost += CostVal; + } + } + + assert(FnCost != 0); + + CostMap[&Fn] = FnCost; + assert((ModuleCost + FnCost) >= ModuleCost && "Overflow!"); + ModuleCost += FnCost; + + if (AMDGPU::isKernelCC(&Fn)) + KernelCost += FnCost; + } + + CostType FnCost = (ModuleCost - KernelCost); + SML << "=> Total Module Cost: " << ModuleCost << "\n" + << " => KernelCost: " << KernelCost << " (" + << format("%0.2f", (float(KernelCost) / ModuleCost) * 100) << "%)\n" + << " => FnsCost: " << FnCost << " (" + << format("%0.2f", (float(FnCost) / ModuleCost) * 100) << "%)\n"; + + return ModuleCost; +} + +/// When a kernel or any of its callees performs an indirect call, this function +/// takes over \ref addAllDependencies and adds all potentially callable +/// functions to \p Fns so they can be counted as dependencies of the kernel. +/// +/// This is needed due to how AMDGPUResourceUsageAnalysis operates: in the +/// presence of an indirect call, the function's resource usage is the same as +/// the most expensive function in the module. +/// \param M The module. +/// \param Fns[out] Resulting list of functions. +static void addAllIndirectCallDependencies(const Module &M, + DenseSet &Fns) { + for (const auto &Fn : M) { + if (!Fn.isDeclaration() && !AMDGPU::isEntryFunctionCC(Fn.getCallingConv())) + Fns.insert(&Fn); + } +} + +/// Adds the functions that \p Fn may call to \p Fns, then recurses into each +/// callee until all reachable functions have been gathered. +/// +/// \param SML Log Helper +/// \param CG Call graph for \p Fn's module. +/// \param Fn Current function to look at. +/// \param Fns[out] Resulting list of functions. +/// \param HadIndirectCall[out] Set to true if an indirect call was seen at some +/// point, either in \p Fn or in one of the function it calls. When that +/// happens, we fall back to adding all callable functions inside \p Fn's module +/// to \p Fns. +/// \param HadExternalCall[out] Set to true if a call to an external function +/// was seen at some point, either in \p Fn or in one of the function it calls +static void addAllDependencies(SplitModuleLogger &SML, const CallGraph &CG, + const Function &Fn, + DenseSet &Fns, + bool &HadIndirectCall, bool &HadExternalCall) { + assert(!Fn.isDeclaration()); + + const Module &M = *Fn.getParent(); + SmallVector WorkList({&Fn}); + while (!WorkList.empty()) { + const auto &CurFn = *WorkList.pop_back_val(); + + // Scan for an indirect call. If such a call is found, we have to + // conservatively assume this can call all non-entrypoint functions in the + // module. + for (const auto &BB : CurFn) { + for (const auto &I : BB) { + const auto *CB = dyn_cast(&I); + if (!CB || !CB->isIndirectCall()) + continue; + + SML << "Indirect call detected in " << getName(CurFn) + << " - treating all non-entrypoint functions as " + "potential dependencies\n"; + + // TODO: Print an ORE as well ? + addAllIndirectCallDependencies(M, Fns); + HadIndirectCall = true; + return; + } + } + + for (auto &CGEntry : *CG[&CurFn]) { + auto *Callee = CGEntry.second->getFunction(); + if (!Callee) + continue; + + assert(!AMDGPU::isKernelCC(Callee)); + + if (Callee->isDeclaration()) + continue; + + if (Callee->hasExternalLinkage()) + HadExternalCall = true; + + auto [It, Inserted] = Fns.insert(Callee); + if (Inserted) + WorkList.push_back(Callee); + } + } +} + +/// Contains information about a kernel and its dependencies. +struct KernelWithDependencies { + KernelWithDependencies(SplitModuleLogger &SML, CallGraph &CG, + const DenseMap &FnCosts, + const Function *Fn) + : Fn(Fn) { + addAllDependencies(SML, CG, *Fn, Dependencies, HasIndirectCall, + HasExternalCall); + TotalCost = FnCosts.at(Fn); + for (const auto *Dep : Dependencies) + TotalCost += FnCosts.at(Dep); + } + + const Function *Fn = nullptr; + DenseSet Dependencies; + /// Whether \p Fn or any of its \ref Dependencies contains an indirect call. + bool HasIndirectCall = false; + /// Whether \p Fn or any of its \ref Dependencies contains a call to a + /// function with external linkage. + bool HasExternalCall = false; + + CostType TotalCost = 0; + + /// \returns true if this kernel and its dependencies can be considered large + /// according to \p Threshold. + bool isLarge(CostType Threshold) const { + return TotalCost > Threshold && !Dependencies.empty(); + } +}; + +/// Calculates how much overlap there is between \p A and \p B. +/// \return A number between 0.0 and 1.0, where 1.0 means A == B and 0.0 means A +/// and B have no shared elements. Kernels do not count in overlap calculation. +static float calculateOverlap(const DenseSet &A, + const DenseSet &B) { + DenseSet Total; + for (const auto *F : A) { + if (!AMDGPU::isKernelCC(F)) + Total.insert(F); + } + + if (Total.empty()) + return 0.0f; + + unsigned NumCommon = 0; + for (const auto *F : B) { + if (AMDGPU::isKernelCC(F)) + continue; + + auto [It, Inserted] = Total.insert(F); + if (!Inserted) + ++NumCommon; + } + + return float(NumCommon) / Total.size(); +} + +/// Performs all of the partitioning work on \p M. +/// \param SML Log Helper +/// \param M Module to partition. +/// \param NumParts Number of partitions to create. +/// \param ModuleCost Total cost of all functions in \p M. +/// \param FnCosts Map of Function -> Cost +/// \param WorkList Kernels and their dependencies to process in order. +/// \returns The created partitions (a vector of size \p NumParts ) +static std::vector> +doPartitioning(SplitModuleLogger &SML, Module &M, unsigned NumParts, + CostType ModuleCost, + const DenseMap &FnCosts, + const SmallVector &WorkList) { + + SML << "\n--Partitioning Starts--\n"; + + // Calculate a "large kernel threshold". When more than one kernel's total + // import cost exceeds this value, we will try to merge it with other, + // similarly large kernels. + // + // e.g. let two kernels X and Y have a import cost of ~10% of the module, we + // assign X to a partition as usual, but when we get to Y, we check if it's + // worth also putting it in Y's partition. + const CostType LargeKernelThreshold = + LargeKernelFactor ? ((ModuleCost / NumParts) * LargeKernelFactor) + : std::numeric_limits::max(); + + std::vector> Partitions; + Partitions.resize(NumParts); + + // Assign a partition to each kernel, and try to keep the partitions more or + // less balanced. We do that through a priority queue sorted in reverse, so we + // can always look at the partition with the least content. + // + // There are some cases where we will be deliberately unbalanced though. + // - Large kernels: we try to merge with existing partitions to reduce code + // duplication. + // - Kernels with indirect or external calls always go in the first partition + // (P0). + auto ComparePartitions = [](const std::pair &a, + const std::pair &b) { + // When two partitions have the same cost, assign to the one with the + // biggest ID first. This allows us to put things in P0 last, because P0 may + // have other stuff added later. + if (a.second == b.second) + return a.first < b.first; + return a.second > b.second; + }; + + // We can't use priority_queue here because we need to be able to access any + // element. This makes this a bit inefficient as we need to sort it again + // everytime we change it, but it's a very small array anyway (likely under 64 + // partitions) so it's a cheap operation. + std::vector> BalancingQueue; + for (unsigned I = 0; I < NumParts; ++I) + BalancingQueue.push_back(std::make_pair(I, 0)); + + // Helper function to... [truncated] ``````````

arsenm commented 6 months ago

Can you add some tests with aliases? I'm not sure they work correctly in the call graph

gandhi56 commented 6 months ago

I found a bug during the review of #89683. I will attach the test files here. These tests should pass when they are run individually and fail when they are tested as a group. These nondeterministic results show up when AMDGPUSplitModule or SplitModule are used for partitioning. I suspect it may have something to do with tie breaks during load balancing.

clone-lds-function.ll.txt clone-lds-function-ancestor-kernels.ll.txt clone-lds-function-successor.ll.txt

jrbyrnes commented 6 months ago

These nondeterministic results show up when AMDGPUSplitModule or SplitModule are used for partitioning

I ran into https://llvm.org/docs/CodingStandards.html#beware-of-non-determinism-due-to-ordering-of-pointers a while ago, and thought I'd point it out since I see this work uses sets/maps with pointer keys. Not sure if the iteration on these effects test output, but something to consider..

Pierre-vh commented 5 months ago

I found a bug during the review of #89683. I will attach the test files here. These tests should pass when they are run individually and fail when they are tested as a group. These nondeterministic results show up when AMDGPUSplitModule or SplitModule are used for partitioning. I suspect it may have something to do with tie breaks during load balancing.

clone-lds-function.ll.txt clone-lds-function-ancestor-kernels.ll.txt clone-lds-function-successor.ll.txt

I'm not able to reproduce it, can you create a testcase that only depends on this pass? I had to modify the tests a lot to make them work so it possibly lost what makes them non-deterministic. Also I see you're using %u but I don't know what that's supposed to do, I replaced it with %t to make it work.

Pierre-vh commented 5 months ago

Note: I added the isAddressTaken check for the indirect calls dependency collection. I think that's a trivial addition to narrow down the set of indirectly callable functions.

gandhi56 commented 5 months ago

I found a bug during the review of #89683. I will attach the test files here. These tests should pass when they are run individually and fail when they are tested as a group. These nondeterministic results show up when AMDGPUSplitModule or SplitModule are used for partitioning. I suspect it may have something to do with tie breaks during load balancing.

clone-lds-function.ll.txt clone-lds-function-ancestor-kernels.ll.txt clone-lds-function-successor.ll.txt

Never mind, I had some errors in the RUN lines in my tests. This issue does not exist.

github-actions[bot] commented 5 months ago

:white_check_mark: With the latest revision this PR passed the C/C++ code formatter.

Pierre-vh commented 5 months ago

My understanding is that --lto-partitions doesn't work because the AMDGPU backend needs to consume the SCC in order. I'm assuming this patch maintains that logic but splits the independent kernel calls up to they can be done in parallel?

Exactly. If a function is called directly or indirectly by a kernel, it stays in that kernel's module.

I've thought that the true solution to resolving this would just be emitting the kernel resource usage inside of ld.lld. Presumably that would require emitting resource usage per-function in parallel and then the callgraph information (There's some small support for this already). Then ld.lld would need to traverse the callgraph to get the diameter of said graph. However I'm assuming that'd take more effort to define an actual linking ABI.

Indeed, and this is where I started looking as well (so we could just enable thinLTO), but it's hard to get right and this solution was by far the easiest and the most realistic one to do in the short term.

vitalybuka commented 5 months ago

Can this be fixed or reverted https://lab.llvm.org/buildbot/#/builders/85/builds/24181 ?

llvm / llvm-project

[AMDGPU] Add AMDGPU-specific module splitting #89245