bjacob opened 8 months ago
if the LLVM inliner is doing the inlining and not propagating the flag, that feels like an LLVM bug that needs to be fixed there - or we'd need to hook the inliner somehow and do the propagation ourselves
Oh right, that makes sense. I'll start by minimizing the .linked.ll.
When I run `llc` on the `.linked.ll`, even with `-O3`, I get no inlining at all (of the tile functions with CPU feature attributes into the callers without these attributes). So the inlining behavior of `iree-compile` here is a departure from `llc`.
Incidentally, https://github.com/llvm/llvm-project/pull/83820 just went in and sheds light on the semantics of inlining vs CPU features on x86: "The caller features must be a superset of the callee features."
Notice that this logic (which refuses to inline in this case) lives inside the X86 target, while our logic (which does inline in this case) uses a more generic middle-end pass manager: https://github.com/openxla/iree/blob/7782a414ea473c59f6d7a882cb510690ed666c79/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMIRPasses.cpp#L48 . I checked the `llc` source code and it does not use that.
Ultimately, the perfect inlining (of, say, that AVX-512-VNNI tile function into the dispatch function, plus DCE'ing of everything else) is only possible if we are actually specializing the code for this specific CPU feature. So either we accept that, and then the easy fix is to add the target machine's CPU features to the dispatch function, or we don't accept that, and then we have to accept that we won't get the inlining and the subsequent optimizations. I'd love to hear that there's a third way, but I don't see it right now. @benvanik
EDIT: is this where the fancy new multiversioning stuff enters the picture?
Yeah, I think the idea is we end up with one function per specialization (ukernel variant), and then the main exported function calls those functions but does not expect (or want) them to be inlined and is left generic. The refactoring @MaheshRavishankar is doing to make multiple functions work should make this possible.
Some progress, of sorts. I put together this patch to try, locally, perfectly aligning the caller and callee `target-features`:
```diff
diff --git a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp
index 46d1978d00..c1da9812ab 100644
--- a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp
+++ b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp
@@ -9,7 +9,9 @@
 #include "iree/builtins/ukernel/ukernel_bitcode.h"
 #include "iree/compiler/Codegen/Utils/Utils.h"
 #include "llvm/Bitcode/BitcodeReader.h"
+#include "llvm/IR/Attributes.h"
 #include "llvm/Support/MemoryBufferRef.h"
+#include "mlir/IR/Builders.h"
 #include "mlir/Support/LLVM.h"

 namespace mlir::iree_compiler::IREE::HAL {
@@ -57,6 +59,11 @@ loadUKernelBitcode(llvm::TargetMachine *targetMachine,
   // can result in a large penalty in both performance and code size.
   for (auto &func : module.get()->functions()) {
     func.addFnAttr(llvm::Attribute::AlwaysInline);
+    llvm::AttrBuilder builder(context);
+    func.removeFnAttr("target-cpu");
+    func.removeFnAttr("target-features");
+    func.addFnAttr("target-cpu", targetMachine->getTargetCPU());
+    func.addFnAttr("target-features", targetMachine->getTargetFeatureString());
   }
   return module;
 }
diff --git a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp
index f3b5311921..8c328e4176 100644
--- a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp
+++ b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp
@@ -371,6 +371,12 @@ public:
       // Our dispatches are all hot - that's kind of the point.
       // This may favor more aggressive optimizations.
       func.addFnAttr("hot");
+
+      func.addFnAttr("target-cpu", executableBuilder.getStringAttr(
+                                       targetMachine->getTargetCPU()));
+      func.addFnAttr("target-features",
+                     executableBuilder.getStringAttr(
+                         targetMachine->getTargetFeatureString()));
     }
```
With that, I still get exactly the same problem with `iree-compile`'s output, the `vpmaddwd` instruction instead of the `vpdpwssd`, but now this isn't UB anymore, as far as I can see. `llc` now processes the `.optimized.ll` without crashing and produces the same result, the unexpected `vpmaddwd` instruction, despite having the `vpdpwssd` intrinsics and now (unlike before) having all the right CPU feature attributes. So at least I can now try to minimize that `.optimized.ll` with `llc`. Before, I couldn't, due to the crashes.
This is the PR Ben was referring to: https://github.com/openxla/iree/pull/16665
Testcase: just a `i8 x i8 -> i32` matmul:

Reproduce:

Inspection of the generated assembly `/tmp/module_matmul_i8_linked_llvm_cpu_embedded_elf_x86_64.s` shows that baseline AVX-512 code is generated (`VPMADDWD`) instead of the expected AVX-512-VNNI code (`VPDPWSSD`):

Why? The dumped intermediates show that all the way to the post-linking optimized IR (`/tmp/module_matmul_i8_linked_llvm_cpu_embedded_elf_x86_64.optimized.ll`), it was the expected AVX-512-VNNI intrinsic function:

But wait, what is that attribute `#1` on that function? Does it have the required CPU feature enabled? Nope:

So our code here is Undefined Behavior, and indeed, while initially minimizing it with `llc`, I did run into should-not-get-here crashes in x86 instruction selection. And in our current e2e IREE use case, the Undefined Behavior, while not crashing or affecting correctness, is still causing us to miss the intended VNNI instruction.

"Of course" this dispatch function doesn't have the required `+avx512vnni` CPU feature attribute, since we never put it there. The only functions that have the `+avx512vnni` CPU feature attribute are the ukernel internal VNNI implementation functions, which are compiled with this CPU feature enabled in the first place.

I guess I was expecting the attribute to be propagated from callee to caller as the VNNI inner tile function gets inlined, first into `iree_uk_mmt4d` and then into the dispatch function. It's not.

How do we resolve that in a way that doesn't violate the design with target specialization in LLVMCPUTarget? @benvanik