iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Missing CPU features attributes on dispatch functions lead to UB / missed target instructions #16670

Open bjacob opened 8 months ago

bjacob commented 8 months ago

Testcase: just an i8 x i8 -> i32 matmul:

func.func @matmul_dynamic(%lhs: tensor<?x?xi8>, %rhs: tensor<?x?xi8>, %acc: tensor<?x?xi32>) -> tensor<?x?xi32> {
  %result = linalg.matmul ins(%lhs, %rhs: tensor<?x?xi8>, tensor<?x?xi8>) outs(%acc: tensor<?x?xi32>) -> tensor<?x?xi32>
  return %result: tensor<?x?xi32>
}

Reproduce:

tools/iree-compile \
  ~/matmul_i8.mlir -o /tmp/a.vmfb \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=znver4 \
  --iree-llvmcpu-enable-ukernels=all \
  --iree-hal-dump-executable-intermediates-to=/tmp \
  -mlir-disable-threading \
  -mlir-print-ir-after-all \
  2>/tmp/log

Inspection of the generated assembly /tmp/module_matmul_i8_linked_llvm_cpu_embedded_elf_x86_64.s shows that baseline AVX-512 code is generated (VPMADDWD) instead of the expected AVX-512-VNNI code (VPDPWSSD):

matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x2_i8xi8xi32:
[...]
    vshufi64x2  $27, %zmm16, %zmm16, %zmm19
    vpmaddwd    %zmm16, %zmm21, %zmm24
    vpmaddwd    %zmm17, %zmm21, %zmm26
    vpmaddwd    %zmm18, %zmm21, %zmm25
    vpmaddwd    %zmm19, %zmm21, %zmm21
[...]

Why? The dumped intermediates show that, all the way to the post-linking optimized IR (/tmp/module_matmul_i8_linked_llvm_cpu_embedded_elf_x86_64.optimized.ll), the code still calls the expected AVX-512-VNNI intrinsic:

define internal noundef i32 @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x2_i8xi8xi32(ptr noalias nocapture nonnull readonly align 16 %0, ptr noalias nocapture nonnull readonly align 16 %1, ptr noalias nocapture nonnull readonly align 16 %2) #1 !dbg !90 {
[...]
  %358 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %334, <16 x i32> %354, <16 x i32> %347), !dbg !91
  %359 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %333, <16 x i32> %354, <16 x i32> %348), !dbg !91
  %360 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %332, <16 x i32> %354, <16 x i32> %349), !dbg !91
  %361 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %331, <16 x i32> %354, <16 x i32> %350), !dbg !91
  %362 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %330, <16 x i32> %355, <16 x i32> %347), !dbg !91
  %363 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %329, <16 x i32> %355, <16 x i32> %348), !dbg !91
[...]

But wait, what is that attribute #1 on that function? Does it have the required CPU feature enabled? Nope:

attributes #1 = { nofree norecurse nosync nounwind "frame-pointer"="all" "hot" "no-builtins" "nonlazybind" }
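For contrast, a function compiled with the feature enabled would carry attributes along these lines (an illustrative fragment following the standard LLVM IR attribute syntax, not copied from the dump; the real znver4 feature list is much longer):

```llvm
; Hypothetical attribute group showing the shape of a correct feature set.
attributes #2 = { nofree norecurse nosync nounwind "frame-pointer"="all" "hot"
                  "target-cpu"="znver4"
                  "target-features"="+avx512f,+avx512bw,+avx512vnni" }
```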

So our code here is Undefined Behavior, and indeed, while initially minimizing it with llc, I did run into should-not-get-here crashes in x86 instruction selection. And in our current e2e IREE use case, the Undefined Behavior, while not crashing or affecting correctness, is still causing us to miss the intended VNNI instruction.

"Of course" this dispatch function doesn't have the required +avx512vnni CPU feature attribute, since we never put it there. The only functions that have the +avx512vnni CPU feature attribute are the ukernel internal VNNI implementation functions, which are compiled with this CPU feature enabled in the first place.

I guess I was expecting the attribute to be propagated from callee to caller as the VNNI inner tile function gets inlined first into iree_uk_mmt4d and then into the dispatch function. It's not.

How do we resolve that in a way that doesn't violate the design with target specialization in LLVMCPUTarget? @benvanik

benvanik commented 8 months ago

if the LLVM inliner is doing the inlining and not propagating the flag, that feels like an LLVM bug that needs to be fixed there - or we'd need to hook the inliner somehow and do the propagation ourselves
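A minimal sketch of what "do the propagation ourselves" could look like at the string level: merging a callee's target-features into its caller's when the callee gets inlined. This is plain C++ over LLVM's comma-separated feature syntax ("+feat"/"-feat"), not actual inliner-hook code; the function names are made up for illustration, and a real implementation would also have to resolve "-feat" conflicts.

```cpp
#include <set>
#include <sstream>
#include <string>

// Split an LLVM-style feature string ("+avx512f,+avx512vnni,...") into parts.
static std::set<std::string> splitFeatures(const std::string &s) {
  std::set<std::string> out;
  std::stringstream ss(s);
  std::string item;
  while (std::getline(ss, item, ','))
    if (!item.empty())
      out.insert(item);
  return out;
}

// Union of caller and callee features, rejoined in sorted order.
// Illustrates only the merge; hooking this into the inliner is the hard part.
std::string mergeTargetFeatures(const std::string &caller,
                                const std::string &callee) {
  std::set<std::string> merged = splitFeatures(caller);
  for (const std::string &f : splitFeatures(callee))
    merged.insert(f);
  std::string out;
  for (const std::string &f : merged) {
    if (!out.empty())
      out += ',';
    out += f;
  }
  return out;
}
```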

bjacob commented 8 months ago

Oh right, that makes sense. I'll start by minimizing the .linked.ll.

bjacob commented 8 months ago

When I run llc on the .linked.ll, even with -O3, I get no inlining at all (of the tile functions with CPU feature attributes into the callers without these attributes).

So iree-compile's inlining behavior here departs from llc's.

Incidentally, https://github.com/llvm/llvm-project/pull/83820 just went in and sheds light on the semantics of inlining vs CPU features on x86: "The caller features must be a superset of the callee features."

Notice that this logic (which says no inlining in this case) lives inside the X86 target, while our logic (which does inline in this case) uses a generic middle-end pass manager: https://github.com/openxla/iree/blob/7782a414ea473c59f6d7a882cb510690ed666c79/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMIRPasses.cpp#L48 . I checked the llc source code, and it does not use that.
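The X86-side rule quoted from that PR can be paraphrased as a set check: inlining is allowed only if the caller's enabled features contain every callee feature. A standalone sketch of that check (not the actual X86 target code; written here in plain C++ over the feature-string syntax for illustration):

```cpp
#include <set>
#include <sstream>
#include <string>

// Parse "+feat1,+feat2,..." into the set of enabled feature names.
static std::set<std::string> enabledFeatures(const std::string &s) {
  std::set<std::string> out;
  std::stringstream ss(s);
  std::string item;
  while (std::getline(ss, item, ','))
    if (item.size() > 1 && item[0] == '+')
      out.insert(item.substr(1));
  return out;
}

// "The caller features must be a superset of the callee features":
// returns true only if every callee feature is also enabled in the caller.
bool callerMayInlineCallee(const std::string &callerFeatures,
                           const std::string &calleeFeatures) {
  std::set<std::string> caller = enabledFeatures(callerFeatures);
  for (const std::string &f : enabledFeatures(calleeFeatures))
    if (!caller.count(f))
      return false;
  return true;
}
```

Under this rule, a dispatch function without "+avx512vnni" may not receive an inlined VNNI tile function, which is exactly the case at hand.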

Ultimately, the perfect inlining (of, say, that AVX-512-VNNI tile function into the dispatch function, plus DCE'ing of everything else) is only possible if we are actually specializing the code for this specific CPU feature. So either we accept that, and then the easy fix is to add the target machine's CPU features to the dispatch function, or we don't accept that, and then we have to live without the inlining and the subsequent optimizations. I'd love to hear that there's a third way, but I don't see it right now? @benvanik

EDIT - is this where the fancy new multiversioning stuff enters the picture?

benvanik commented 8 months ago

yeah, I think the idea is we end up with one function per specialization (ukernel variant) and then the main exported function calls those functions but does not expect (or want) them to be inlined and is left generic. the refactoring @MaheshRavishankar is doing to make multiple functions to work should make this possible.

bjacob commented 8 months ago

Some progress, of sorts. I put together this patch to try perfectly aligning the caller and callee target-features locally:

diff --git a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp
index 46d1978d00..c1da9812ab 100644
--- a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp
+++ b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp
@@ -9,7 +9,9 @@
 #include "iree/builtins/ukernel/ukernel_bitcode.h"
 #include "iree/compiler/Codegen/Utils/Utils.h"
 #include "llvm/Bitcode/BitcodeReader.h"
+#include "llvm/IR/Attributes.h"
 #include "llvm/Support/MemoryBufferRef.h"
+#include "mlir/IR/Builders.h"
 #include "mlir/Support/LLVM.h"

 namespace mlir::iree_compiler::IREE::HAL {
@@ -57,6 +59,11 @@ loadUKernelBitcode(llvm::TargetMachine *targetMachine,
   // can result in a large penalty in both performance and code size.
   for (auto &func : module.get()->functions()) {
     func.addFnAttr(llvm::Attribute::AlwaysInline);
+    llvm::AttrBuilder builder(context);
+    func.removeFnAttr("target-cpu");
+    func.removeFnAttr("target-features");
+    func.addFnAttr("target-cpu", targetMachine->getTargetCPU());
+    func.addFnAttr("target-features", targetMachine->getTargetFeatureString());
   }
   return module;
 }
diff --git a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp
index f3b5311921..8c328e4176 100644
--- a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp
+++ b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp
@@ -371,6 +371,12 @@ public:
       // Our dispatches are all hot - that's kind of the point.
       // This may favor more aggressive optimizations.
       func.addFnAttr("hot");
+
+      func.addFnAttr("target-cpu", executableBuilder.getStringAttr(
+                                       targetMachine->getTargetCPU()));
+      func.addFnAttr("target-features",
+                     executableBuilder.getStringAttr(
+                         targetMachine->getTargetFeatureString()));
     }

With that, I still get exactly the same problem in iree-compile's output (the vpmaddwd instruction instead of vpdpwssd), but as far as I can see this is no longer UB. llc now processes the .optimized.ll without crashing and produces the same result, the unexpected vpmaddwd instruction, despite the IR containing the vpdpwssd intrinsics and now (unlike before) having all the right CPU feature attributes. So at least I can now try to minimize that .optimized.ll with llc; before, I couldn't, due to the crashes.

MaheshRavishankar commented 8 months ago

This is the PR Ben was referring to: https://github.com/openxla/iree/pull/16665