llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

"Interference" assertion in SplitKit - bisected to a SCEV change and isolated to AMDGPU division expansion #87721

Closed by krzysz00 5 months ago

krzysz00 commented 7 months ago

The issue

llc crashes as follows on the input attached below:

llc: /home/kdrewnia/llvm-project/llvm/lib/CodeGen/SplitKit.cpp:1662: void llvm::SplitEditor::splitLiveThroughBlock(unsigned int, unsigned int, SlotIndex, unsigned int, SlotIndex): Assertion `(!LeaveBefore || Idx <= LeaveBefore) && "Interference"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: llc -O2 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -mattr=+sramecc,-xnack ./reproducer.ll -o -
1.      Running pass 'CallGraph Pass Manager' on module './reproducer.ll'.
2.      Running pass 'Greedy Register Allocator' on function '@rock_gemm'
 [...abort...]
#13 0x00000000034492f7 llvm::SplitEditor::splitLiveThroughBlock(unsigned int, unsigned int, llvm::SlotIndex, unsigned int, llvm::SlotIndex) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/SplitKit.cpp:1668:5
#14 0x00000000033a1630 llvm::RAGreedy::splitAroundRegion(llvm::LiveRangeEdit&, llvm::ArrayRef<unsigned int>) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:0:11
#15 0x00000000033a263d llvm::RAGreedy::doRegionSplit(llvm::LiveInterval const&, unsigned int, bool, llvm::SmallVectorImpl<llvm::Register>&) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:0:3
#16 0x00000000033a1eff llvm::RAGreedy::tryRegionSplit(llvm::LiveInterval const&, llvm::AllocationOrder&, llvm::SmallVectorImpl<llvm::Register>&) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:1093:1
#17 0x00000000033a6b01 llvm::RAGreedy::trySplit(llvm::LiveInterval const&, llvm::AllocationOrder&, llvm::SmallVectorImpl<llvm::Register>&, llvm::SmallSet<llvm::Register, 16u, std::less<llvm::Register>> const&) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:1827:26
#18 0x00000000033a8ce5 llvm::RAGreedy::selectOrSplitImpl(llvm::LiveInterval const&, llvm::SmallVectorImpl<llvm::Register>&, llvm::SmallSet<llvm::Register, 16u, std::less<llvm::Register>>&, llvm::SmallVector<std::pair<llvm::LiveInterval const*, llvm::MCRegister>, 8u>&, unsigned int) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:2476:24
#19 0x00000000033a9337 llvm::RAGreedy::selectOrSplit(llvm::LiveInterval const&, llvm::SmallVectorImpl<llvm::Register>&) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:2151:7
#20 0x000000000337bd85 llvm::RegAllocBase::allocatePhysRegs() /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocBase.cpp:114:9
#21 0x00000000033ad3cd llvm::RAGreedy::runOnMachineFunction(llvm::MachineFunction&) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:2772:3
[...]

A git bisect run identified #74467 as the first change after which the crash occurs.

Full reproduction details, along with variant inputs and settings that do or do not trigger the crash, are provided below. Notably, passing the flag -amdgpu-codegenprepare-disable-idiv-expansion=true makes the failure go away.

Reproduction files

All of these files are opt -O3 -mtriple=amdgcn-amd-amdhsa output.

I apologize in advance for the lack of a smaller test case; bugpoint didn't have much luck with this one.

reproducer.ll.txt is the input that triggers the crash. It is a matrix multiplication implementation.

fewer-batches-passing.ll.txt is the same code with a smaller batch size specified. That is, the input IR was identical to the failing case, except that the statically known number of workgroups (annotated as a !range) differed between the two files.

In relevant part, the diff between those two files is

--- reproducer.ll       2024-04-04 21:13:02.778679418 +0000
+++ fewer-batches-passing.ll    2024-04-04 21:14:50.335567529 +0000
@@ -5,29 +5,28 @@ target datalayout = "e-p:64:64-p1:64:64-
 @__wg_rock_gemm_0 = internal unnamed_addr addrspace(3) global [8192 x i8] undef, align 64
 @__wg_rock_gemm_1 = internal unnamed_addr addrspace(3) global [8192 x i8] undef, align 64

-define amdgpu_kernel void @rock_gemm(ptr inreg noalias nocapture nofree noundef nonnull readonly align 16 dereferenceable(805306368) %0, ptr inreg noalias nocapture nofree noundef nonnull readonly align 16 dereferenceable(100663296) %1, ptr inreg noalias nocapture nofree noundef nonnull writeonly align 16 dereferenceable(301989888) %2) local_unnamed_addr #0 !reqd_work_group_size !0 {
+define amdgpu_kernel void @rock_gemm(ptr inreg noalias nocapture nofree noundef nonnull readonly align 16 dereferenceable(125829120) %0, ptr inreg noalias nocapture nofree noundef nonnull readonly align 16 dereferenceable(15728640) %1, ptr inreg noalias nocapture nofree noundef nonnull writeonly align 16 dereferenceable(47185920) %2) local_unnamed_addr #0 !reqd_work_group_size !0 {
 .preheader21.preheader:
   %3 = tail call i32 @llvm.amdgcn.workgroup.id.x(), !range !1
   %.fr = freeze i32 %3
-  %.lhs.trunc = trunc i32 %.fr to i16
-  %4 = udiv i16 %.lhs.trunc, 24
-  %5 = mul i16 %4, 24
-  %.decomposed = sub i16 %.lhs.trunc, %5
-  %.zext17 = zext nneg i16 %.decomposed to i32
-  %.cmp = icmp ugt i16 %.decomposed, 21
+  %.lhs.trunc = trunc i32 %.fr to i8
+  %4 = udiv i8 %.lhs.trunc, 24
+  %5 = mul i8 %4, 24
+  %.decomposed = sub i8 %.lhs.trunc, %5
+  %.zext17 = zext nneg i8 %.decomposed to i32
+  %.cmp = icmp ugt i8 %.decomposed, 21
   %6 = select i1 %.cmp, i32 11, i32 0
   %7 = sub nuw nsw i32 12, %6
   %8 = tail call i32 @llvm.umin.i32(i32 %7, i32 11)
-  %.lhs.trunc18 = trunc i16 %.decomposed to i8
   %.rhs.trunc = trunc i32 %8 to i8
-  %9 = urem i8 %.lhs.trunc18, %.rhs.trunc
+  %9 = urem i8 %.decomposed, %.rhs.trunc
@@ -1633,7 +1632,7 @@ attributes #4 = { convergent mustprogres
 attributes #5 = { nounwind }

 !0 = !{i32 256, i32 1, i32 1}
-!1 = !{i32 0, i32 1536}
+!1 = !{i32 0, i32 240}
 !2 = !{i32 0, i32 256}
 !3 = !{}
 !4 = !{!5}
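To make the diff above easier to read, here is a small Python sketch of the arithmetic it reflects. It is only an illustration: the helper names are hypothetical, and the issue does not say which optimization performed the narrowing. It assumes that the !range bounds are half-open, so the workgroup ID is below 1536 in the crashing file and below 240 in the passing one, and that the udiv/mul/sub sequence computes a remainder by 24.

```python
def narrowest_int_width(upper_exclusive, widths=(8, 16, 32)):
    """Return the smallest candidate bit width that holds every value in
    [0, upper_exclusive). Hypothetical helper, for illustration only."""
    max_value = upper_exclusive - 1
    for width in widths:
        if max_value < (1 << width):
            return width
    raise ValueError("range does not fit in any candidate width")


def decomposed_remainder(x, divisor=24):
    """Mirror the udiv / mul / sub sequence from the diff: x - (x / d) * d."""
    quotient = x // divisor          # %4 = udiv ..., 24
    product = quotient * divisor     # %5 = mul %4, 24
    return x - product               # %.decomposed = sub ..., %5


if __name__ == "__main__":
    # !1 = !{i32 0, i32 1536} in reproducer.ll: needs more than 8 bits.
    print(narrowest_int_width(1536))
    # !1 = !{i32 0, i32 240} in fewer-batches-passing.ll: fits in 8 bits.
    print(narrowest_int_width(240))
```

Under these assumptions the larger range forces the division to be done in i16 followed by an extra trunc to i8, while the smaller range lets everything stay in i8, which matches the two sides of the diff.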

reproducer-barriers-removed.ll.txt is reproducer.ll with the call void asm statements removed. This variant also does not crash.

Steps to reproduce

(The -mattr inputs are kept to match the original source of the bug.)

llc -O2 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -mattr=+sramecc,-xnack ./reproducer.ll

This will crash as seen above.

However,

llc -O2 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -mattr=+sramecc,-xnack ./reproducer.ll  -amdgpu-codegenprepare-disable-idiv-expansion=true

will not crash.

Similarly, replacing reproducer.ll with either of the two variant files will not trigger the bug.

(Finally, adding -global-isel also avoids the crash.)

llvmbot commented 7 months ago

@llvm/issue-subscribers-backend-amdgpu

Author: Krzysztof Drewniak (krzysz00)

arsenm commented 7 months ago

Just about anything that perturbs the IR will hide a failure in register allocation, which is essentially random. You can get further by reducing the MIR, which you can do with llvm-reduce.

krzysz00 commented 5 months ago

Closed as vague ticket that doesn't help anyone

arsenm commented 5 months ago

> Closed as vague ticket that doesn't help anyone

It's useful if it still reproduces. Also, the trick with these is to reduce all the way down to a minimal -start-before/-stop-after MIR sample.
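As a rough sketch of the MIR reduction suggested in this thread, an interestingness test for llvm-reduce could look like the Python script below. llvm-reduce runs the test with the candidate file as the last argument and treats exit code 0 as "still interesting". The llc flags are copied from the reproduction command in this issue; the -start-before=greedy value, the script name interesting.py, and the helper names are assumptions, and the pass name or instance may need adjusting for the AMDGPU pipeline.

```python
#!/usr/bin/env python3
"""Hypothetical llvm-reduce interestingness test for the SplitKit assertion."""
import subprocess
import sys

# Flags taken from the issue; "-start-before=greedy" is an assumption.
LLC_ARGS = [
    "llc", "-O2",
    "-mtriple=amdgcn-amd-amdhsa", "-mcpu=gfx908", "-mattr=+sramecc,-xnack",
    "-start-before=greedy",
    "-o", "/dev/null",
]


def shows_interference_assert(stderr_text):
    """True only if llc died on the "Interference" assertion, not any crash."""
    return "Assertion" in stderr_text and '"Interference"' in stderr_text


def main(argv):
    if len(argv) < 2:
        return 1  # no candidate file given: report "not interesting"
    proc = subprocess.run(
        LLC_ARGS + [argv[-1]],
        capture_output=True, text=True, errors="replace",
    )
    return 0 if shows_interference_assert(proc.stderr) else 1


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Assuming an assertions-enabled llc on PATH, the MIR could first be captured with something like llc ... -stop-before=greedy -o reproducer.mir ./reproducer.ll and then reduced with llvm-reduce --test=./interesting.py reproducer.mir. Matching on the assertion text rather than on any non-zero exit keeps the reducer from drifting toward an unrelated crash.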