"Interference" assertion in SplitKit - bisected to a SCEV change and isolated to AMDGPU division expansion

llvm / llvm-project

The issue

llc crashes as follows on the input attached below:

llc: /home/kdrewnia/llvm-project/llvm/lib/CodeGen/SplitKit.cpp:1662: void llvm::SplitEditor::splitLiveThroughBlock(unsigned int, unsigned int, SlotIndex, unsigned int, SlotIndex): Assertion `(!LeaveBefore || Idx <= LeaveBefore) && "Interference"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: llc -O2 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -mattr=+sramecc,-xnack ./reproducer.ll -o -
1.      Running pass 'CallGraph Pass Manager' on module './reproducer.ll'.
2.      Running pass 'Greedy Register Allocator' on function '@rock_gemm'
 [...abort...]
#13 0x00000000034492f7 llvm::SplitEditor::splitLiveThroughBlock(unsigned int, unsigned int, llvm::SlotIndex, unsigned int, llvm::SlotIndex) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/SplitKit.cpp:1668:5
#14 0x00000000033a1630 llvm::RAGreedy::splitAroundRegion(llvm::LiveRangeEdit&, llvm::ArrayRef<unsigned int>) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:0:11
#15 0x00000000033a263d llvm::RAGreedy::doRegionSplit(llvm::LiveInterval const&, unsigned int, bool, llvm::SmallVectorImpl<llvm::Register>&) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:0:3
#16 0x00000000033a1eff llvm::RAGreedy::tryRegionSplit(llvm::LiveInterval const&, llvm::AllocationOrder&, llvm::SmallVectorImpl<llvm::Register>&) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:1093:1
#17 0x00000000033a6b01 llvm::RAGreedy::trySplit(llvm::LiveInterval const&, llvm::AllocationOrder&, llvm::SmallVectorImpl<llvm::Register>&, llvm::SmallSet<llvm::Register, 16u, std::less<llvm::Register>> const&) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:1827:26
#18 0x00000000033a8ce5 llvm::RAGreedy::selectOrSplitImpl(llvm::LiveInterval const&, llvm::SmallVectorImpl<llvm::Register>&, llvm::SmallSet<llvm::Register, 16u, std::less<llvm::Register>>&, llvm::SmallVector<std::pair<llvm::LiveInterval const*, llvm::MCRegister>, 8u>&, unsigned int) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:2476:24
#19 0x00000000033a9337 llvm::RAGreedy::selectOrSplit(llvm::LiveInterval const&, llvm::SmallVectorImpl<llvm::Register>&) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:2151:7
#20 0x000000000337bd85 llvm::RegAllocBase::allocatePhysRegs() /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocBase.cpp:114:9
#21 0x00000000033ad3cd llvm::RAGreedy::runOnMachineFunction(llvm::MachineFunction&) /home/kdrewnia/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:2772:3
[...]

A git bisect run showed that the crash only occurs after #74467.

Full reproduction information, including the variant inputs and settings that do or do not trigger the crash, is provided below. Notably, passing the flag -amdgpu-codegenprepare-disable-idiv-expansion=true removes the failure.
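The bisection mentioned above could be automated with a small predicate script. The following is only a sketch of one way to do that, not the script actually used for this report: the `build/bin/llc` path and the function names are hypothetical, and the flags are copied from the "Program arguments" line of the stack dump.

```python
import subprocess

# Flags copied from the crash report's "Program arguments" line.
LLC_ARGS = ["-O2", "-mtriple=amdgcn-amd-amdhsa", "-mcpu=gfx908",
            "-mattr=+sramecc,-xnack", "-o", "-"]


def bisect_status(returncode, stderr_text):
    """Map an llc result onto the exit codes `git bisect run` understands."""
    if returncode == 0:
        return 0      # good: llc finished normally
    if "Assertion" in stderr_text and '"Interference"' in stderr_text:
        return 1      # bad: the SplitKit assertion fired
    return 125        # skip: unrelated failure, e.g. a broken build


def main(llc_path="build/bin/llc", input_file="reproducer.ll"):
    # llc_path is a hypothetical build location; adjust it to the local tree.
    proc = subprocess.run([llc_path, *LLC_ARGS, input_file],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.PIPE, text=True)
    return bisect_status(proc.returncode, proc.stderr)
```

Saved as a script that ends with `sys.exit(main())` (plus a rebuild step before it), this could be passed to `git bisect run`. Returning 125 for anything other than the "Interference" assertion keeps unrelated breakage from being blamed on the wrong commit.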

Reproduction files

All of these files are opt -O3 -mtriple=amdgcn-amd-amdhsa output.

I apologize in advance for the lack of a smaller test case; bugpoint didn't have much luck with this one.

reproducer.ll.txt is the input that triggers the crash. It is a matrix multiplication implementation.

fewer-batches-passing.ll.txt is the same code with a lower batch size specified. That is, the input IR was identical to the failing case, except that the statically known number of workgroups (annotated as a !range) differed between the two files.

The relevant part of the diff between those two files is:

--- reproducer.ll       2024-04-04 21:13:02.778679418 +0000
+++ fewer-batches-passing.ll    2024-04-04 21:14:50.335567529 +0000
@@ -5,29 +5,28 @@ target datalayout = "e-p:64:64-p1:64:64-
 @__wg_rock_gemm_0 = internal unnamed_addr addrspace(3) global [8192 x i8] undef, align 64
 @__wg_rock_gemm_1 = internal unnamed_addr addrspace(3) global [8192 x i8] undef, align 64

-define amdgpu_kernel void @rock_gemm(ptr inreg noalias nocapture nofree noundef nonnull readonly align 16 dereferenceable(805306368) %0, ptr inreg noalias nocapture nofree noundef nonnull readonly align 16 dereferenceable(100663296) %1, ptr inreg noalias nocapture nofree noundef nonnull writeonly align 16 dereferenceable(301989888) %2) local_unnamed_addr #0 !reqd_work_group_size !0 {
+define amdgpu_kernel void @rock_gemm(ptr inreg noalias nocapture nofree noundef nonnull readonly align 16 dereferenceable(125829120) %0, ptr inreg noalias nocapture nofree noundef nonnull readonly align 16 dereferenceable(15728640) %1, ptr inreg noalias nocapture nofree noundef nonnull writeonly align 16 dereferenceable(47185920) %2) local_unnamed_addr #0 !reqd_work_group_size !0 {
 .preheader21.preheader:
   %3 = tail call i32 @llvm.amdgcn.workgroup.id.x(), !range !1
   %.fr = freeze i32 %3
-  %.lhs.trunc = trunc i32 %.fr to i16
-  %4 = udiv i16 %.lhs.trunc, 24
-  %5 = mul i16 %4, 24
-  %.decomposed = sub i16 %.lhs.trunc, %5
-  %.zext17 = zext nneg i16 %.decomposed to i32
-  %.cmp = icmp ugt i16 %.decomposed, 21
+  %.lhs.trunc = trunc i32 %.fr to i8
+  %4 = udiv i8 %.lhs.trunc, 24
+  %5 = mul i8 %4, 24
+  %.decomposed = sub i8 %.lhs.trunc, %5
+  %.zext17 = zext nneg i8 %.decomposed to i32
+  %.cmp = icmp ugt i8 %.decomposed, 21
   %6 = select i1 %.cmp, i32 11, i32 0
   %7 = sub nuw nsw i32 12, %6
   %8 = tail call i32 @llvm.umin.i32(i32 %7, i32 11)
-  %.lhs.trunc18 = trunc i16 %.decomposed to i8
   %.rhs.trunc = trunc i32 %8 to i8
-  %9 = urem i8 %.lhs.trunc18, %.rhs.trunc
+  %9 = urem i8 %.decomposed, %.rhs.trunc
@@ -1633,7 +1632,7 @@ attributes #4 = { convergent mustprogres
 attributes #5 = { nounwind }

 !0 = !{i32 256, i32 1, i32 1}
-!1 = !{i32 0, i32 1536}
+!1 = !{i32 0, i32 240}
 !2 = !{i32 0, i32 256}
 !3 = !{}
 !4 = !{!5}
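One plausible reading of the diff above is that the !range upper bound decides how far the workgroup ID can be narrowed: values below 1536 need more than 8 bits, so the division is done in i16, while values below 240 fit in i8. The sketch below models that reasoning and the udiv / mul / sub sequence that yields %.decomposed. The helper names and the candidate width list are assumptions made for illustration; this is not LLVM's actual narrowing logic.

```python
def narrowest_int_width(upper_bound, widths=(8, 16, 32)):
    """Smallest width in `widths` that holds every value in [0, upper_bound)."""
    max_value = upper_bound - 1
    for bits in widths:
        if max_value < (1 << bits):
            return bits
    raise ValueError("range does not fit in any candidate width")


def decomposed_remainder(x, divisor=24):
    """Mirror of the udiv / mul / sub sequence that produces %.decomposed."""
    quotient = x // divisor          # %4 = udiv ..., 24
    return x - quotient * divisor    # %.decomposed = sub ..., (mul %4, 24)


print(narrowest_int_width(1536))  # 16, matching the i16 ops in reproducer.ll
print(narrowest_int_width(240))   # 8, matching the i8 ops in the passing file
```

Under this model, the only semantic difference between the two files is the width of the division by 24, which fits the observation that disabling the AMDGPU integer-division expansion avoids the crash.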

reproducer-barriers-removed.ll.txt is reproducer.ll with the call void asm statements removed. This variant also does not crash.

Steps to reproduce

(The -mattr inputs are kept to match the original source of the bug.)

llc -O2 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -mattr=+sramecc,-xnack ./reproducer.ll

This will crash as seen above.

However,

llc -O2 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -mattr=+sramecc,-xnack ./reproducer.ll  -amdgpu-codegenprepare-disable-idiv-expansion=true

will not crash.

Similarly, replacing reproducer.ll with either of the two variant files will not trigger the bug.

(Finally, adding -global-isel will also avoid the crash.)
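The combinations described above can be collected into a small driver to re-check all of them at once. This is a sketch under a few assumptions: `llc` is on PATH, the attached .ll.txt files have been saved locally under the .ll names used in the commands, and the function names are hypothetical.

```python
import shutil
import subprocess

BASE_FLAGS = ["-O2", "-mtriple=amdgcn-amd-amdhsa", "-mcpu=gfx908",
              "-mattr=+sramecc,-xnack"]

# (input file, extra flags, outcome described in this report)
CASES = [
    ("reproducer.ll", [], "crash"),
    ("reproducer.ll",
     ["-amdgpu-codegenprepare-disable-idiv-expansion=true"], "ok"),
    ("reproducer.ll", ["-global-isel"], "ok"),
    ("fewer-batches-passing.ll", [], "ok"),
    ("reproducer-barriers-removed.ll", [], "ok"),
]


def llc_command(input_file, extra_flags, llc="llc"):
    """Build the argument list for one llc invocation."""
    return [llc, *BASE_FLAGS, input_file, *extra_flags, "-o", "-"]


def run_cases(llc="llc"):
    """Run every case and return (file, flags, expected, observed) tuples."""
    if shutil.which(llc) is None:
        raise FileNotFoundError(f"{llc} not found on PATH")
    results = []
    for input_file, extra, expected in CASES:
        proc = subprocess.run(llc_command(input_file, extra, llc),
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        # Any non-zero status is treated as a crash in this sketch.
        observed = "ok" if proc.returncode == 0 else "crash"
        results.append((input_file, extra, expected, observed))
    return results
```

Calling `run_cases()` and comparing the last two fields of each tuple shows whether a given llc build still matches the behavior reported here.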

@llvm/issue-subscribers-backend-amdgpu

Author: Krzysztof Drewniak (krzysz00)

