llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.14k stars 12.02k forks source link

Reg alloc reloads register three times in innermost vector loop, although there are many free regs #32069

Open JonPsson opened 7 years ago

JonPsson commented 7 years ago
Bugzilla Link 32722
Version trunk
OS Linux
Attachments reduced test case
CC @MatzeB,@JonPsson,@qcolombet,@uweigand

Extended Description

I noticed that there are reloads of a register which is used as an address inside a vectorized loop.

The loop is innermost, is a single block, and there are 10 unused registers in the loop.

I am thinking that RA should be able to split live ranges around the loop?

bin/llc -O3 -mtriple=s390x-linux-gnu -mcpu=z13 -o out.s vecloopspills.ll

.LBB0_18: # %vector.body

=>This Inner Loop Header: Depth=1

    lgfr    %r5, %r3
    sllg    %r5, %r5, 3
    vlrepg  %v0, 0(%r5,%r14)
    ahik    %r14, %r3, -2
    lg      %r13, 200(%r15)         # 8-byte Folded Reload
    lgfr    %r14, %r14
    sllg    %r14, %r14, 3
    vlrepg  %v1, 0(%r14,%r13)
    vmrhg   %v0, %v1, %v0
    lg      %r13, 200(%r15)         # 8-byte Folded Reload
    vflcdb  %v0, %v0
    vsteg   %v0, 0(%r14,%r13), 0
    lg      %r14, 200(%r15)         # 8-byte Folded Reload
    ahi     %r3, 4
    aghi    %r4, -2
    vsteg   %v0, 0(%r5,%r14), 1
    jne     .LBB0_18
JonPsson commented 6 years ago

Now down to ~ 20 cases left.

bin/llc -mtriple=s390x-linux-gnu -mcpu=z13 tc_badloopreg_18.ll -tail-dup-placement=false

Two reloads in loop, even though R0D, R1D and R6D are available.

JonPsson commented 6 years ago

reduced testcase

JonPsson commented 7 years ago

reduced testcase We are now down to ~ 50 known cases on SystemZ / SPEC.

Next test case does spilling three times in loop while there are many free regs available.

Loop of .LBB0_7: 3 spills, yet many regs unused: %R1D, %R2D, %R3D, %R4D, %R5D, %R14D, %R12D, %R11D, %R10D, %R9D, %R8D.

/bin/llc -mtriple=s390x-linux-gnu -mcpu=z13 tc_badloopreg_17.ll -tail-dup-placement=false

Note this may depend on a patch of LiveIntervals.cpp:

Index: lib/CodeGen/LiveInterval.cpp

--- lib/CodeGen/LiveInterval.cpp (revision 316856) +++ lib/CodeGen/LiveInterval.cpp (working copy) @@ -468,8 +468,9 @@ bool LiveRange::overlaps(const LiveRange // I and J are overlapping. Find the later start. SlotIndex Def = std::max(I->start, J->start); // Allow the overlap if Def is a coalescable copy.

JonPsson commented 7 years ago

reduced testcase Found a case of an inner loop (depth 2) which is reloading a function argument into %r1. This is unnecessary, since %r6-%r8 are unused:

.LBB1_6: # %while.body40

Parent Loop BB1_2 Depth=1

                                    # =>  This Inner Loop Header: Depth=2
    lg      %r1, 160(%r15)          # 8-byte Folded Reload
    lgr     %r2, %r12
    lgr     %r3, %r10
    lgr     %r4, %r11
    basr    %r14, %r1
    la      %r10, 0(%r9,%r10)
    sgr     %r13, %r9
    clgrjhe %r13, %r9, .LBB1_6
    j       .LBB1_1

bin/llc -mtriple=s390x-linux-gnu -mcpu=z13 -o out.s ./tc_badloopreg_12.ll -tail-dup-placement=false

JonPsson commented 7 years ago

experimental patch to enable subreg liveness

JonPsson commented 7 years ago

reduced testcase which crashes I found that tc_badloopreg_11.ll (latest test case) indeed got remedied by enabling subreg liveness.

I then wanted to see how much of an effect that would have on the remaining 317 loops I see with unecessary spilling / reloading inside loops.

Unfortunately, I however then ran into an assert with your patch. To reproduce, first apply enableSubRegLiveness.patch (always return true in MRI / subRegLivenessEnabled())

then run

bin/llc -mtriple=s390x-linux-gnu -mcpu=z13 tc_regsplit_crash.ll

lc: /home/ijonpan/llvm/llvm-dev-2/include/llvm/ADT/IntervalMap.h:635: unsigned int llvm::IntervalMapImpl::LeafNode<Key T, ValT, N, Traits>::insertFrom(unsigned int&, unsigned int, KeyT, KeyT, ValT) [with KeyT = llvm::SlotIndex; ValT = llv m::LiveInterval*; unsigned int N = 8u; Traits = llvm::IntervalMapInfo]: Assertion `(i == Size || Trait s::stopLess(b, start(i))) && "Overlapping insert"' failed.

​0 0x0000000002b812a0 llvm::sys::PrintStackTrace(llvm::raw_ostream&) (bin/llc+0x2b812a0)

​1 0x0000000002b7f23e llvm::sys::RunSignalHandlers() (bin/llc+0x2b7f23e)

​2 0x0000000002b7f3b0 SignalHandler(int) (bin/llc+0x2b7f3b0)

​3 0x0000000006e46c8e

​4 0x000003ff9123d4f8 __GI_raise (/lib64/libc.so.6+0x3d4f8)

​5 0x000003ff9123edee __GI_abort (/lib64/libc.so.6+0x3edee)

​6 0x000003ff912351ac __assert_fail_base (/lib64/libc.so.6+0x351ac)

​7 0x000003ff91235254 (/lib64/libc.so.6+0x35254)

​8 0x00000000024afadc llvm::IntervalMapImpl::LeafNode<llvm::SlotIndex, llvm::LiveInterval, 8u, llvm::IntervalMapInfo >::insertFrom(unsigned int&, unsigned int, llvm::SlotIndex, llvm::SlotIndex, llvm::LiveInterval) (bin/llc+0x24afadc)

​9 0x00000000024b9228 llvm::LiveIntervalUnion::unify(llvm::LiveInterval&, llvm::LiveRange const&) (bin/llc+0x24b9228)

​10 0x00000000024beefc llvm::LiveRegMatrix::assign(llvm::LiveInterval&, unsigned int) (bin/llc+0x24beefc)

​11 0x000000000237e206 (anonymous namespace)::RAGreedy::trySplitLoopInAll(llvm::LiveInt

...

JonPsson commented 7 years ago

reduced testcase with reload in loop New test-case with an unnecessary reload inside loop.

%vreg6 is the function argument, which is stored unmodified in the loop in each iteration. Regalloc decides to spill this, and then reloads it every time. Yet, %R12D is unused in the loop.

I think this is due to the fact that %R12Q (%R12D + %R13D) is defined by a division (which is always done with a 128bit register). Then, in the loop only %R13D is used. Would it somehow be reasonable to fix this so that "if only one subreg is used inside loop, then split the wide interval around loop and make a new interval for the subreg for use by the loop"? This would make one more register free in the loop.

I am not sure what is the real problem / best handling... Could it be that instead of coalescing the GR128 (creating the long GR128 interval with the subreg use), it would be better to keep the COPY and then give a regalloc hint for the subreg? Somehow it seems suboptimal to have this long GR128 live range regardless if it's inside a loop or not...

bin/llc -mtriple=s390x-linux-gnu -mcpu=z13 tc_badloopreg_11.ll -tail-dup-placement=false

JonPsson commented 7 years ago

reduced testcase bin/llc -mtriple=s390x-linux-gnu -mcpu=z13 -o out.s tc_badloopreg_10.ll

A single block loop (%vector.body) spills and reloads, yet %R2D seems unused in the block.

JonPsson commented 7 years ago

reduced testcase (this time the .ll file)

JonPsson commented 7 years ago

New test cases:

tc_badreload.ll:

Tiny loop with a reload from loop-invariant stack-slot:

.LBB0_1: # %for.end8280

=>This Inner Loop Header: Depth=1

    ar      %r4, %r5
    ahi     %r3, 1
    lt      %r0, 164(%r15)          # 4-byte Folded Reload
    lgfr    %r4, %r4
    jle     .LBB0_1

tc_badspill_reload.ll: A small inner loop with disjoint blocks: { BB0_6, BB0_3 } spills and reloads, even though %R1D, %R2D, %R5D, %R14D, %R13D, %R11D, %R10D, %R9D, %R8D, %R6D seem unused.

JonPsson commented 7 years ago

test case with regsplit patch applied (both spills and reloads)

JonPsson commented 7 years ago

test case with regsplit patch applied

JonPsson commented 7 years ago

reduced testcase with reload in loop Thanks for help on number 5. I have now fixed the callee saved regs list (only on my experimental branch so far) to include the GR128 and FP128 registers.

Disabling the hot-loops heuristic (to get maximal splitting), I now have ~200 cases left (was ~2500).

Next test case:

A simple increment (with no particular use in the reduced function) in the loop

%vreg4 = LA %vreg4, 1, %noreg; ADDR64Bit:%vreg4

becomes

AGSI <fi#0>, 0, 1, %CC<imp-def,dead>; mem:LDST8[FixedStack0]

which is a adding the constant to memory (fi#0).

R10D is unused in the loop, but live-through the entire loop, so it seems like R10D could have been split around the loop.

bin/llc -O3 -mtriple=s390x-linux-gnu -mcpu=z13 ./tc_reloadinloop_6.ll

JonPsson commented 7 years ago

reduced testcase Next test case with two reloads in loop:

.LBB0_4: # %cond.end171

=>This Inner Loop Header: Depth=1

    chi     %r5, 0
    lgr     %r11, %r0
    locghie %r11, 0
    lg      %r11, -8(%r4,%r11)
    lg      %r13, 200(%r15)         # 8-byte Folded Reload
    clg     %r11, -8(%r4,%r13)
    jle     .LBB0_22

BB#5: # %if.end178

                                    #   in Loop: Header=BB0_4 Depth=1
    clgijl  %r11, 0, .LBB0_20

BB#6: # %if.end178.1

                                    #   in Loop: Header=BB0_4 Depth=1
    lg      %r11, 0(%r4,%r2)
    lg      %r13, 200(%r15)         # 8-byte Folded Reload
    ag      %r11, -16(%r4,%r13)
    clgrjh  %r11, %r0, .LBB0_20

BB#7: # %for.cond150.1

                                    #   in Loop: Header=BB0_4 Depth=1
    aghi    %r14, -2
    ahi     %r5, 2
    aghi    %r4, -16
    aghi    %r3, -2
    cgijh   %r3, 2, .LBB0_4

Unused regs: %r1, %r12, %r10, %r9, %r8, %r7, %r6

JonPsson commented 7 years ago

Comment to last test case (3rd June):

.LBB0_4: # %for.body46

Parent Loop BB0_3 Depth=1

                                    # =>  This Inner Loop Header: Depth=2
    lg      %r0, 192(%r15)          # 16-byte Folded Reload
    lg      %r1, 200(%r15)          # 16-byte Folded Reload
    lgr     %r4, %r1
    stg     %r2, 176(%r15)          # 16-byte Folded Spill
    stg     %r3, 184(%r15)          # 16-byte Folded Spill
    brasl   %r14, memcpy@PLT
    lg      %r0, 192(%r15)          # 16-byte Folded Reload
    lg      %r1, 200(%r15)          # 16-byte Folded Reload
    lgr     %r4, %r1
    brasl   %r14, memcpy@PLT
    lg      %r0, 192(%r15)          # 16-byte Folded Reload
    lg      %r1, 200(%r15)          # 16-byte Folded Reload
    lgr     %r4, %r1
    brasl   %r14, memcpy@PLT
    lg      %r0, 192(%r15)          # 16-byte Folded Reload
    lg      %r1, 200(%r15)          # 16-byte Folded Reload
    lgr     %r4, %r1
    brasl   %r14, memcpy@PLT
    lg      %r2, 176(%r15)          # 16-byte Folded Reload
    lg      %r3, 184(%r15)          # 16-byte Folded Reload
    lg      %r0, 192(%r15)          # 16-byte Folded Reload
    lg      %r1, 200(%r15)          # 16-byte Folded Reload
    aghi    %r3, -4
    stg     %r1, 0(%r1)
    jne     .LBB0_4

call clobbered: %r0-%r5 free: %r13, %r12, %r11, %r10L, %r9, %r8, %r7, %r6

JonPsson commented 7 years ago

reduced testcase Thank you Wei. I am now down to 850. Attaching next test case for you.

llvmbot commented 7 years ago

regsplit around loop.v3

llvmbot commented 7 years ago

Another patch to handle all the three testcases. It is the logic to handle critical edges exiting loop which needs a small improvement.

JonPsson commented 7 years ago

Currently, on SPEC-2006/SystemZ, total number (regardless of loops) of spill /reload instructions go up with ~1%.

Total number of spill / reload instruction in innermost loops however go down with ~17% :-)

JonPsson commented 7 years ago

Thanks - with the second version some additional ~50 loops got handled, so there are still ~1000 loops to go...

llvmbot commented 7 years ago

regsplit around loop.v2

llvmbot commented 7 years ago

Thank you, Jonas.

A simple change for the patch can solve the first two testcases (patch attached. regsplit around loop.v2). The third one is more complex, and I need to take some time to look at it.

JonPsson commented 7 years ago

There is no reloaded hoisting in current regalloc

There is a MachineLICM pass scheduled after regalloc I always assumed it exists to hoist reloads/rematerialized values (didn't check though). It does not appear to be handling reloads intelligently. In my simple example, it would have to understand that there are multiple reloads from the same (loop-constant) stack slot, so if there was a free reg hoist all reloads and replace them with just one using a free reg.

That might work, however it seems that the case where the stack slot is stored to also should also be handled, so my guess is that MachineLICM is not the right place to implement this.

JonPsson commented 7 years ago

There is no reloaded hoisting in current regalloc, because when you hoist the reload, you increase the live range of the temporary var and it raises a new allocation request. It doesn't fit current framework well. To prevent a vreg from being reloaded inside a loop, it is usually achieved by reg splitting. I think the major problem here is why splitting doesn't happen at the proper places.

This was exactly my point - since it would be trivial to find a free preg in a loop after regalloc and then split the live-through (spilled) interval with a reload + COPY to free reg before loop, and if loop also writes to the interval, a store to stack slot after the loop. This would not introduce more complexity to or affect regalloc.

But of course, it might be even better if reg-alloc could do that...

Thanks for trying it out. I tried it on your reduced testcase and it generated %vector.body loop without any spill/reload. Could you provide a testcase from the remaining half of testcase set you mentioned? I am interested to see whether I can improve the patch to handle the remaining half it cannot handle right now.

Sure, I reduced three more test cases for you.

JonPsson commented 7 years ago

test case with regsplit patch applied (both spills and reloads) .LBB0_22: # %for.body383.i

Parent Loop BB0_13 Depth=1

                                    # =>  This Inner Loop Header: Depth=2
    sllg    %r2, %r6, 0(%r10)
    ngr     %r2, %r0
    je      .LBB0_22

BB#23: # %land.lhs.true394.i

                                    #   in Loop: Header=BB0_22 Depth=2
    c       %r0, 0(%r1)
    jh      .LBB0_22

BB#24: # %land.lhs.true403.i

                                    #   in Loop: Header=BB0_22 Depth=2
    l       %r2, 0(%r1)
    clijh   %r0, 52, .LBB0_27

BB#25: # %land.lhs.true403.i

                                    #   in Loop: Header=BB0_22 Depth=2
    sllg    %r3, %r6, 0(%r14)
    ngr     %r3, %r7
    je      .LBB0_27

BB#26: # %cond.true427.i

                                    #   in Loop: Header=BB0_22 Depth=2
    clijl   %r0, 2, .LBB0_22
    j       .LBB0_1

.LBB0_27: # %cond.false445.i

in Loop: Header=BB0_22 Depth=2

    srl     %r2, 8
    llcr    %r2, %r2
    cije    %r2, 18, .LBB0_22

BB#28: # %cond.false445.i

                                    #   in Loop: Header=BB0_22 Depth=2
    cije    %r2, 24, .LBB0_22

BB#29: # %cond.end485.i

                                    #   in Loop: Header=BB0_22 Depth=2
    lg      %r2, 160(%r15)          # 16-byte Folded Reload
    lg      %r3, 168(%r15)          # 16-byte Folded Reload
    llc     %r3, 0(%r1)
    stg     %r2, 160(%r15)          # 16-byte Folded Spill
    stg     %r3, 168(%r15)          # 16-byte Folded Spill
    lg      %r4, 160(%r15)          # 16-byte Folded Reload
    lg      %r5, 168(%r15)          # 16-byte Folded Reload
    lhi     %r2, -1
    ar      %r5, %r2
    stg     %r4, 160(%r15)          # 16-byte Folded Spill
    stg     %r5, 168(%r15)          # 16-byte Folded Spill
    lg      %r2, 160(%r15)          # 16-byte Folded Reload
    lg      %r3, 168(%r15)          # 16-byte Folded Reload
    dlr     %r2, %r0
    cijlh   %r3, 1, .LBB0_22

%r8, %r9, %r11, %r12, %r13 are unused

JonPsson commented 7 years ago

test case with regsplit patch applied

.LBB0_16: # %while.body.us152

=>This Inner Loop Header: Depth=1

    llgc    %r12, 1(%r10,%r2)
    lgr     %r9, %r11
    rosbg   %r9, %r12, 58, 63, 0
    clgrjle %r9, %r1, .LBB0_4

BB#17: # %while.cond.backedge.us165

                                    #   in Loop: Header=BB0_16 Depth=1
    cgrje   %r13, %r10, .LBB0_4

BB#18: # %while.body.us152.1

                                    #   in Loop: Header=BB0_16 Depth=1
    cgije   %r9, -1, .LBB0_4

BB#19: # %while.cond.backedge.us165.1

                                    #   in Loop: Header=BB0_16 Depth=1
    cgrje   %r14, %r10, .LBB0_4

BB#20: # %while.body.us152.2

                                    #   in Loop: Header=BB0_16 Depth=1
    lghi    %r12, -1
    clgrjle %r7, %r12, .LBB0_4

BB#21: # %while.cond.backedge.us165.2

                                    #   in Loop: Header=BB0_16 Depth=1
    lg      %r12, 176(%r15)         # 8-byte Folded Reload
    cgrje   %r12, %r10, .LBB0_4

BB#22: # %while.cond.backedge.us165.3

                                    #   in Loop: Header=BB0_16 Depth=1
    lg      %r12, 168(%r15)         # 8-byte Folded Reload
    la      %r10, 4(%r10)
    aghi    %r8, -4
    aghi    %r3, -4
    la      %r5, 4(%r5)
    cgrjlh  %r12, %r10, .LBB0_16

%r0, %r4 and %r6 are unused in loop.

JonPsson commented 7 years ago

test case with regsplit patch applied Inner loop:

    l       %r0, 180(%r15)          # 4-byte Folded Reload
    ark     %r0, %r0, %r2

%r4 is unused in inner loop.

MatzeB commented 7 years ago

There is no reloaded hoisting in current regalloc

There is a MachineLICM pass scheduled after regalloc I always assumed it exists to hoist reloads/rematerialized values (didn't check though).

llvmbot commented 7 years ago

Did that patch mainly handle cases where there are not sufficient regs in the loop itself?

It doesn't care whether the loop has sufficient regs or not. It tries to treat hot loop as a separate region, split all the live ranges across the loop so that the loop can have the minimal spill.

I think the problem I am seeing is mostly when an interval is spilled, and then > ends up being reloaded inside the loop (and not modified). In other words, it has not so much to do with reg pressure inside the loop...

Would it be an option to do a cleanup after regalloc to simply 1. Scavenge any > free physregs in loop. 2. Replace any such reloads with a single reload before > the loop into a free physreg? Perhaps the spill pass should do that?

There is no reloaded hoisting in current regalloc, because when you hoist the reload, you increase the live range of the temporary var and it raises a new allocation request. It doesn't fit current framework well. To prevent a vreg from being reloaded inside a loop, it is usually achieved by reg splitting. I think the major problem here is why splitting doesn't happen at the proper places.

I applied the patch, and found that about half of the cases were gone. Of the remaining half, it seems still to be mainly the case where a big LiveInterval (live-through in the loop) gets spilled and just reloaded inside the loop.

Thanks for trying it out. I tried it on your reduced testcase and it generated %vector.body loop without any spill/reload. Could you provide a testcase from the remaining half of testcase set you mentioned? I am interested to see whether I can improve the patch to handle the remaining half it cannot handle right now.

JonPsson commented 7 years ago

I applied the patch, and found that about half of the cases were gone. Of the remaining half, it seems still to be mainly the case where a big LiveInterval (live-through in the loop) gets spilled and just reloaded inside the loop.

JonPsson commented 7 years ago

Thanks Wei, very interesting to hear that you have the same problem...

Did that patch mainly handle cases where there are not sufficient regs in the loop itself?

I think the problem I am seeing is mostly when an interval is spilled, and then ends up being reloaded inside the loop (and not modified). In other words, it has not so much to do with reg pressure inside the loop...

Would it be an option to do a cleanup after regalloc to simply 1. Scavenge any free physregs in loop. 2. Replace any such reloads with a single reload before the loop into a free physreg? Perhaps the spill pass should do that?

/Jonas

llvmbot commented 7 years ago

We saw some cases that live ranges are not split around loop causing suboptimal register allocation because some weakness in current region based split. I am not sure whether the root cause here is the same problem because I am not familiar with systemz.

We had a patch to let live range split around loops and it brought us some benefit on x86. However, that patch was not committed and I havn't visited it for a while. We still think the patch is useful. It will be great if you can try it out and see if it can provide some help.

The orginal patch: https://reviews.llvm.org/D17235

The patch rebased is attached.

llvmbot commented 7 years ago

regsplit around loop

qcolombet commented 7 years ago

This should be taken care in RAGreedy::splitAroundRegion.

JonPsson commented 7 years ago

I see that on spec on SystemZ, currently this happens in a surprising number of loops - looking at just the cases of the GRs, this seems to happen in ~7% of the loops (!) (which means in ~4000 out of ~57000).

Half of these loops contain calls. I wonder if this makes any difference? The calls have the regmask operands, which says which registers it clobbers, so I don't see how that would change anything..? (How about stack slots? Is the stack area of the callee safe from the caller? If not, this would explain why some reloads of an otherwise unaltered stack slots do not get hoisted.)

Which part of regalloc should normally handle this?

Any help appreciated...

qcolombet commented 7 years ago

I am thinking that RA should be able to split live ranges around the loop?

Yes, it should. That's strange.