Open JonPsson opened 7 years ago
Now down to ~ 20 cases left.
bin/llc -mtriple=s390x-linux-gnu -mcpu=z13 tc_badloopreg_18.ll -tail-dup-placement=false
Two reloads in loop, even though R0D, R1D and R6D are available.
reduced testcase
We are now down to ~50 known cases on SystemZ / SPEC.
The next test case spills three times in the loop while many free regs are available.
Loop of .LBB0_7: 3 spills, yet many regs unused: %R1D, %R2D, %R3D, %R4D, %R5D, %R14D, %R12D, %R11D, %R10D, %R9D, %R8D.
/bin/llc -mtriple=s390x-linux-gnu -mcpu=z13 tc_badloopreg_17.ll -tail-dup-placement=false
Note this may depend on a patch to LiveInterval.cpp:
--- lib/CodeGen/LiveInterval.cpp (revision 316856)
+++ lib/CodeGen/LiveInterval.cpp (working copy)
@@ -468,8 +468,9 @@ bool LiveRange::overlaps(const LiveRange
    // I and J are overlapping. Find the later start.
    SlotIndex Def = std::max(I->start, J->start);
    // Allow the overlap if Def is a coalescable copy.
reduced testcase
Found a case of an inner loop (depth 2) which is reloading a function argument into %r1. This is unnecessary, since %r6-%r8 are unused:
.LBB1_6: # %while.body40
# => This Inner Loop Header: Depth=2
lg %r1, 160(%r15) # 8-byte Folded Reload
lgr %r2, %r12
lgr %r3, %r10
lgr %r4, %r11
basr %r14, %r1
la %r10, 0(%r9,%r10)
sgr %r13, %r9
clgrjhe %r13, %r9, .LBB1_6
j .LBB1_1
bin/llc -mtriple=s390x-linux-gnu -mcpu=z13 -o out.s ./tc_badloopreg_12.ll -tail-dup-placement=false
reduced testcase which crashes
I found that tc_badloopreg_11.ll (the latest test case) was indeed remedied by enabling subreg liveness.
I then wanted to see how much of an effect that would have on the remaining 317 loops I see with unnecessary spilling / reloading inside loops.
Unfortunately, I then ran into an assert with your patch. To reproduce, first apply enableSubRegLiveness.patch (make subRegLivenessEnabled() in MRI always return true),
then run
bin/llc -mtriple=s390x-linux-gnu -mcpu=z13 tc_regsplit_crash.ll
llc: /home/ijonpan/llvm/llvm-dev-2/include/llvm/ADT/IntervalMap.h:635: unsigned int llvm::IntervalMapImpl::LeafNode<KeyT, ValT, N, Traits>::insertFrom(unsigned int&, unsigned int, KeyT, KeyT, ValT) [with KeyT = llvm::SlotIndex; ValT = llvm::LiveInterval*; unsigned int N = 8u; Traits = llvm::IntervalMapInfo
...
reduced testcase with reload in loop
New test case with an unnecessary reload inside a loop.
%vreg6 is the function argument, which is stored (unmodified) in the loop in each iteration. Regalloc decides to spill it and then reloads it every time. Yet %R12D is unused in the loop.
I think this is due to the fact that %R12Q (%R12D + %R13D) is defined by a division (which is always done with a 128-bit register), and then only %R13D is used in the loop. Would it somehow be reasonable to fix this so that "if only one subreg is used inside the loop, then split the wide interval around the loop and make a new interval for the subreg for use inside the loop"? That would make one more register free in the loop.
I am not sure what the real problem / best handling is... Could it be that, instead of coalescing the GR128 (creating the long GR128 interval with the subreg use), it would be better to keep the COPY and then give a regalloc hint for the subreg? Somehow it seems suboptimal to have this long GR128 live range, regardless of whether it's inside a loop or not...
bin/llc -mtriple=s390x-linux-gnu -mcpu=z13 tc_badloopreg_11.ll -tail-dup-placement=false
reduced testcase
bin/llc -mtriple=s390x-linux-gnu -mcpu=z13 -o out.s tc_badloopreg_10.ll
A single block loop (%vector.body) spills and reloads, yet %R2D seems unused in the block.
New test cases:
tc_badreload.ll:
Tiny loop with a reload from loop-invariant stack-slot:
.LBB0_1: # %for.end8280
ar %r4, %r5
ahi %r3, 1
lt %r0, 164(%r15) # 4-byte Folded Reload
lgfr %r4, %r4
jle .LBB0_1
tc_badspill_reload.ll: A small inner loop with disjoint blocks: { BB0_6, BB0_3 } spills and reloads, even though %R1D, %R2D, %R5D, %R14D, %R13D, %R11D, %R10D, %R9D, %R8D, %R6D seem unused.
reduced testcase with reload in loop
Thanks for help on number 5. I have now fixed the callee-saved regs list (only on my experimental branch so far) to include the GR128 and FP128 registers.
Disabling the hot-loops heuristic (to get maximal splitting), I now have ~200 cases left (was ~2500).
Next test case:
A simple increment of %vreg4 in the loop (with no particular use in the reduced function) becomes
AGSI <fi#0>, 0, 1, %CC<imp-def,dead>; mem:LDST8[FixedStack0]
i.e. the constant is added directly to memory (fi#0).
R10D is unused in the loop, but live-through the entire loop, so it seems like R10D could have been split around the loop.
bin/llc -O3 -mtriple=s390x-linux-gnu -mcpu=z13 ./tc_reloadinloop_6.ll
reduced testcase
Next test case with two reloads in loop:
.LBB0_4: # %cond.end171
chi %r5, 0
lgr %r11, %r0
locghie %r11, 0
lg %r11, -8(%r4,%r11)
lg %r13, 200(%r15) # 8-byte Folded Reload
clg %r11, -8(%r4,%r13)
jle .LBB0_22
# in Loop: Header=BB0_4 Depth=1
clgijl %r11, 0, .LBB0_20
# in Loop: Header=BB0_4 Depth=1
lg %r11, 0(%r4,%r2)
lg %r13, 200(%r15) # 8-byte Folded Reload
ag %r11, -16(%r4,%r13)
clgrjh %r11, %r0, .LBB0_20
# in Loop: Header=BB0_4 Depth=1
aghi %r14, -2
ahi %r5, 2
aghi %r4, -16
aghi %r3, -2
cgijh %r3, 2, .LBB0_4
Unused regs: %r1, %r12, %r10, %r9, %r8, %r7, %r6
Comment on the last test case (3rd June):
.LBB0_4: # %for.body46
# => This Inner Loop Header: Depth=2
lg %r0, 192(%r15) # 16-byte Folded Reload
lg %r1, 200(%r15) # 16-byte Folded Reload
lgr %r4, %r1
stg %r2, 176(%r15) # 16-byte Folded Spill
stg %r3, 184(%r15) # 16-byte Folded Spill
brasl %r14, memcpy@PLT
lg %r0, 192(%r15) # 16-byte Folded Reload
lg %r1, 200(%r15) # 16-byte Folded Reload
lgr %r4, %r1
brasl %r14, memcpy@PLT
lg %r0, 192(%r15) # 16-byte Folded Reload
lg %r1, 200(%r15) # 16-byte Folded Reload
lgr %r4, %r1
brasl %r14, memcpy@PLT
lg %r0, 192(%r15) # 16-byte Folded Reload
lg %r1, 200(%r15) # 16-byte Folded Reload
lgr %r4, %r1
brasl %r14, memcpy@PLT
lg %r2, 176(%r15) # 16-byte Folded Reload
lg %r3, 184(%r15) # 16-byte Folded Reload
lg %r0, 192(%r15) # 16-byte Folded Reload
lg %r1, 200(%r15) # 16-byte Folded Reload
aghi %r3, -4
stg %r1, 0(%r1)
jne .LBB0_4
Call-clobbered: %r0-%r5; free: %r13, %r12, %r11, %r10, %r9, %r8, %r7, %r6.
reduced testcase
Thank you Wei. I am now down to 850. Attaching the next test case for you.
Another patch to handle all three test cases. It is the logic for handling critical edges exiting the loop that needs a small improvement.
Currently, on SPEC 2006 / SystemZ, the total number of spill/reload instructions (regardless of loops) goes up by ~1%.
The total number of spill/reload instructions in innermost loops, however, goes down by ~17% :-)
Thanks - with the second version an additional ~50 loops got handled, so there are still ~1000 loops to go...
Thank you, Jonas.
A simple change to the patch can solve the first two test cases (patch attached: regsplit around loop.v2). The third one is more complex, and I need to take some time to look at it.
There is no reloaded hoisting in current regalloc
There is a MachineLICM pass scheduled after regalloc. I always assumed it exists to hoist reloads/rematerialized values (didn't check, though). It does not appear to handle reloads intelligently: in my simple example, it would have to understand that there are multiple reloads from the same (loop-constant) stack slot, so that if there were a free reg it could hoist them all and replace them with a single reload into that free reg.
That might work; however, it seems that the case where the stack slot is also stored to should be handled as well, so my guess is that MachineLICM is not the right place to implement this.
There is no reloaded hoisting in current regalloc, because when you hoist the reload, you increase the live range of the temporary var and it raises a new allocation request. It doesn't fit current framework well. To prevent a vreg from being reloaded inside a loop, it is usually achieved by reg splitting. I think the major problem here is why splitting doesn't happen at the proper places.
This was exactly my point: it would be trivial to find a free physreg in a loop after regalloc and then split the live-through (spilled) interval, with a reload + COPY into the free reg before the loop and, if the loop also writes to the interval, a store back to the stack slot after the loop. This would not add complexity to, or otherwise affect, regalloc.
But of course, it might be even better if reg-alloc could do that...
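To make the idea concrete, below is a minimal C++ sketch of such a cleanup (written against a 2017-era tree). It is not an existing LLVM pass: the helper name and all its restrictions (single-block loop with a preheader, a single spill slot, reloads only with no store-back, an exact-register live-in check with no alias handling, running while frame indices still exist, i.e. before prolog/epilog insertion, and ignoring callee-saved registers the function has not already saved) are my own simplifying assumptions.

#include "llvm/ADT/SmallSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineLoopInfo.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/Target/TargetInstrInfo.h"
#include "llvm/Target/TargetRegisterInfo.h"

using namespace llvm;

// Hoist reloads of a loop-invariant spill slot out of a single-block loop
// into a physreg that the loop neither touches nor carries live-through.
static bool hoistInvariantReloads(MachineLoop &L, const TargetInstrInfo &TII,
                                  const TargetRegisterInfo &TRI,
                                  const MachineRegisterInfo &MRI) {
  MachineBasicBlock *Preheader = L.getLoopPreheader();
  if (L.getNumBlocks() != 1 || !Preheader)
    return false;
  MachineBasicBlock *Body = L.getHeader();

  // Collect all reloads in the loop body and note which slots are stored to.
  struct Reload { MachineInstr *MI; unsigned DstReg; int FI; };
  SmallVector<Reload, 8> Reloads;
  SmallSet<int, 4> StoredFIs;
  for (MachineInstr &MI : *Body) {
    int FI;
    if (unsigned Dst = TII.isLoadFromStackSlot(MI, FI))
      Reloads.push_back({&MI, Dst, FI});
    else if (TII.isStoreToStackSlot(MI, FI))
      StoredFIs.insert(FI);
  }
  if (Reloads.empty())
    return false;

  // Only handle the slot of the first reload, and only if the loop never
  // writes it (i.e. its contents are loop-invariant).
  int SlotFI = Reloads.front().FI;
  if (StoredFIs.count(SlotFI))
    return false;
  const TargetRegisterClass *RC =
      TRI.getMinimalPhysRegClass(Reloads.front().DstReg);
  for (const Reload &R : Reloads)
    if (R.FI == SlotFI && !RC->contains(R.DstReg))
      return false; // Mixed-size reloads of the slot; keep the sketch simple.

  // Find a physreg of that class which the loop neither reads nor writes
  // (modifiesRegister also catches call regmask clobbers) and which is not
  // live into the loop.
  unsigned FreeReg = 0;
  for (unsigned Reg : *RC) {
    if (!MRI.isAllocatable(Reg) || Body->isLiveIn(Reg))
      continue;
    bool Touched = false;
    for (MachineInstr &MI : *Body)
      if (MI.readsRegister(Reg, &TRI) || MI.modifiesRegister(Reg, &TRI)) {
        Touched = true;
        break;
      }
    if (!Touched) {
      FreeReg = Reg;
      break;
    }
  }
  if (!FreeReg)
    return false;

  // Reload once in the preheader, then turn each in-loop reload of the slot
  // into a plain register copy from the hoisted value.
  TII.loadRegFromStackSlot(*Preheader, Preheader->getFirstTerminator(),
                           FreeReg, SlotFI, RC, &TRI);
  Body->addLiveIn(FreeReg);
  for (const Reload &R : Reloads) {
    if (R.FI != SlotFI)
      continue;
    TII.copyPhysReg(*Body, MachineBasicBlock::iterator(R.MI),
                    R.MI->getDebugLoc(), R.DstReg, FreeReg, /*KillSrc=*/false);
    R.MI->eraseFromParent();
  }
  return true;
}

Whether something like this belongs in the spiller, in the post-RA MachineLICM, or in regalloc itself is exactly the open question in this thread.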
Thanks for trying it out. I tried it on your reduced testcase and it generated the %vector.body loop without any spill/reload. Could you provide a testcase from the remaining half of the testcase set you mentioned? I am interested to see whether I can improve the patch to handle the remaining half that it cannot handle right now.
Sure, I reduced three more test cases for you.
test case with regsplit patch applied (both spills and reloads)
.LBB0_22: # %for.body383.i
# => This Inner Loop Header: Depth=2
sllg %r2, %r6, 0(%r10)
ngr %r2, %r0
je .LBB0_22
# in Loop: Header=BB0_22 Depth=2
c %r0, 0(%r1)
jh .LBB0_22
# in Loop: Header=BB0_22 Depth=2
l %r2, 0(%r1)
clijh %r0, 52, .LBB0_27
# in Loop: Header=BB0_22 Depth=2
sllg %r3, %r6, 0(%r14)
ngr %r3, %r7
je .LBB0_27
# in Loop: Header=BB0_22 Depth=2
clijl %r0, 2, .LBB0_22
j .LBB0_1
.LBB0_27: # %cond.false445.i
srl %r2, 8
llcr %r2, %r2
cije %r2, 18, .LBB0_22
# in Loop: Header=BB0_22 Depth=2
cije %r2, 24, .LBB0_22
# in Loop: Header=BB0_22 Depth=2
lg %r2, 160(%r15) # 16-byte Folded Reload
lg %r3, 168(%r15) # 16-byte Folded Reload
llc %r3, 0(%r1)
stg %r2, 160(%r15) # 16-byte Folded Spill
stg %r3, 168(%r15) # 16-byte Folded Spill
lg %r4, 160(%r15) # 16-byte Folded Reload
lg %r5, 168(%r15) # 16-byte Folded Reload
lhi %r2, -1
ar %r5, %r2
stg %r4, 160(%r15) # 16-byte Folded Spill
stg %r5, 168(%r15) # 16-byte Folded Spill
lg %r2, 160(%r15) # 16-byte Folded Reload
lg %r3, 168(%r15) # 16-byte Folded Reload
dlr %r2, %r0
cijlh %r3, 1, .LBB0_22
%r8, %r9, %r11, %r12, %r13 are unused
test case with regsplit patch applied
.LBB0_16: # %while.body.us152
llgc %r12, 1(%r10,%r2)
lgr %r9, %r11
rosbg %r9, %r12, 58, 63, 0
clgrjle %r9, %r1, .LBB0_4
# in Loop: Header=BB0_16 Depth=1
cgrje %r13, %r10, .LBB0_4
# in Loop: Header=BB0_16 Depth=1
cgije %r9, -1, .LBB0_4
# in Loop: Header=BB0_16 Depth=1
cgrje %r14, %r10, .LBB0_4
# in Loop: Header=BB0_16 Depth=1
lghi %r12, -1
clgrjle %r7, %r12, .LBB0_4
# in Loop: Header=BB0_16 Depth=1
lg %r12, 176(%r15) # 8-byte Folded Reload
cgrje %r12, %r10, .LBB0_4
# in Loop: Header=BB0_16 Depth=1
lg %r12, 168(%r15) # 8-byte Folded Reload
la %r10, 4(%r10)
aghi %r8, -4
aghi %r3, -4
la %r5, 4(%r5)
cgrjlh %r12, %r10, .LBB0_16
%r0, %r4 and %r6 are unused in loop.
test case with regsplit patch applied
Inner loop:
l %r0, 180(%r15) # 4-byte Folded Reload
ark %r0, %r0, %r2
%r4 is unused in inner loop.
There is no reloaded hoisting in current regalloc
There is a MachineLICM pass scheduled after regalloc. I always assumed it exists to hoist reloads/rematerialized values (didn't check, though).
Did that patch mainly handle cases where there are not sufficient regs in the loop itself?
It doesn't care whether the loop has sufficient regs or not. It tries to treat a hot loop as a separate region and split all the live ranges across the loop so that the loop has minimal spill.
I think the problem I am seeing is mostly when an interval is spilled, and then ends up being reloaded inside the loop (and not modified). In other words, it does not have so much to do with reg pressure inside the loop...
Would it be an option to do a cleanup after regalloc that simply (1) scavenges any free physregs in the loop and (2) replaces any such reloads with a single reload before the loop into a free physreg? Perhaps the spill pass should do that?
There is no reloaded hoisting in current regalloc, because when you hoist the reload, you increase the live range of the temporary var and it raises a new allocation request. It doesn't fit current framework well. To prevent a vreg from being reloaded inside a loop, it is usually achieved by reg splitting. I think the major problem here is why splitting doesn't happen at the proper places.
I applied the patch, and found that about half of the cases were gone. Of the remaining half, it seems still to be mainly the case where a big LiveInterval (live-through in the loop) gets spilled and just reloaded inside the loop.
Thanks for trying it out. I tried it on your reduced testcase and it generated the %vector.body loop without any spill/reload. Could you provide a testcase from the remaining half of the testcase set you mentioned? I am interested to see whether I can improve the patch to handle the remaining half that it cannot handle right now.
I applied the patch, and found that about half of the cases were gone. Of the remaining half, it seems still to be mainly the case where a big LiveInterval (live-through in the loop) gets spilled and just reloaded inside the loop.
Thanks Wei, very interesting to hear that you have the same problem...
Did that patch mainly handle cases where there are not sufficient regs in the loop itself?
I think the problem I am seeing is mostly when an interval is spilled, and then ends up being reloaded inside the loop (and not modified). In other words, it does not have so much to do with reg pressure inside the loop...
Would it be an option to do a cleanup after regalloc that simply (1) scavenges any free physregs in the loop and (2) replaces any such reloads with a single reload before the loop into a free physreg? Perhaps the spill pass should do that?
/Jonas
We saw some cases where live ranges are not split around a loop, causing suboptimal register allocation, due to some weakness in the current region-based split. I am not sure whether the root cause here is the same problem, because I am not familiar with SystemZ.
We had a patch to let live ranges be split around loops, and it brought us some benefit on x86. However, that patch was not committed and I haven't revisited it for a while. We still think the patch is useful. It would be great if you could try it out and see if it provides some help.
The original patch: https://reviews.llvm.org/D17235
The rebased patch is attached.
This should be taken care of in RAGreedy::splitAroundRegion.
I see that on SPEC on SystemZ, this currently happens in a surprising number of loops: looking just at the GR cases, it seems to happen in ~7% of the loops (!), which means in ~4000 out of ~57000.
Half of these loops contain calls. I wonder if this makes any difference? The calls have regmask operands which say which registers they clobber (see the small sketch below), so I don't see how that would change anything...? (How about stack slots? Is the caller's stack area safe from the callee? If not, this would explain why some reloads of otherwise unaltered stack slots do not get hoisted.)
Which part of regalloc should normally handle this?
Any help appreciated...
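Regarding the regmask operands on calls mentioned above, this is roughly how any pass can ask whether a particular call preserves a given physreg (a small sketch; the helper name is mine, not an LLVM API):

#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineOperand.h"

using namespace llvm;

// Returns true if the call instruction's regmask operand says PhysReg is
// clobbered (i.e. not preserved across the call).
static bool callClobbersPhysReg(const MachineInstr &Call, unsigned PhysReg) {
  for (const MachineOperand &MO : Call.operands())
    if (MO.isRegMask() && MO.clobbersPhysReg(PhysReg))
      return true;
  return false;
}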
I am thinking that RA should be able to split live ranges around the loop?
Yes, it should. That's strange.
Extended Description
I noticed that there are reloads of a register which is used as an address inside a vectorized loop.
The loop is innermost, is a single block, and there are 10 unused registers in the loop.
I am thinking that RA should be able to split live ranges around the loop?
bin/llc -O3 -mtriple=s390x-linux-gnu -mcpu=z13 -o out.s vecloopspills.ll
.LBB0_18: # %vector.body
# =>This Inner Loop Header: Depth=1