llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.84k stars 11.47k forks source link

[RegAllocGreedy] Folding memory operand alone gives many more uncoalesced copies #36760

Open JonPsson opened 6 years ago

JonPsson commented 6 years ago
Bugzilla Link 37412
Version trunk
OS Linux
Attachments patch: ASI enabled for AHIMux + the test case, side-to-side diff of debug dumps (unpatched <> patched)
CC @JonPsson,@qcolombet,@uweigand

Extended Description

Recently it was discovered that an opcode (AHIMux) was not specifically handled the same way as AHI is in SystemZInstrInfo::foldMemoryOperandImpl(), which it should be since AHIMux (addition of immediate) also can be replaced with ASI (add immediate to memory), in case the register operand is spilled. AHIMux is sort of a sibling opcode to AHI that can use either high or low 32-bit registers.

This was a simple to fix by just adding a check for that opcode in this method. However, when doing so I found that even though my test case now had a few ASIs instead of reload-add-save (which was nice), there were now +20 more COPYs that were not colesced! That was quite considerable, given the quite small test case.

This happened with the very test case that was made for the ASI, which was intended to use enough registers to cause the spilling, so that ASI would be used. Basically it has 32 volatile loads with a conditional branch past a block that adds an immediate to the loaded values, and in the final block the resulting value is stored:

entry: %val0 = load volatile i32 , i32 %ptr ... %val31 = load volatile i32 , i32 %ptr %test = icmp ne i32 %sel, 0 br i1 %test, label %add, label %store

add: %add0 = add i32 %val0, 127 ... %add31 = add i32 %val31, 127 br label %store

store: %new0 = phi i32 [ %val0, %entry ], [ %add0, %add ] ... %new31 = phi i32 [ %val31, %entry ], [ %add31, %add ] store volatile i32 %new0, i32 %ptr ... store volatile i32 %new31, i32 %ptr ret void

After "Simple Register Coalescing", the MachineFunction (note: no COPYs) looks pretty much like

bb.0.entry: %98:grx32bit = LMux %96:addr64bit, 0, $noreg :: (volatile load 4 from %ir.ptr) ... %129:grx32bit = LMux %96:addr64bit, 0, $noreg :: (volatile load 4 from %ir.ptr) CHIMux %97:gr32bit, 0, implicit-def $cc BRC 14, 6, %bb.1, implicit killed $cc J %bb.2

bb.1: %98:grx32bit = AHIMux %98:grx32bit, 127, implicit-def dead $cc ... %129:grx32bit = AHIMux %129:grx32bit, 127, implicit-def dead $cc

bb.2: STMux %98:grx32bit, %96:addr64bit, 0, $noreg :: (volatile store 4 into %ir.ptr) ... STMux %129:grx32bit, %96:addr64bit, 0, $noreg :: (volatile store 4 into %ir.ptr) Return

Greedy will now run out of registers and split live ranges and insert COPYs like (without ASI patch):

bb.0.entry: %132:grx32bit = LMux %96:addr64bit, 0, $noreg :: (volatile load 4 from %ir.ptr) ... %221:grx32bit = LMux %96:addr64bit, 0, $noreg :: (volatile load 4 from %ir.pt CHIMux %97:gr32bit, 0, implicit-def $cc BRC 14, 6, %bb.1, implicit killed $cc J %bb.1

bb.3: %137:grx32bit = COPY %136:grx32bit ... (23 COPYs) %203:grx32bit = COPY %202:grx32bit J %bb.2

bb.1: %137:grx32bit = COPY %136:grx32bi ... %203:grx32bit = COPY %202:grx32bit %137:grx32bit = AHIMux %137:grx32bit, 127, implicit-def dead $cc ... %222:grx32bit = AHIMux %222:grx32bit, 127, implicit-def dead $cc

Note that the COPYs in bb.1 here lies before the AHIMuxes. For some reason, when ASIs are used in bb.1, the COPYs now instead are inserted after the AHIMuxes:

bb.1.add: ASI %stack.0, 0, 127, implicit-def dead $cc :: (store 4 into %stack.0) %135:grx32bit = AHIMux %135:grx32bit, 127, implicit-def dead $cc ... (also 3 more ASIs) %216:grx32bit = AHIMux %216:grx32bit, 127, implicit-def dead $cc %220:grx32bit = COPY %219:grx32bit ... %136:grx32bit = COPY %135:grx32bit

Unfortunately, after "Virtual Register Rewriter"all those COPYs are still there, while without the ASIs they are gone!

I find this odd and feel I would like some advice on what to do with this. Is there a reason that the COPYs are now inserted in a different position, or is this just a bug?

Attached is a side-to-side diff with '-print-after-all -debug-only=regalloc' (master to the left, ASIs to the right)

qcolombet commented 4 years ago

Currently splitCanCauseEvictionChain identifies 2 very specific scenarios I've encountered, the scenario you're seeing here is probably not one of them.

I think this is exactly what is happening. I think Marina's heuristic detects thing with copy that goes back and forth: dst1 = src1 dst2 = src2 dst3 = src3 ... src3 = dst3 src2 = dst2 src1 = dst1

Whereas in that case, the copies are set to join live ranges:

then: dst1 = src1 dst2 = src2 dst3 = src3 br end

else: dst1 = src1 dst2 = src2 dst3 = src3 br end

end: use of dst1, dst2, dst3

At this point, I would need to spend some time getting into Marina's heuristic to see how we can fix this. I unfortunately don't have the time for that now.

Marina, if you can take a look that would be great.

qcolombet commented 4 years ago

TL;DR Didn't find anything useful yet!

Quick update here. The fact that we get different allocation is expected given the folding creates different live-intervals and thus, changes the order in which things are processed.

Looking at the resulting code, we indeed seem to hit a case of eviction chain but I haven't dug enough to see what we should do to have the mechanism that Marina added to kick in.

At first glance, it seems the live-intervals don't have the shape that is considered to create eviction chain. In particular, it seems the candidates for split are never both live-in and live-out of a bundle.

JonPsson commented 4 years ago

rebased patch with test case

I'll try to take a look at this one this week or next week.

That would be great! I rebased the patch again (needed due to recent changes).

qcolombet commented 4 years ago

I'll try to take a look at this one this week or next week.

JonPsson commented 4 years ago

ping

It would still be nice if someone with regalloc greedy know-how could address this strange regression. I see on SystemZ/SPEC'17 that this patch does have a value if applied:

asi : 13226 13686 +460 ahi : 97161 96712 -449 l : 229532 229086 -446 st : 177344 176900 -444

JonPsson commented 4 years ago

rebased patch with test case This is still problematic with many extra COPY:s after regalloc...

JonPsson commented 5 years ago

Updated patch for AHIMuxK instruction Thanks for taking a look, Quentin. Please see the updated patch that now also handles the AHIMuxK (three-address) instruction, which became the default recently.

qcolombet commented 5 years ago

Hi all,

I finally had a look, but I am puzzled as I am not sure what to look at!

The compiler with and without the attached patch produces the exact same assembly for test/CodeGen/SystemZ/fold-memory-op-impl-02.ll and none of them end up producing asi.

Failing Tests (1):
    LLVM :: CodeGen/SystemZ/fold-memory-op-impl-02.ll

  Unexpected Failures: 1

Does the patch need to be rebased?

qcolombet commented 5 years ago

It'll at to wait one or two more weeks :(

qcolombet commented 5 years ago

I might be able to look at it next week, unless Marina beats me at it.

JonPsson commented 5 years ago

ping! The same test case still shows the same problem...

JonPsson commented 5 years ago

ping. Anyone has the time for this?

JonPsson commented 6 years ago

Jonas, can you try https://reviews.llvm.org/D50064 to see whether it helps with your case? The patch is not reviewed yet, but it can mitigate some of the bad eviction decisions that were fixed before in RA e.g:

renamable $rbx = COPY renamable killed $rax
renamable $rcx = COPY renamable killed $rbx
renamable $rax = COPY renamable killed $rcx
NOOP implicit $rax

=> NOOP implicit $rax

Hi Alexander, thanks for suggesting this, but this unfortunately did not seem to help my test case (no improvement). /Jonas

Ugh, those COPYs are in different basic blocks; So current MachineCopyPropagation pass cannot do anything here, because it is BB-local only

Hi Alexander,

I was just about to say thanks for suggesting this, but this unfortunately did not help my test case, as you predicted.

/Jonas

llvmbot commented 6 years ago

Ugh, those COPYs are in different basic blocks; So current MachineCopyPropagation pass cannot do anything here, because it is BB-local only

llvmbot commented 6 years ago

Jonas, can you try https://reviews.llvm.org/D50064 to see whether it helps with your case? The patch is not reviewed yet, but it can mitigate some of the bad eviction decisions that were fixed before in RA e.g:

renamable $rbx = COPY renamable killed $rax
renamable $rcx = COPY renamable killed $rbx
renamable $rax = COPY renamable killed $rcx
NOOP implicit $rax

=> NOOP implicit $rax

JonPsson commented 6 years ago

Hi Marina,

I certainly understand the complexity of this issue!

I wouldn't mind trying to improve the MachineCopyPropagation pass, except I worry that even if I have a working patch it will get rejected with the reasoning that "RegAlloc should handle this". So before trying anything with that, I think we should have a general agreement that the test case is such that we don't expect RegAlloc to always succeed with it...

In this case it actually looks a bit like RegAlloc should handle this better, or?

llvmbot commented 6 years ago

Hi Jonas,

Sorry for not being able to address this issue yet. Usually these RA issues are quite complex and take some time to figure out in depth. I will resume looking at this over the weekend, but unfortunately my current main focus is a non llvm related project, therefore, it may take a while until I can upload a patch that identifies this bad eviction scenario.

There's another way to partially address such issues - you can add a cleanup pass. At one time Quentin Colombet suggested doing a cleanup of such copies in the MachineCopyPropagation pass, and this may be indeed a good idea to do regardless of improving RA. This does not solve RA's bad decisions, but if these copies are contained in a single basic block, they can be cleaned up there.

Thanks, Marina

JonPsson commented 6 years ago

Hi Marina, I don't want to rush you or anything, but just give a friendly reminder that we are looking forward to resolving this issue... /Jonas

llvmbot commented 6 years ago

Unfortunately I did not have a chance to look at the bug in-deep yet. I hope to find the time next week.

JonPsson commented 6 years ago

Any news on this, Marina?

llvmbot commented 6 years ago

Currently splitCanCauseEvictionChain identifies 2 very specific scenarios I've encountered, the scenario you're seeing here is probably not one of them.

From what I see in the log both the "before" and "after" code seem to have a bad eviction chain (the chain of copies) that moved from one place to another in %bb.2 due to the changes in the memory folding. I hope to look into this a bit deeper in the next few days and identify what is the exact scenario that caused this chain.

I fear the "shouldEvict" heuristic you suggested may be too intrusive and affects many other decisions. We must first pin point the scenario and then decide how to avoid it.

Regarding the difference for vreg %129 the code "before" could not find a way to split for $r31 because of previous assignments, so it chose to do a split that tries to create compact regions and not "fit" it the best way it can to the "empty" parts of a $r31 (more accurately, splitting for $r31 wasn't profitable). In the "after" code the assignments to $r31 have changed, and now there is a way to split %129 to fit $r31 (in a profitable way). When Greedy tries to estimate the split cost it checks if the split for $r31 is one of the known "bad" scenarios, but your root cause is probably different, this is why it says "Best split candidate of vreg %129 may not cause bad eviction chain".

The best way to approach it is to identify the first "bad" COPY created, which seems to be the decision to split vreg %99. It creates %136 and %137 split artifacts for the "before" code and %135 and %136 split artifacts for the "after" code. Then we need to investigate the chain affect that happened here and understand how we can better influence this decision.

I'll keep you updated in the next few days with what I'm able to find.

JonPsson commented 6 years ago

@​Marina: this test case does indeed seem related to splitCanCauseEvictionChain, and since you are the expert here, I would like to ask if you could share some advice?

I have not been able to figure out yet exactly which evictions are bad here, but one observation on this function is that it isn't really sensible to try to evict any register for the sake of another, since they all overlap and are practically identically structured.

The spill weights themselves are extremely small and very similar. I tried to add a heuristic like "don't evict if intervals are very similar" like

diff --git a/lib/CodeGen/RegAllocGreedy.cpp b/lib/CodeGen/RegAllocGreedy.cpp index 182d966..daf8ed7 100644 --- a/lib/CodeGen/RegAllocGreedy.cpp +++ b/lib/CodeGen/RegAllocGreedy.cpp @@ -842,7 +842,7 @@ bool RAGreedy::shouldEvict(LiveInterval &A, bool IsHint, if (CanSplit && IsHint && !BreaksHint) return true;

This made all the evictions for this test case to never happen in the first place, and the result was a general improvement, as expected with more folded operands.

Since checking specialized cases for eviction is quite a task, I wonder if my experiment could be useful somehow as a simplistic improvement? I wonder if there is a limit (or level of similarity) here where it doesn't make sense to go through the trouble of evicting one live range for the other?

Either way, this indicates further as Uli suggested that these evictions are bad and should not happen.

uweigand commented 6 years ago

It seems the interesting difference comes in RS_Split cascade 28, where the original code makes this decision:

Split for compact region in 2 bundles, intv 1. splitAroundRegion with 2 globals.

while the patched code makes this decision:

Best split candidate of vreg %129 may not cause bad eviction chain Split for $r3l in 1 bundles, intv 1. Split for compact region in 1 bundles, intv 2. splitAroundRegion with 3 globals.

Note the debug output says "may not cause bad eviction chain", but the resulting set of COPYs does in fact look suspiciously like a bad eviction chain as described in the comments in RegAllocGreedy.cpp ...