Open nathanchance opened 2 months ago
@llvm/issue-subscribers-backend-powerpc
Author: Nathan Chancellor (nathanchance)
This was also reported at https://bugzilla.redhat.com/show_bug.cgi?id=2296499, cc @tstellar @nikic
Perhaps this is similar to https://github.com/ClangBuiltLinux/linux/issues/2014 and its solution?
Let me clarify the issue first. From https://godbolt.org/z/7Ed5nbrbb, clang trunk is able to compile with -Wframe-larger-than=2048
for ppc64le-unknown-linux-gnu. Do you mean that clang for ppc64le uses larger stack memory than other platforms does? The trunk clang uses 1728 bytes.
Register allocation uses too many stack slot to do spill/reload.
194 regalloc - Number of spill slots allocated
Too many global variables(generated by -fsanitize=array-bounds
) are hoisted out of the loop because global variables accesses are all marked as reMaterializable
on PPC. The reMaterializable
instructions are hoisted without considering the register pressure. so the liveranges for these global variables become quite large and lead to bad register allocations result.
630 machinelicm - Number of machine instructions hoisted out of loop
diff --git a/llvm/lib/Target/PowerPC/PPCInstrInfo.cpp b/llvm/lib/Target/PowerPC/PPCInstrInfo.cpp
index d2195cfbdc5c..ed4a2b943383 100644
--- a/llvm/lib/Target/PowerPC/PPCInstrInfo.cpp
+++ b/llvm/lib/Target/PowerPC/PPCInstrInfo.cpp
@@ -1089,9 +1089,9 @@ bool PPCInstrInfo::isReallyTriviallyReMaterializable(
case PPC::LIS:
case PPC::LIS8:
case PPC::ADDIStocHA:
- case PPC::ADDIStocHA8:
+ //case PPC::ADDIStocHA8:
case PPC::ADDItocL:
- case PPC::ADDItocL8:
+ //case PPC::ADDItocL8:
case PPC::LOAD_STACK_GUARD:
case PPC::PPCLdFixedAddr:
case PPC::XXLXORz:
This can make the stack usage reduce to 208
diff --git a/llvm/lib/Target/PowerPC/PPCInstrInfo.cpp b/llvm/lib/Target/PowerPC/PPCInstrInfo.cpp index d2195cfbdc5c..ed4a2b943383 100644 --- a/llvm/lib/Target/PowerPC/PPCInstrInfo.cpp +++ b/llvm/lib/Target/PowerPC/PPCInstrInfo.cpp @@ -1089,9 +1089,9 @@ bool PPCInstrInfo::isReallyTriviallyReMaterializable( case PPC::LIS: case PPC::LIS8: case PPC::ADDIStocHA: - case PPC::ADDIStocHA8: + //case PPC::ADDIStocHA8: case PPC::ADDItocL: - case PPC::ADDItocL8: + //case PPC::ADDItocL8: case PPC::LOAD_STACK_GUARD: case PPC::PPCLdFixedAddr: case PPC::XXLXORz:
This can make the stack usage reduce to 208
This is an evidence for the root cause. Making ADDIStocHA8
and ADDItocL8
as ReMaterializable is correct. We need to find out why RA does not pull down the ReMaterializable instructions before it allocates so many spill slots.
Hi @nathanchance Thanks for reporting this issue. We may need sometime to fix it since we need to check the register allocation. If you are blocked by this issue, you can try with option -mllvm -disable-machine-licm
. With this option, the stack size is comparable with other platforms. See https://godbolt.org/z/8TeP84E5j
Hi @qcolombet do you have any insights about why Greedy Register Allocator does not pull down below reMaterializable instructions like MachineLICM assumes? Because Greedy Register Allocator does not pull down these instructions as expected, so many registers spills/reloads happen. Thank you!
%1388:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @155
%1389:g8rc = ADDItocL8 %1388:g8rc_and_g8rc_nox0, @155
%1404:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @157
%1405:g8rc = ADDItocL8 %1404:g8rc_and_g8rc_nox0, @157
%1427:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @159
%1428:g8rc = ADDItocL8 %1427:g8rc_and_g8rc_nox0, @159
%1416:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @158
%1417:g8rc = ADDItocL8 %1416:g8rc_and_g8rc_nox0, @158
%1400:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @156
%1401:g8rc = ADDItocL8 %1400:g8rc_and_g8rc_nox0, @156
%1388:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @155
%1389:g8rc = ADDItocL8 %1388:g8rc_and_g8rc_nox0, @155
These are the common medium code model pattern to get the address of the global variable @155
on PPC. ADDIStocHA8
and ADDItocL8
are both rematerializable instructions.
@chenzheng1030 This may be a limitation due to the two-instruction sequence to access the address. Perhaps it is worthwhile experimenting with providing a single-instruction pseudo to accomplish this which would be rematerializable.
@chenzheng1030 This may be a limitation due to the two-instruction sequence to access the address. Perhaps it is worthwhile experimenting with providing a single-instruction pseudo to accomplish this which would be rematerializable.
Thanks. We hit similar issue with big imm load before. In that case, some instructions from remat the big imm load do not have isReMaterializable
flag. For that case, the solution is to use a pseudo isReMaterializable
instruction I will check the issue too.
BTW, -mcmodel=small
also makes the RA work worse, but LDToc
from small code model is not a rematerializable instruction.
Hi @qcolombet do you have any insights about why Greedy Register Allocator does not pull down below reMaterializable instructions like MachineLICM assumes? Because Greedy Register Allocator does not pull down these instructions as expected, so many registers spills/reloads happen. Thank you!
%1388:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @155 %1389:g8rc = ADDItocL8 %1388:g8rc_and_g8rc_nox0, @155 %1404:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @157 %1405:g8rc = ADDItocL8 %1404:g8rc_and_g8rc_nox0, @157 %1427:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @159 %1428:g8rc = ADDItocL8 %1427:g8rc_and_g8rc_nox0, @159 %1416:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @158 %1417:g8rc = ADDItocL8 %1416:g8rc_and_g8rc_nox0, @158 %1400:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @156 %1401:g8rc = ADDItocL8 %1400:g8rc_and_g8rc_nox0, @156
%1388:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @155 %1389:g8rc = ADDItocL8 %1388:g8rc_and_g8rc_nox0, @155
These are the common medium code model pattern to get the address of the global variable
@155
on PPC.ADDIStocHA8
andADDItocL8
are both rematerializable instructions.
Sorry just seeing this. I'll have a quick look.
In the future, couple of things:
@chenzheng1030 This may be a limitation due to the two-instruction sequence to access the address. Perhaps it is worthwhile experimenting with providing a single-instruction pseudo to accomplish this which would be rematerializable.
I haven't looked at the log yet, but if indeed the thing that needs to be rematerialized is more than one instruction at a time, RA as it stands, won't do it. In other words, at this point a pseudo is better.
Thanks very much for having a check @qcolombet .
Here is a reduced case:
int a1;
int a2;
int a3;
int a4;
int a5;
int a6;
int a7;
int a8;
int a9;
int a10;
int a11;
int a12;
int a13;
int a14;
int a15;
int a16;
int a17;
int a18;
int a19;
int a20;
int a21;
int a22;
int a23;
int a24;
int a25;
int a26;
int a27;
int a28;
int a29;
int a30;
int a31;
int a32;
int a33;
int a34;
int a35;
int a36;
extern int bar1(int a);
extern int bar2(int* a);
int foo(int *arr1, int *arr2, int *arr3, int *arr4, int *arr5, int *arr6, int *arr7, int *arr8, int *arr9, int count) {
int sum = 0;
for (int i = 0; i < count; i++)
sum += bar2(&a1) * arr1[i] + bar2(&a2) * arr1[i-1] + bar2(&a3) * arr1[i-2] + bar2(&a4) * arr1[i-3] +
bar2(&a5) * arr2[i] + bar2(&a6) * arr2[i-1] + bar2(&a7) * arr2[i-2] + bar2(&a8) * arr2[i-3] +
bar2(&a9) * arr3[i] + bar2(&a10) * arr3[i-1] + bar2(&a11) * arr3[i-2] + bar2(&a12) * arr3[i-3] +
bar2(&a13) * arr4[i] + bar2(&a14) * arr4[i-1] + bar2(&a15) * arr4[i-2] + bar2(&a16) * arr4[i-3] +
bar2(&a17) * arr5[i] + bar2(&a18) * arr5[i-1] + bar2(&a19) * arr5[i-2] + bar2(&a20) * arr5[i-3] +
bar2(&a21) * arr6[i] + bar2(&a22) * arr6[i-1] + bar2(&a23) * arr6[i-2] + bar2(&a24) * arr6[i-3] +
bar2(&a25) * arr7[i] + bar2(&a26) * arr7[i-1] + bar2(&a27) * arr7[i-2] + bar2(&a28) * arr7[i-3] +
bar2(&a29) * arr8[i] + bar2(&a30) * arr8[i-1] + bar2(&a31) * arr8[i-2] + bar2(&a32) * arr8[i-3] +
bar2(&a33) * arr9[i] + bar2(&a34) * arr9[i-1] + bar2(&a35) * arr9[i-2] + bar2(&a36) * arr9[i-3];
return bar1(sum);
}
Before RA, in the loop preheader,
480B %46:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @a1
512B %53:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @a2
544B %59:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @a3
548B %47:g8rc = ADDItocL8 %46:g8rc_and_g8rc_nox0, @a1, implicit $x2 // medium code model to get address of @a1
552B %54:g8rc = ADDItocL8 %53:g8rc_and_g8rc_nox0, @a2, implicit $x2
560B %60:g8rc = ADDItocL8 %59:g8rc_and_g8rc_nox0, @a3, implicit $x2
576B %65:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @a4
592B %66:g8rc = ADDItocL8 %65:g8rc_and_g8rc_nox0, @a4, implicit $x2
608B %71:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @a5
624B %72:g8rc = ADDItocL8 %71:g8rc_and_g8rc_nox0, @a5, implicit $x2
640B %78:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @a6
656B %79:g8rc = ADDItocL8 %78:g8rc_and_g8rc_nox0, @a6, implicit $x2
672B %84:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @a7
688B %85:g8rc = ADDItocL8 %84:g8rc_and_g8rc_nox0, @a7, implicit $x2
704B %90:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @a8
720B %91:g8rc = ADDItocL8 %90:g8rc_and_g8rc_nox0, @a8, implicit $x2
After RA, in loop preheader:
560B %326:g8rc = ADDItocL8 %46:g8rc_and_g8rc_nox0, @a1, implicit $x2
568B STD %326:g8rc, 0, %stack.1 :: (store (s64) into %stack.1)
576B %327:g8rc = ADDItocL8 %53:g8rc_and_g8rc_nox0, @a2, implicit $x2
584B STD %327:g8rc, 0, %stack.2 :: (store (s64) into %stack.2)
592B %328:g8rc = ADDItocL8 %59:g8rc_and_g8rc_nox0, @a3, implicit $x2
600B STD %328:g8rc, 0, %stack.3 :: (store (s64) into %stack.3)
...
848B %109:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @a11
864B %336:g8rc = ADDItocL8 %109:g8rc_and_g8rc_nox0, @a11, implicit $x2
872B STD %336:g8rc, 0, %stack.11 :: (store (s64) into %stack.11)
880B %115:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @a12
896B %337:g8rc = ADDItocL8 %115:g8rc_and_g8rc_nox0, @a12, implicit $x2
904B STD %337:g8rc, 0, %stack.12 :: (store (s64) into %stack.12)
912B %121:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @a13
928B %338:g8rc = ADDItocL8 %121:g8rc_and_g8rc_nox0, @a13, implicit $x2
936B STD %338:g8rc, 0, %stack.13 :: (store (s64) into %stack.13)
For each variable with medium code model, PPC needs two instructions to get its address, instruction ADDItocL8
and ADDIStocHA8
which are both ReMaterializable.
In the greedy RA,
ADDIStocHA8
is pulled down to the loop, like variable a1
, a2
, a3
. But the left rematerializable instructions ADDItocL8
in the loop preheader still requires spills.ADDItocL8
and ADDIStocHA8
are still kept inside the preheader, so the spill is also required for these variables, like a11
, a12
, a13
.If RA can not pull down two rematerializable instructions for a1
, a2
, a3
, it can not match the assumption in MachineLICM. MachineLICM treats rematerializable instructions as always be profitable to hoist. These rematerializable instructions relies on RA can pull them down when needed.
For a11
, a12
, a13
, it seems more confused that RA spills directly for ADDIStocHA8
without pulling it down.
For now, small code model uses only one instructions LDtoc
. LDtoc
is not a remat instructions for now thus, register allocator does not pull down LDtoc
, so that there are many spills for the small code model. Making LDtoc
as remat can make register allocator pull down the LDtoc
as expected. Similarly a pseudo remat instruction for medium code model should work.
Alright, I had a quick look and indeed the issue is that the code sequence to rematerialize is more than one-instruction long.
To take an example, consider:
848B %109:g8rc_and_g8rc_nox0 = ADDIStocHA8 $x2, @a11
864B %336:g8rc = ADDItocL8 %109:g8rc_and_g8rc_nox0, @a11, implicit $x2
The live-range of %109
is super short so it is not problematic wrt RA (i.e., its interference with the rest of the live-ranges are minimal), but it will be a problem for rematerialization. Coming back to that in a second.
So now, the long lived live-range %336
is the one that gets spilled.
What happens is %336
is marked as rematerializable so the inline spiller tries to do that before spilling. However, %336
needs %109
to be rematerialized and since this one has a super short live-range, it is not available at the point where we try to rematerialize %336
. RA is not smart enough to rematerialize a second instruction (e.g., to bring %109
in) and as a result, it fails to rematerialize %336
and we spill it.
The workaround would be to use a pseudo to have only one instruction to rematerialize. Ideally fixing RA would be best, but this is not going to happen anytime soon.
The alternatively would be smarter about MachineLICM, but that's also likely to be a heuristic hell!
I see. Thanks for the explanation about why the pattern can not be rematerialized. Yep, I tried to add a new heuristic in MachineLICM for another case about big immediates loading with more than one instructions in a phabricator patch, the reviewer also said no more heuristic should be added in MachineLICM. The solution for that is to add a pseudo instruction to load big immediate on PPC.
So for this issue, we also need a pseudo instruction to load the content of a TOC entry for medium code model. I'll do this.
Another necessary fix is to make LDtoc
/LWZtoc
as remateriable, otherwise too many spills for above case with small code model too.
A recent change in Fedora's configuration to build certain files in the
drivers/gpu/drm
folder of the Linux kernel with-Werror
revealed an instance of-Wframe-larger-than
indrivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
that appears only when building for PowerPC. It shows up with-fsanitize=array-bounds
, which becomes apparent why from the reproducer that I got fromcvise
:Based on this reproducer, PowerPC performs much worse compared to several other architectures:
I've tentatively marked this as an issue with the PowerPC backend.