weiguozhi opened this issue 8 years ago. Status: Open.
Hi Carrot,
Just to let you know, we are still discussing this issue internally. It might take some time to reach a conclusion.
CY
I agree with your comment.
I did some experiments on 400.perlbench, and I found that "-fno-strict-aliasing" may be the main reason current llvm generates slower code than gcc.
In my case, I found there were extra flags (EXTRA_CFLAGS) used by 400.perlbench: -fgnu89-inline -fno-strict-aliasing
When I removed "-fno-strict-aliasing" from the llvm compile, I got a 1% performance improvement.
My experiments, base options: -m64 -O3 -mcpu=power8 -funroll-loops -mno-vsx -ffast-math -mrecip=!divd -fomit-frame-pointer

With -fno-strict-aliasing:    clang-3.9.0 (r266153) = 557.33 sec    at9.0 (gcc 5.2.1) = 549.33 sec
Without -fno-strict-aliasing: clang-3.9.0 (r266153) = 551.33 sec    at9.0 (gcc 5.2.1) = 563.67 sec
So it looks like "-fno-strict-aliasing" is the root cause. Did you also use "-fno-strict-aliasing" in EXTRA_CFLAGS when compiling 400.perlbench?
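For reference, the kind of pattern strict aliasing affects can be sketched with a minimal hypothetical example (loop_sum is an illustration, not code from perlbench). Under the default -fstrict-aliasing the compiler may assume an int* and a float* never alias, so the load of *x can be hoisted out of the loop; with -fno-strict-aliasing it must conservatively reload *x after every store through f:

```cpp
// Illustrative only: strict aliasing lets the compiler assume the store
// through `f` cannot modify `*x`, so `*x` can be loaded once before the
// loop. With -fno-strict-aliasing, `*x` is reloaded every iteration.
int loop_sum(int *x, float *f, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i) {
    *f = 1.0f; // under strict aliasing, cannot alias *x
    s += *x;
  }
  return s;
}
```

Either way the result is the same when the pointers really don't alias; only the generated code differs.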
Hi Carrot,
Sorry, you are right! The performance is not improved as you mentioned. It's a frustrating result. We need some internal discussion about our next step. I'm afraid software instruction scheduling in the prologue/epilogue is not helpful for this kind of out-of-order machine :(
CY
I guess the problem is that in a typical prologue/epilogue most of the instructions are low-latency instructions except mflr. The instructions before mflr can finish execution and retire quickly, but the instructions after mflr must wait for the retirement of mflr, since retirement must be in order. So the latency of mflr can't be hidden by scheduling.
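This in-order-retirement argument can be checked with a toy model (not a real POWER8 model; the 4-cycle mflr latency and 1-cycle store latency are assumed numbers for illustration). Instructions issue one per cycle, instruction i completes at cycle i + latency, and each instruction retires no earlier than one cycle after its predecessor:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model: returns the cycle at which the last instruction retires,
// assuming one issue per cycle and strictly in-order retirement.
int retire_cycle(const std::vector<int> &latencies) {
  int retire = 0;
  for (std::size_t i = 0; i < latencies.size(); ++i) {
    int complete = static_cast<int>(i) + latencies[i]; // issued at cycle i
    retire = std::max(retire + 1, complete);           // in-order retirement
  }
  return retire;
}
```

With mflr modeled at latency 4 and three stores at latency 1, both orderings {mflr, st, st, st} and {st, st, st, mflr} retire the last instruction at cycle 7: the total is bound by mflr's completion, so reordering within the prologue cannot hide it.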
My llvm/clang revision is 263000.
Our production machine doesn't have cpufreq-set. So I can only run the benchmarks multiple times to get a stable result.
Ok, I will re-run experiments and check performance score again.
Hi Carrot,
Could you please give me your llvm/clang revision?
Also, how did you test performance? For us, we bind the run to physical cpu0 and set cpu0 to the performance governor:
sudo cpufreq-set -c 0 -g performance
taskset -c 0 runspec ...
// after the test, reset cpu0 to the ondemand governor
sudo cpufreq-set -c 0 -g ondemand
CY
Hi CY
I tested the patch http://reviews.llvm.org/D18030 on power8. The following is the execution time of several runs of mcf:
w/o      w/ the patch
222.7s   216.8s
219.5s   219.0s
217.9s   221.8s
220.0s   222.4s
It seems there is no difference.
Hi Carrot,
I uploaded the patch to phabricator: http://reviews.llvm.org/D18030
I re-ran the benchmark, and I found it benefits performance. Could you download the patch and test it on your machine? Thanks!
I also uploaded a patch that tries to use mfocrf in the prologue when appropriate. mfocrf has shorter latency compared to mfcr, so we can get additional benefit from using this instruction. http://reviews.llvm.org/D17749
CY
Hi Carrot,
The patch is what we are currently working on. Actually, we made it a few weeks ago. This patch passed our internal full test, and we have also benchmarked it.
Below is one example and its result:
Original:
    mflr 0
    mfcr 12
    std 0, 16(1)
    stw 12, 8(1)
    stdu 1, -240(1)
    std 14, 96(1)
    std 15, 104(1)
    std 16, 112(1)
    std 17, 120(1)
    std 18, 128(1)
    std 19, 136(1)
    std 30, 224(1)
    std 31, 232(1)
    bl bar
    nop
    ld 31, 232(1)
    ld 30, 224(1)
    ld 19, 136(1)
    ld 18, 128(1)
    ld 17, 120(1)
    ld 16, 112(1)
    ld 15, 104(1)
    ld 14, 96(1)
    addi 1, 1, 240
    ld 0, 16(1)
    lwz 12, 8(1)
    mtocrf 32, 12
    mtlr 0
    blr

With the patch (reordered):
    stdu 1, -240(1)
    mflr 0
    std 14, 96(1)
    std 15, 104(1)
    std 16, 112(1)
    std 17, 120(1)
    std 18, 128(1)
    mfcr 12
    std 19, 136(1)
    std 30, 224(1)
    std 31, 232(1)
    std 0, 256(1)
    stw 12, 248(1)
    bl bar
    nop
    ld 0, 256(1)
    lwz 12, 248(1)
    ld 31, 232(1)
    ld 30, 224(1)
    ld 19, 136(1)
    ld 18, 128(1)
    ld 17, 120(1)
    ld 16, 112(1)
    ld 15, 104(1)
    ld 14, 96(1)
    mtocrf 32, 12
    mtlr 0
    addi 1, 1, 240
    blr
Our SPEC2006-INT/FP benchmark results didn't show obvious performance improvement; that's why we hesitated about whether to send out the patch or not.
It looks like an out-of-order CPU with a big instruction window, plenty of execution pipes, and plenty of issue queue entries can diminish a software instruction scheduler's effect.
http://users.elis.ugent.be/~leeckhou/papers/hipeac08-eyerman.pdf
Anyway, at least this patch relaxes the instruction order constraints in the prologue and epilogue; we will upload it later.
CY
A typical function prologue/epilogue on ppc64 looks like
    mflr r0         // P1
    std r0, 16(r1)  // P2
    ...
    ld r0, 16(r1)   // E1
    mtlr r0         // E2
    blr             // E3
P2 depends on P1, E2 depends on E1, and E3 depends on E2. The following code sequence can completely remove the dependences (P2, P1) and (E2, E1):
    std r14, 16(r1)
    mflr r14
    ...
    mtlr r14
    ld r14, 16(r1)
    blr
One problem with this code sequence is that the stack frame layout is changed, which can impact unwinding (and exceptions). Correct DWARF information can solve this problem (at least in spec2006).
I tested the following patch on the spec2006 C/C++ applications. Unfortunately it doesn't bring obvious performance improvement. There is only a 3~4s improvement for astar, which is less than 1%, and no change in the other applications.
--- lib/Target/PowerPC/PPCFrameLowering.cpp    (revision 262420)
+++ lib/Target/PowerPC/PPCFrameLowering.cpp    (working copy)
@@ -851,8 +851,13 @@
        .addReg(SPReg);
  }
        .addReg(SPReg);

  if (MustSaveCR && !(SingleScratchReg && MustSaveLR)) { // will only occur for PPC64

@@ -888,10 +893,11 @@
  if (MustSaveLR)
    // FIXME: On PPC32 SVR4, we must not spill before claiming the stackframe.
    BuildMI(MBB, MBBI, dl, MFLRInst, SavedLRReg);

  if (MustSaveCR && !(SingleScratchReg && MustSaveLR)) // will only occur for PPC64

@@ -1014,10 +1020,20 @@
  if (MustSaveLR) {
    // Describe where LR was saved, at a fixed offset from CFA.
    unsigned Reg = MRI->getDwarfRegNum(LRReg, true);

@@ -1046,6 +1062,16 @@
  for (unsigned I = 0, E = CSI.size(); I != E; ++I) {
    unsigned Reg = CSI[I].getReg();
    if (Reg == PPC::LR || Reg == PPC::LR8 || Reg == PPC::RM)
      continue;

    if ((Reg == SavedLRReg) && MustSaveLR) {
      unsigned lrReg = MRI->getDwarfRegNum(LRReg, true);
      int Offset = MFI->getObjectOffset(CSI[I].getFrameIdx());
      unsigned CFIIndex2 = MMI.addFrameInst(
          MCCFIInstruction::createOffset(nullptr, lrReg, Offset));
      BuildMI(MBB, MBBI, dl, TII.get(TargetOpcode::CFI_INSTRUCTION))
          .addCFIIndex(CFIIndex2);
      continue;
    }

    // This is a bit of a hack: CR2LT, CR2GT, CR2EQ and CR2UN are just
    // subregisters of CR2. We just need to emit a move of CR2.

@@ -1254,10 +1280,12 @@
        .addReg(TempReg, getKillRegState(i == e-1));
  }

  int SavedLRReg = isPPC64 ? PPC::X14 : PPC::R14;
  if (MustSaveLR)
    BuildMI(MBB, MBBI, dl, LoadInst, ScratchReg)
        .addImm(LROffset)
        .addReg(SPReg);
    BuildMI(MBB, MBBI, dl, MTLRInst).addReg(SavedLRReg);
    //BuildMI(MBB, MBBI, dl, LoadInst, ScratchReg)
    //    .addImm(LROffset)
    //    .addReg(SPReg);

@@ -1289,7 +1317,10 @@
        .addReg(TempReg, getKillRegState(i == e-1));

  if (MustSaveLR)
    BuildMI(MBB, MBBI, dl, MTLRInst).addReg(ScratchReg);
    //BuildMI(MBB, MBBI, dl, MTLRInst).addReg(ScratchReg);
    BuildMI(MBB, MBBI, dl, LoadInst, SavedLRReg)
        .addImm(LROffset)
        .addReg(SPReg);

  // Callee pop calling convention. Pop parameter/linkage area. Used for tail
  // call optimization
Update status:
[SU0] mfcr 12
[SU1] mflr 0
[SU2] std 0, 16(1)
[SU3] stw 12, 8(1)
[SU4] stdu 1, -176(1)
We are able to reorder these instructions into a better sequence. Our current methods:
1. Move the stack allocation [SU4] to the end of the basic block (before the terminator instruction), because we need more room to reorder the other instructions.
2. Add 'mflr' scheduling data in PPCScheduleP8.td. We tell the scheduler that 'mflr' and 'mfcr' use the same resource and both occupy the execution pipe for 4 cycles, so post-RA-sched will try to separate them.
3. Set the 'mflr' "no side effect" property, so the scheduler can reorder it freely.
4. Add memory operand information for [SU2] and [SU3], so the scheduler can reorder them without worrying about memory dependencies.
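The PPCScheduleP8.td change described above (mflr and mfcr occupying the same pipe for 4 cycles, so the scheduler separates them) can be illustrated with a toy occupancy model. The single shared pipe and the latencies here are assumptions for illustration, not actual PPCScheduleP8.td data:

```cpp
#include <algorithm>
#include <vector>

// Toy model: 'm' ops (mfcr/mflr) share one pipe and each occupy it for
// 4 cycles; 's' ops (stores) use a different pipe with 1-cycle latency.
// Issue is in order, one instruction per cycle, and an 'm' stalls until
// the shared pipe is free. Returns the cycle the last result is ready.
int total_cycles(const std::vector<char> &ops) {
  int cycle = 0, pipe_free = 0, done = 0;
  for (char op : ops) {
    if (op == 'm') {
      cycle = std::max(cycle, pipe_free); // wait for the shared SPR pipe
      pipe_free = cycle + 4;
      done = std::max(done, cycle + 4);
    } else {
      done = std::max(done, cycle + 1);
    }
    ++cycle; // next in-order issue slot
  }
  return done;
}
```

In this model, back-to-back mfcr/mflr followed by four stores ({m,m,s,s,s,s}) finishes at cycle 9, while interleaving stores between them ({m,s,s,s,m,s}) finishes at cycle 8: separating the two move-from ops removes the issue stall on the shared pipe.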
By the way, happy new year 2016 @@ ya
There are 2 issues in this llvm-generated code:
PPCFrameLowering.cpp says:
// FIXME: In the ELFv2 ABI, we are not required to save all CR fields.
// If only one or two CR fields are clobbered, it could be more
// efficient to use mfocrf to selectively save just those fields.
So we are going to fix it.
About issue 2, we are trying to use the llvm scheduling mechanism. We are investigating how to do this; if you have any recommendations, please let us know.
For the move-to case, I noticed only the lr register; it is used immediately for the return address, while cr is usually not used immediately. I can't think of any other SPR that is used so frequently.
We've discussed similar code sequences, but in the epilogue code (i.e., using mtlr). Have you seen any examples for the move to spr instructions causing stalls also?
Either way, I think any solution should take into account scheduling of both the move-from and move-to spr instructions.
Not sure if this is still an issue.
https://github.com/llvm/llvm-project/commit/eb7d16ea2564 was merged to handle mflr and its users in the prologue.
Extended Description
This is a very common instruction sequence.
One example is Perl_sv_setsv_flags from perlbench. On power8, the llvm generated Perl_sv_setsv_flags consumes 4.26% of the run time, while the gcc generated function consumes 3.5%.
perf annotate shows following instructions and cycles of llvm generated code:
4.95 │ mfcr r12
4.18 │ mflr r0
0.07 │ std r31,-8(r1)
│ std r0,16(r1)
0.02 │ stw r12,8(r1)
...
The two instructions mfcr/mflr consume a lot of time. They are slow instructions, and their results are used immediately, which causes stalls.
For comparison following is gcc generated code:
1.63 │ mflr r0
│ cmpld cr7,r4,r3
1.57 │ std r30,-16(r1)
0.01 │ std r31,-8(r1)
1.66 │ mfocrf r12,8
│ std r26,-48(r1)
│ std r27,-40(r1)
0.02 │ mr r31,r3
│ mr r30,r4
│ std r28,-32(r1)
2.64 │ std r29,-24(r1)
│ stw r12,8(r1)
│ std r0,16(r1)
...
It has much better scheduling of mflr/mfocrf relative to the instructions that use their results.