weiguozhi opened this issue 8 years ago. Status: Open.
Hi Carrot,
Just to let you know, we are still discussing this issue internally. It might take some time to reach a conclusion.
CY
I agree with your comment.
I did some experiments on 400.perlbench, and I found that "-fno-strict-aliasing" may be the main reason current llvm generates slower code than gcc.
In my case, I found there were extra flags (EXTRA_CFLAGS) used by 400.perlbench: -fgnu89-inline -fno-strict-aliasing
When I removed "-fno-strict-aliasing" from the llvm compile, I got a 1% performance improvement.
My experiments, base options: -m64 -O3 -mcpu=power8 -funroll-loops -mno-vsx -ffast-math -mrecip=!divd -fomit-frame-pointer

With -fno-strict-aliasing:    clang-3.9.0 (r266153) = 557.33 sec    at9.0 (gcc 5.2.1) = 549.33 sec
Without -fno-strict-aliasing: clang-3.9.0 (r266153) = 551.33 sec    at9.0 (gcc 5.2.1) = 563.67 sec
So it looks like "-fno-strict-aliasing" is the root cause. Did you also use "-fno-strict-aliasing" in EXTRA_CFLAGS when compiling 400.perlbench?
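For reference, the kind of pattern strict aliasing affects can be sketched with a minimal hypothetical example (loop_sum is an illustration, not code from perlbench). Under the default -fstrict-aliasing the compiler may assume an int* and a float* never alias, so the load of *x can be hoisted out of the loop; with -fno-strict-aliasing it must conservatively reload *x after every store through f:

```cpp
// Illustrative only: strict aliasing lets the compiler assume the store
// through `f` cannot modify `*x`, so `*x` can be loaded once before the
// loop. With -fno-strict-aliasing, `*x` is reloaded every iteration.
int loop_sum(int *x, float *f, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i) {
    *f = 1.0f; // under strict aliasing, cannot alias *x
    s += *x;
  }
  return s;
}
```

Either way the result is the same when the pointers really don't alias; only the generated code differs.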
Hi Carrot,
Sorry, you are right! The performance is not improved as you mentioned. It's a frustrating result. We need some internal discussion about our next step. I'm afraid software instruction scheduling in the prologue/epilogue is not helpful for this kind of out-of-order machine :(
CY
I guess the problem is that in a typical prologue/epilogue most of the instructions are low-latency instructions except mflr. The instructions before mflr can finish execution and retire quickly, but the instructions after mflr must wait for the retirement of mflr, since retirement must be in order. So the latency of mflr can't be hidden by scheduling.
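This in-order-retirement argument can be checked with a toy model (not a real POWER8 model; the 4-cycle mflr latency and 1-cycle store latency are assumed numbers for illustration). Instructions issue one per cycle, instruction i completes at cycle i + latency, and each instruction retires no earlier than one cycle after its predecessor:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model: returns the cycle at which the last instruction retires,
// assuming one issue per cycle and strictly in-order retirement.
int retire_cycle(const std::vector<int> &latencies) {
  int retire = 0;
  for (std::size_t i = 0; i < latencies.size(); ++i) {
    int complete = static_cast<int>(i) + latencies[i]; // issued at cycle i
    retire = std::max(retire + 1, complete);           // in-order retirement
  }
  return retire;
}
```

With mflr modeled at latency 4 and three stores at latency 1, both orderings {mflr, st, st, st} and {st, st, st, mflr} retire the last instruction at cycle 7: the total is bound by mflr's completion, so reordering within the prologue cannot hide it.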
My llvm/clang revision is 263000.
Our production machine doesn't have cpufreq-set. So I can only run the benchmarks multiple times to get a stable result.
Ok, I will re-run experiments and check performance score again.
Hi Carrot,
Could you please give me your llvm/clang revision?
Also, how did you test performance? For us, we bind the run to physical cpu0 and set cpu0 to the performance governor:
sudo cpufreq-set -c 0 -g performance
taskset -c 0 runspec ...
// after the test, reset cpu0 to the ondemand governor
sudo cpufreq-set -c 0 -g ondemand
CY
Hi CY
I tested the patch http://reviews.llvm.org/D18030 on power8. The following is the execution time of several runs of mcf:
w/o      w/ the patch
222.7s   216.8s
219.5s   219.0s
217.9s   221.8s
220.0s   222.4s
It seems there is no difference.
Hi Carrot,
I uploaded the patch to phabricator: http://reviews.llvm.org/D18030
I re-ran the benchmark, and I found it benefits performance. Could you download the patch and test it on your machine? Thanks!
I also uploaded a patch that tries to use mfocrf in the prologue when appropriate. mfocrf has shorter latency compared to mfcr, so we can get additional benefit from using this instruction. http://reviews.llvm.org/D17749
CY
Hi Carrot,
The patch is what we are currently working on. Actually, we made it a few weeks ago. This patch passed our internal full test, and we have also benchmarked it.
Below is one example and its result:
Original:
    mflr 0
    mfcr 12
    std 0, 16(1)
    stw 12, 8(1)
    stdu 1, -240(1)
    std 14, 96(1)
    std 15, 104(1)
    std 16, 112(1)
    std 17, 120(1)
    std 18, 128(1)
    std 19, 136(1)
    std 30, 224(1)
    std 31, 232(1)
    bl bar
    nop
    ld 31, 232(1)
    ld 30, 224(1)
    ld 19, 136(1)
    ld 18, 128(1)
    ld 17, 120(1)
    ld 16, 112(1)
    ld 15, 104(1)
    ld 14, 96(1)
    addi 1, 1, 240
    ld 0, 16(1)
    lwz 12, 8(1)
    mtocrf 32, 12
    mtlr 0
    blr

With the patch (reordered):
    stdu 1, -240(1)
    mflr 0
    std 14, 96(1)
    std 15, 104(1)
    std 16, 112(1)
    std 17, 120(1)
    std 18, 128(1)
    mfcr 12
    std 19, 136(1)
    std 30, 224(1)
    std 31, 232(1)
    std 0, 256(1)
    stw 12, 248(1)
    bl bar
    nop
    ld 0, 256(1)
    lwz 12, 248(1)
    ld 31, 232(1)
    ld 30, 224(1)
    ld 19, 136(1)
    ld 18, 128(1)
    ld 17, 120(1)
    ld 16, 112(1)
    ld 15, 104(1)
    ld 14, 96(1)
    mtocrf 32, 12
    mtlr 0
    addi 1, 1, 240
    blr
Our SPEC2006-INT/FP benchmark results didn't show obvious performance improvement; that's why we hesitated about whether to send out the patch or not.
It looks like an out-of-order CPU with a big instruction window, plenty of execution pipes, and plenty of issue queue entries can diminish a software instruction scheduler's effect.
http://users.elis.ugent.be/~leeckhou/papers/hipeac08-eyerman.pdf
Anyway, at least this patch relaxes the instruction order constraints in the prologue and epilogue; we will upload it later.
CY
A typical function prologue/epilogue on ppc64 looks like
    mflr r0         // P1
    std r0, 16(r1)  // P2
    ...
    ld r0, 16(r1)   // E1
    mtlr r0         // E2
    blr             // E3
P2 depends on P1, E2 depends on E1, and E3 depends on E2. The following code sequence can completely remove the dependences (P2, P1) and (E2, E1):
    std r14, 16(r1)
    mflr r14
    ...
    mtlr r14
    ld r14, 16(r1)
    blr
One problem with this code sequence is that the stack frame layout is changed, which can impact unwinding (and exceptions). Correct DWARF information can solve this problem (at least in spec2006).
I tested the following patch on the spec2006 C/C++ applications. Unfortunately it doesn't bring obvious performance improvement. There is only a 3~4s improvement for astar, which is less than 1%, and no change in the other applications.
--- lib/Target/PowerPC/PPCFrameLowering.cpp    (revision 262420)
+++ lib/Target/PowerPC/PPCFrameLowering.cpp    (working copy)
@@ -851,8 +851,13 @@
        .addReg(SPReg);
  }
        .addReg(SPReg);

  if (MustSaveCR && !(SingleScratchReg && MustSaveLR)) { // will only occur for PPC64

@@ -888,10 +893,11 @@
  if (MustSaveLR)
    // FIXME: On PPC32 SVR4, we must not spill before claiming the stackframe.
    BuildMI(MBB, MBBI, dl, MFLRInst, SavedLRReg);

  if (MustSaveCR && !(SingleScratchReg && MustSaveLR)) // will only occur for PPC64

@@ -1014,10 +1020,20 @@
  if (MustSaveLR) {
    // Describe where LR was saved, at a fixed offset from CFA.
    unsigned Reg = MRI->getDwarfRegNum(LRReg, true);

@@ -1046,6 +1062,16 @@
  for (unsigned I = 0, E = CSI.size(); I != E; ++I) {
    unsigned Reg = CSI[I].getReg();
    if (Reg == PPC::LR || Reg == PPC::LR8 || Reg == PPC::RM)
      continue;

    if ((Reg == SavedLRReg) && MustSaveLR) {
      unsigned lrReg = MRI->getDwarfRegNum(LRReg, true);
      int Offset = MFI->getObjectOffset(CSI[I].getFrameIdx());
      unsigned CFIIndex2 = MMI.addFrameInst(
          MCCFIInstruction::createOffset(nullptr, lrReg, Offset));
      BuildMI(MBB, MBBI, dl, TII.get(TargetOpcode::CFI_INSTRUCTION))
          .addCFIIndex(CFIIndex2);
      continue;
    }

    // This is a bit of a hack: CR2LT, CR2GT, CR2EQ and CR2UN are just
    // subregisters of CR2. We just need to emit a move of CR2.

@@ -1254,10 +1280,12 @@
        .addReg(TempReg, getKillRegState(i == e-1));
  }

  int SavedLRReg = isPPC64 ? PPC::X14 : PPC::R14;
  if (MustSaveLR)
    BuildMI(MBB, MBBI, dl, LoadInst, ScratchReg)
        .addImm(LROffset)
        .addReg(SPReg);
    BuildMI(MBB, MBBI, dl, MTLRInst).addReg(SavedLRReg);
    //BuildMI(MBB, MBBI, dl, LoadInst, ScratchReg)
    //    .addImm(LROffset)
    //    .addReg(SPReg);

@@ -1289,7 +1317,10 @@
        .addReg(TempReg, getKillRegState(i == e-1));

  if (MustSaveLR)
    BuildMI(MBB, MBBI, dl, MTLRInst).addReg(ScratchReg);
    //BuildMI(MBB, MBBI, dl, MTLRInst).addReg(ScratchReg);
    BuildMI(MBB, MBBI, dl, LoadInst, SavedLRReg)
        .addImm(LROffset)
        .addReg(SPReg);

  // Callee pop calling convention. Pop parameter/linkage area. Used for tail
  // call optimization
Update status:
[SU0] mfcr 12
[SU1] mflr 0
[SU2] std 0, 16(1)
[SU3] stw 12, 8(1)
[SU4] stdu 1, -176(1)
We are able to reorder these instructions into a better sequence. Our current methods:
1. Move the stack allocation [SU4] to the end of the basic block (before the terminator instruction), because we need more room to reorder the other instructions.
2. Add 'mflr' scheduling data in PPCScheduleP8.td. We tell the scheduler that 'mflr' and 'mfcr' use the same resource and both occupy the execution pipe for 4 cycles, so post-RA-sched will try to separate them.
3. Set the 'mflr' "no side effect" property, so the scheduler can reorder it freely.
4. Add memory operand information for [SU2] and [SU3], so the scheduler can reorder them without worrying about memory dependencies.
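The PPCScheduleP8.td change described above (mflr and mfcr occupying the same pipe for 4 cycles, so the scheduler separates them) can be illustrated with a toy occupancy model. The single shared pipe and the latencies here are assumptions for illustration, not actual PPCScheduleP8.td data:

```cpp
#include <algorithm>
#include <vector>

// Toy model: 'm' ops (mfcr/mflr) share one pipe and each occupy it for
// 4 cycles; 's' ops (stores) use a different pipe with 1-cycle latency.
// Issue is in order, one instruction per cycle, and an 'm' stalls until
// the shared pipe is free. Returns the cycle the last result is ready.
int total_cycles(const std::vector<char> &ops) {
  int cycle = 0, pipe_free = 0, done = 0;
  for (char op : ops) {
    if (op == 'm') {
      cycle = std::max(cycle, pipe_free); // wait for the shared SPR pipe
      pipe_free = cycle + 4;
      done = std::max(done, cycle + 4);
    } else {
      done = std::max(done, cycle + 1);
    }
    ++cycle; // next in-order issue slot
  }
  return done;
}
```

In this model, back-to-back mfcr/mflr followed by four stores ({m,m,s,s,s,s}) finishes at cycle 9, while interleaving stores between them ({m,s,s,s,m,s}) finishes at cycle 8: separating the two move-from ops removes the issue stall on the shared pipe.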
By the way, happy new year 2016 @@ ya
There are 2 issues in this llvm-generated code:
PPCFrameLowering.cpp says:
// FIXME: In the ELFv2 ABI, we are not required to save all CR fields.
// If only one or two CR fields are clobbered, it could be more
// efficient to use mfocrf to selectively save just those fields.
So we are going to fix it.
About issue 2, we are trying to use the llvm scheduling mechanism. We are investigating how to do this; if you have any recommendations, please let us know.
For the move-to case, I noticed only the lr register; it is used immediately for the return address, while cr is usually not used immediately. I can't think of any other SPR that is used so frequently.
We've discussed similar code sequences, but in the epilogue code (i.e., using mtlr). Have you seen any examples for the move to spr instructions causing stalls also?
Either way, I think any solution should take into account scheduling of both the move-from and move-to spr instructions.
Not sure if this is still an issue.
https://github.com/llvm/llvm-project/commit/eb7d16ea2564 was merged to handle mflr and its users in the prologue.
Extended Description
This is a very common instruction sequence.
One example is Perl_sv_setsv_flags from perlbench. On power8, the llvm generated Perl_sv_setsv_flags consumes 4.26% of the run time, while the gcc generated function consumes 3.5%.
perf annotate shows following instructions and cycles of llvm generated code:
4.95 │ mfcr r12
4.18 │ mflr r0
0.07 │ std r31,-8(r1)
│ std r0,16(r1)
0.02 │ stw r12,8(r1)
...
The two instructions mfcr/mflr consume a lot of time. They are slow instructions, and their results are used immediately, which causes stalls.
For comparison following is gcc generated code:
1.63 │ mflr r0
│ cmpld cr7,r4,r3
1.57 │ std r30,-16(r1)
0.01 │ std r31,-8(r1)
1.66 │ mfocrf r12,8
│ std r26,-48(r1)
│ std r27,-40(r1)
0.02 │ mr r31,r3
│ mr r30,r4
│ std r28,-32(r1)
2.64 │ std r29,-24(r1)
│ stw r12,8(r1)
│ std r0,16(r1)
...
It has much better scheduling of mflr/mfocrf relative to the instructions that use their results.