keeleysam / tenfourfox

Automatically exported from code.google.com/p/tenfourfox

Reschedule CTR-relative branching better for G5 #135

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
                0x3847cc8   mflr   r0          3:1  !  Inline, Stall=2
                0x3847ccc   stw    r0,124(r1)  3:1
                0x3847cd0   mtctr  r12         3:1     Stall=2
         99.3%  0x3847cd4   bctrl              2:1
                0x3847cd8   lwz    r0,124(r1)  3:1     Stall=2
    100.0% 0.7% 0x3847cdc   mtlr   r0          3:1     Stall=2
                0x3847ce0   blr                2:1

I can't believe I didn't see this before. If we reschedule this as

mtctr r12
mflr r0
stw r0,...
bctrl

we can suck the entire "prelude" into a single dispatch group.
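As a toy illustration (a hypothetical model, not the real 970 group-formation rules), counting the instructions separating mtctr from bctrl shows what the reorder buys: the CTR write gets two extra slots to drain before the branch consumes it.

```python
# Toy model (assumption: more distance between the CTR write and the
# branch that consumes it means less stall; real 970 cracking and
# dispatch-group formation rules are more involved than this).
def mtctr_to_bctrl_distance(seq):
    """Number of instructions between the mtctr and the bctrl."""
    mnemonics = [insn.split()[0] for insn in seq]
    return mnemonics.index("bctrl") - mnemonics.index("mtctr") - 1

before = ["mflr r0", "stw r0,124(r1)", "mtctr r12", "bctrl"]
after  = ["mtctr r12", "mflr r0", "stw r0,124(r1)", "bctrl"]

print(mtctr_to_bctrl_distance(before))  # 0: bctrl stalls on the SPR write
print(mtctr_to_bctrl_distance(after))   # 2: two instructions of slack
```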

Original issue reported on code.google.com by classi...@floodgap.com on 9 Mar 2012 at 2:55

GoogleCodeExporter commented 9 years ago
Actually, it might be even better to force the mtctr and bctrl into separate 
dispatch groups as the attached file from LLVM demonstrates.

Original comment by classi...@floodgap.com on 9 Mar 2012 at 2:59

Attachments:

GoogleCodeExporter commented 9 years ago
I think the original idea is optimal, perhaps with an extra nop thrown in 
conditionally:

mtctr
mflr
stw
nop
bctrl

The bctrl is super hot in Shark.

Plus, we should reschedule all branches to match this. We can't put the mtctr 
before r0 is loaded, but we can split mtctr and bctrl into separate dispatch 
groups.

Original comment by classi...@floodgap.com on 9 Mar 2012 at 3:06

GoogleCodeExporter commented 9 years ago
Actually, make that two nops, so that bctrl is definitely in a different group; 
otherwise it might take the fifth slot of the same one.
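Back-of-the-envelope check (assuming, as a simplification, that a dispatch group holds up to five slots filled strictly in program order): with a three-instruction prelude, one nop still leaves bctrl in the fifth slot of the same group, so two nops are needed to push it into the next one.

```python
# Simplified slot arithmetic (assumption: five slots per dispatch group,
# filled in program order, no instruction cracking).
GROUP_SLOTS = 5

def nops_to_split(prelude_len, group_slots=GROUP_SLOTS):
    """Nops needed so the next instruction starts a new dispatch group."""
    return max(0, group_slots - prelude_len)

# Prelude is mtctr, mflr, stw -> three slots used, so two nops are needed.
print(nops_to_split(3))  # 2
```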

Original comment by classi...@floodgap.com on 9 Mar 2012 at 3:08

GoogleCodeExporter commented 9 years ago
Breakdown with v8/run.js: 66.3% of the time in this routine is sucked up in the 
Veneer bctrl!

    21.7%   0.6%   0x423108   mflr   r0          3:1  !  Inline, Stall=2
                   0x42310c   stw    r0,124(r1)  3:1
    34.8%   0.9%   0x423110   mtctr  r12         3:1     Stall=2
           66.3%   0x423114   bctrl              2:1
    13.0%  31.4%   0x423118   lwz    r0,124(r1)  3:1     Stall=2
    26.1%   0.7%   0x42311c   mtlr   r0          3:1     Stall=2
     4.3%   0.1%   0x423120   blr                2:1

3.4% of the total runtime in JS is occupied by the Veneer and related stub 
calls, which is the vast majority of the time spent in JIT. 

Original comment by classi...@floodgap.com on 9 Mar 2012 at 3:17

GoogleCodeExporter commented 9 years ago
A 2% improvement in V8 simply by rescheduling. The optimal sequence was:

    ; Prepare to call r12.
    mtctr r12

    ; Stash LR in the reserved spot in the VMFrame. (second group)
    mflr r0
    stw r0, 124(r1)
#if defined(_PPC970_)
    ; Keep bctrl away from mtctr! This appears to be the optimal scheduling.
    ; If they are together, G5 pays a huge penalty, more than other SPRs.
    ; It actually got worse with two nops, and putting the stw with bctrl.
    nop
    nop
    nop
#endif

    ; Branch. (third group)
    bctrl

Original comment by classi...@floodgap.com on 9 Mar 2012 at 4:03

GoogleCodeExporter commented 9 years ago
The next step is to change BaseAssembler so that instead of using 
getFallibleCallTarget, it gets an optimized sequence for a veneer call:

    li32  r0, JaegerStubVeneer
    mtctr r0            ; << new dispatch group
    li32  r12, stub
    nop
    nop
    bctrl
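A sketch of what that emitter change might look like (the helper name, the emit callback, and the example addresses are all hypothetical; the real BaseAssembler plumbing differs, and li32 stands for the usual lis/ori pair):

```python
def emit_veneer_call(emit, veneer_addr, stub_addr):
    """Emit the optimized veneer-call stanza (hypothetical helper)."""
    emit("li32  r0, %#x" % veneer_addr)   # load the veneer address
    emit("mtctr r0")                      # CTR write; starts a new group
    emit("li32  r12, %#x" % stub_addr)    # stub pointer the veneer calls
    emit("nop")                           # padding so bctrl lands in the
    emit("nop")                           #   next dispatch group
    emit("bctrl")

code = []
emit_veneer_call(code.append, 0x3847cc8, 0x423108)
print("\n".join(code))
```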

Original comment by classi...@floodgap.com on 9 Mar 2012 at 5:03

GoogleCodeExporter commented 9 years ago
We couldn't do that, due to the same problem ARM has: we can't tell what we're 
patching.

Changing the stanzas to separate the mtctr and bctrl degraded runtime, probably 
because we inserted so many nops that we turned too many near branches into far 
ones. So we'll just stick with this, plus fixing m_call_reg() and x_baa().

Original comment by classi...@floodgap.com on 9 Mar 2012 at 5:38

GoogleCodeExporter commented 9 years ago
Shipping

Original comment by classi...@floodgap.com on 11 Mar 2012 at 2:55

GoogleCodeExporter commented 9 years ago
In this PDF about PowerPC 970FX errata I found one that may well affect us: 
https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/79B6E24422AA101287256E93006C957E/$file/PowerPC_970FX_errata_DD3.X_V1.7.pdf

Erratum 20:
Possible branch prediction performance degradation for 32-bit mode applications.
In 32-bit mode applications only, the branch prediction mechanism will not 
always work as intended for Branch Conditional to Link Register / Count 
Register instructions (bclr[l], bcctr[l]).
This only has performance implications (no functional impact) and only affects 
32-bit mode applications when performing backwards branching.
The software workaround is to clear out the upper 32-bits of the source when 
writing the CTR or LR via the move-to-CTR/LR instructions in 32-bit mode only.
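A sketch of that workaround (assuming the 970's 64-bit instructions are usable from 32-bit mode; rldicl here is illustrative, not code TenFourFox actually ships):

    ; Erratum 20 workaround sketch: clear the upper 32 bits of the
    ; branch target before the SPR move.
    rldicl r12, r12, 0, 32   ; r12 &= 0xFFFFFFFF (rotate 0, keep bits 32-63)
    mtctr  r12
    bctrl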

Original comment by Tobias.N...@gmail.com on 4 Oct 2012 at 6:18