Actually, it might be even better to force the mtctr and bctrl into separate
dispatch groups as the attached file from LLVM demonstrates.
Original comment by classi...@floodgap.com
on 9 Mar 2012 at 2:59
Attachments:
I think the original idea is the best, perhaps with an extra nop thrown in conditionally:
mtctr r12
mflr r0
stw r0,124(r1)
nop
bctrl
The bctrl is super hot in Shark.
Plus, we should reschedule all branches to match this. We can't put the mtctr
before r0 is loaded, but we can split mtctr and bctrl into separate dispatch
groups.
Original comment by classi...@floodgap.com
on 9 Mar 2012 at 3:06
Actually, make that two nops, so that the bctrl definitely lands in a different group; with only one nop it might just take the fifth slot of the same group.
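To spell out the concern (a sketch, assuming the 970's usual group formation of up to five slots with the fifth reserved for branches):
mtctr r12        ; group 1, slot 1
mflr r0          ; group 1, slot 2
stw r0,124(r1)   ; group 1, slot 3
nop              ; group 1, slot 4
nop              ; opens group 2; with only one nop, the bctrl could
bctrl            ; still take group 1's branch slot -- now it can't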
Original comment by classi...@floodgap.com
on 9 Mar 2012 at 3:08
Breakdown with v8/run.js: 66.3% of the time in this routine is sucked up in the Veneer's bctrl!!!
21.7%   0.6%   0x423108   mflr r0          3:1   ! Inline, Stall=2
               0x42310c   stw r0,124(r1)   3:1
34.8%   0.9%   0x423110   mtctr r12        3:1   Stall=2
66.3%          0x423114   bctrl            2:1
13.0%  31.4%   0x423118   lwz r0,124(r1)   3:1   Stall=2
26.1%   0.7%   0x42311c   mtlr r0          3:1   Stall=2
 4.3%   0.1%   0x423120   blr              2:1
The Veneer and related stub calls account for 3.4% of the total JS runtime, which is the vast majority of the time spent in the JIT.
Original comment by classi...@floodgap.com
on 9 Mar 2012 at 3:17
A 2% improvement in V8 simply from rescheduling. The optimal sequence was:
; Prepare to call r12.
mtctr r12
; Stash LR in the reserved spot in the VMFrame. (second group)
mflr r0
stw r0, 124(r1)
#if defined(_PPC970_)
; Keep bctrl away from mtctr! This appears to be the optimal scheduling.
; If they are together, G5 pays a huge penalty, more than other SPRs.
; It actually got worse with only two nops, and with the stw moved next to the bctrl.
nop
nop
nop
#endif
; Branch. (third group)
bctrl
Original comment by classi...@floodgap.com
on 9 Mar 2012 at 4:03
The next step is to change BaseAssembler so that instead of using
getFallibleCallTarget, it gets an optimized sequence for a veneer call:
li32 r0, JaegerStubVeneer
mtctr r0   ; << new dispatch group
li32 r12, stub
nop
nop
bctrl
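For reference, assuming each li32 expands to the usual lis/ori pair (a sketch; the macro presumably has to stay two instructions so the constant remains patchable), the actual instruction stream would be roughly:
lis r0, hi16(JaegerStubVeneer)
ori r0, r0, lo16(JaegerStubVeneer)
mtctr r0                  ; << new dispatch group
lis r12, hi16(stub)
ori r12, r12, lo16(stub)
nop
nop
bctrl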
Original comment by classi...@floodgap.com
on 9 Mar 2012 at 5:03
We couldn't solve that, for the same reason ARM can't: we can't tell what we're patching.
Changing the stanzas to separate the mtctr and bctrl degraded runtime, probably because we inserted so many nops that we turned too many near branches into far ones (see the sketch below). So we'll just stick with this and fix m_call_reg() and x_baa().
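For context, a near branch on PPC is a single instruction (b reaches about +/-32MB; a conditional bc only about +/-32KB), while a far branch needs a full stanza through CTR; something like (a sketch):
b target                   ; near: one instruction, displacement fits
lis r0, hi16(target)       ; far: target out of range, so load the full
ori r0, r0, lo16(target)   ; 32-bit address...
mtctr r0
bctr                       ; ...and branch through CTR
Every nop we pad in pushes more targets out of displacement range, so more branches fall into the expensive far form.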
Original comment by classi...@floodgap.com
on 9 Mar 2012 at 5:38
Shipping
Original comment by classi...@floodgap.com
on 11 Mar 2012 at 2:55
In this PDF about PowerPC 970FX errata I found one that may well affect us:
https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/79B6E24422AA101287256E93006C957E/$file/PowerPC_970FX_errata_DD3.X_V1.7.pdf
Erratum 20:
Possible branch prediction performance degradation for 32-bit mode applications.
In 32-bit mode applications only, the branch prediction mechanism will not
always work as intended for Branch Conditional to Link Register / Count
Register instructions (bclr[l], bcctr[l]).
This only has performance implications (no functional impact) and only affects
32-bit mode applications when performing backwards branching.
The software workaround is to clear the upper 32 bits of the source register when writing the CTR or LR via the move-to-CTR/LR instructions, in 32-bit mode only.
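In instruction terms, the workaround would look something like this (a sketch; clrldi rA,rS,32 is the standard mnemonic for rldicl rA,rS,0,32):
clrldi r12, r12, 32   ; zero bits 0-31, the upper word of the source
mtctr r12             ; mtctr now sees a clean 32-bit value
bctrl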
Original comment by Tobias.N...@gmail.com
on 4 Oct 2012 at 6:18
Original issue reported on code.google.com by
classi...@floodgap.com
on 9 Mar 2012 at 2:55