Open pattop opened 3 years ago
I managed to get someone to confirm this on e200z6: lwz does not zero the high 32 bits.
I can verify this on e200z6 and e200z759n3 hardware. Unfortunately I don't have access to a real e500 machine.
Due to covid restrictions I can't personally do this for at least a week, but I'll see if I can get someone else to do a quick check.
This passes on QEMU's ppce500 machine:
spe_lwz_test: stwu 1, -16(1) li 3, 0 stw 3, 8(1) li 3, 0xab li 4, 0xcd evmergelo 5, 3, 4 / r5(0:63) = r3(32:63)r4(32:63) = 0x000000ab000000cd / lwz 5, 8(1) / should load 0 to low part of r5 = 0x000000ab00000000 / evmergehi 6, 5, 5 / r6(0:63) = r5(0:31)r5(0:31) = 0x000000ab0x000000ab / twne 6, 3 / assert(r6(32:63) == r3(32:63)) / addi 1, 1, 16 blr
My experience with the e500v2 core suggests that touching the lower 32 bits of the GPRs does in fact zero out the upper 32 bits. I needed to commit https://svnweb.freebsd.org/base?view=revision&revision=355045 to work around some bizarreness I saw on FreeBSD with the upper bits getting zeroed out from under a process's knowledge.
Nemanja, if Partick's observations are correct, I can make some of the changes, those to the ABI registers I think. I'm not sure how to only push the sub-registers and not the full SPE registers, if only the GPRs are touched. Is there any example of doing that already, that I can look at?
Ah, OK. Now it makes sense. Thanks for tracking that down.
Justin, do you think you can post a fix for this?
I've done some further research here.
PowerISA 2.07B allows for a 32-bit implementation to implement only bits 32:63 of the GPRs (in Appendix C).
My understanding is that cores including the SPE APU (like e200,e500) do this, so the normal powerpc instructions (including lwz) are implemented as 32-bit only.
The ISA then describes the extended SPE GPRs on 32-bit implementations in '8.3.2 GPR Registers' as follows:
"The SPE requires a GPR register file with thirty-two 64-bit registers. For 32-bit implementations, instructions that normally operate on a 32-bit register file access and change only the least significant 32-bits ofthe GPRs leaving the most significant 32-bits unchanged."
I don't think that is the case for Book-E.
See https://www.nxp.com/docs/en/reference-manual/E500CORERM.pdf 1.1.2 Core Complex Summary (page 41).
"Although the GPRs are 64 bits wide, only SPE APU, DPFP (e500v2 only), and embedded vector floating-point instructions operate on the upper word of the GPRs; the upper 32 bits are not affected by other 32-bit instructions."
Nemanja, I think the gcc code is valid.
gcc is using the 32-bit r31 to save r4 across the call to bar(). It correctly saves/restores r31 from the stack without touching the extended r31.
From the caller's perspective r31 is unaltered across the call to foo.
Linux switches the SPE register context only if it is used by a thread. Unnecessarily accessing the extended part of a gpr causes extra kernel traps and context management.
The lwz instruction clears the upper 32 bits of the target GPR. So if the ABI specifies that the full 64 bits of the register are callee-saved, this code does not achieve that. Of course, the LLVM code has an lwz after evldd so it suffers from the same problem.
This is why I mentioned that this must mean that the ABI does not specify the full register as callee-saved but just the low 32 bits.
Lets say that foo is called by main and that r31 is used as a full 64 bit register prior to the call in main. The call to foo will save the low word of r31 at address 0x8 and will restore the low half at address 0x20, clearing the upper 32 bits. If main then tries to use the upper half of r31, it'll just get zeros. So the upper half wasn't saved by the call - it is thereby caller saved rather than callee saved.
Nemanja, I think the gcc code is valid.
gcc is using the 32-bit r31 to save r4 across the call to bar(). It correctly saves/restores r31 from the stack without touching the extended r31.
From the caller's perspective r31 is unaltered across the call to foo.
Linux switches the SPE register context only if it is used by a thread. Unnecessarily accessing the extended part of a gpr causes extra kernel traps and context management.
Justin Hibbits will probably know much more about this than I do so I added him to the CC list. Sorry that I missed this until now.
Justin, I don't really know much about SPE and the ABI it uses, but it is quite clear that what we are doing with these spills is not reasonable:
- We spill r30 to two different stack slots using STW and EVSTDD
- We reload it from both slots with the LWZ clobbering the EVLDD
I assume that this is because S30 is a super register of R30 so when we clobber the subregiter, we clobber the super register as well. Then we mark both for spilling.
This can probably be fixed up in PPCFrameLowering::determineCalleeSaves() so we only spill one. Or CSR_SVR432_SPE can be changed to only have the one we want to save/restore.
Nemanja, you're right it's extremely suboptimal. Someone (I forget who) had submitted a review (committed?) to elide sub-register saves, only saving the full register, which would eliminate the problems you listed.
Regarding saving/restoring the super-register S30 and S29, I agree at least S30 should be cleaned up and only R30 saved, per the ABI. R29 is a different story, since it's conditional on isPositionIndependent() and isSVR4ABI(), but given those conditions it should be limited.
Justin Hibbits will probably know much more about this than I do so I added him to the CC list. Sorry that I missed this until now.
Justin, I don't really know much about SPE and the ABI it uses, but it is quite clear that what we are doing with these spills is not reasonable:
I assume that this is because S30 is a super register of R30 so when we clobber the subregiter, we clobber the super register as well. Then we mark both for spilling.
This can probably be fixed up in PPCFrameLowering::determineCalleeSaves() so we only spill one. Or CSR_SVR432_SPE can be changed to only have the one we want to save/restore.
However, what I find kind of questionable is how the GCC code listed here is valid. In LLVM, we mark the 64-bit SPE registers as callee-saved which means they need to be spilled/reloaded in the prologue/epilogue. The GCC implementation on the other hand seems to only save/restore the 32-bit subregister.
Ultimately, foo here clobbers callee-saved registers here and it is a question of the ABI as to whether the full 64-bit register is callee-saved or just the 32-bit subregister.
Ping? This is a showstopper for performance critical code.
This causes significant performance regressions for code written to use embedded float scalar operations as Linux traps on the evstdd and marks the process as using SPE.
Extended Description
For the following test case:
extern float bar(float); float foo(float a, float b) { return bar(a) * b; }
GCC 8.4.0 generates
00000000:
0: 94 21 ff f0 stwu r1,-16(r1)
4: 7c 08 02 a6 mflr r0
8: 93 e1 00 0c stw r31,12(r1)
c: 7c 9f 23 78 mr r31,r4
10: 90 01 00 14 stw r0,20(r1)
14: 48 00 00 01 bl 14 <foo+0x14>
18: 80 01 00 14 lwz r0,20(r1)
1c: 10 63 fa c8 efsmul r3,r3,r31
20: 83 e1 00 0c lwz r31,12(r1)
24: 38 21 00 10 addi r1,r1,16
28: 7c 08 03 a6 mtlr r0
2c: 4e 80 00 20 blr
LLVM 12.0.0 generates
00000000:
0: 7c 08 02 a6 mflr r0
4: 90 01 00 04 stw r0,4(r1)
8: 94 21 ff d0 stwu r1,-48(r1)
c: 93 c1 00 28 stw r30,40(r1)
10: 93 a1 00 24 stw r29,36(r1)
14: 13 a1 0b 21 evstdd r29,8(r1)
18: 7c 9d 23 78 mr r29,r4
1c: 13 c1 13 21 evstdd r30,16(r1)
20: 48 00 00 05 bl 24 <foo+0x24>
24: 7f c8 02 a6 mflr r30
28: 3f de 00 00 addis r30,r30,0
2c: 3b de 00 00 addi r30,r30,0
30: 48 00 00 01 bl 30 <foo+0x30>
34: 10 63 ea c8 efsmul r3,r3,r29
38: 13 c1 13 01 evldd r30,16(r1)
3c: 13 a1 0b 01 evldd r29,8(r1)
40: 83 a1 00 24 lwz r29,36(r1)
44: 80 01 00 34 lwz r0,52(r1)
48: 83 c1 00 28 lwz r30,40(r1)
4c: 38 21 00 30 addi r1,r1,48
50: 7c 08 03 a6 mtlr r0
54: 4e 80 00 20 blr
It looks like the biggest problem is that LLVM is unnecessarily storing the extended (64-bit) versions of r29 and r30.