LLVM 12 generates poor PowerPC-SPE code

pattop commented 3 years ago


Bugzilla Link	50088
Version	trunk
OS	Linux
CC	@chmeeedalf,@nemanjai

Extended Description

For the following test case:

extern float bar(float); float foo(float a, float b) { return bar(a) * b; }

GCC 8.4.0 generates

00000000 : 0: 94 21 ff f0 stwu r1,-16(r1) 4: 7c 08 02 a6 mflr r0 8: 93 e1 00 0c stw r31,12(r1) c: 7c 9f 23 78 mr r31,r4 10: 90 01 00 14 stw r0,20(r1) 14: 48 00 00 01 bl 14 <foo+0x14> 18: 80 01 00 14 lwz r0,20(r1) 1c: 10 63 fa c8 efsmul r3,r3,r31 20: 83 e1 00 0c lwz r31,12(r1) 24: 38 21 00 10 addi r1,r1,16 28: 7c 08 03 a6 mtlr r0 2c: 4e 80 00 20 blr

LLVM 12.0.0 generates

00000000 : 0: 7c 08 02 a6 mflr r0 4: 90 01 00 04 stw r0,4(r1) 8: 94 21 ff d0 stwu r1,-48(r1) c: 93 c1 00 28 stw r30,40(r1) 10: 93 a1 00 24 stw r29,36(r1) 14: 13 a1 0b 21 evstdd r29,8(r1) 18: 7c 9d 23 78 mr r29,r4 1c: 13 c1 13 21 evstdd r30,16(r1) 20: 48 00 00 05 bl 24 <foo+0x24> 24: 7f c8 02 a6 mflr r30 28: 3f de 00 00 addis r30,r30,0 2c: 3b de 00 00 addi r30,r30,0 30: 48 00 00 01 bl 30 <foo+0x30> 34: 10 63 ea c8 efsmul r3,r3,r29 38: 13 c1 13 01 evldd r30,16(r1) 3c: 13 a1 0b 01 evldd r29,8(r1) 40: 83 a1 00 24 lwz r29,36(r1) 44: 80 01 00 34 lwz r0,52(r1) 48: 83 c1 00 28 lwz r30,40(r1) 4c: 38 21 00 30 addi r1,r1,48 50: 7c 08 03 a6 mtlr r0 54: 4e 80 00 20 blr

It looks like the biggest problem is that LLVM is unnecessarily storing the extended (64-bit) versions of r29 and r30.

pattop commented 3 years ago

I managed to get someone to confirm this on e200z6: lwz does not zero the high 32 bits.

pattop commented 3 years ago

I can verify this on e200z6 and e200z759n3 hardware. Unfortunately I don't have access to a real e500 machine.

Due to covid restrictions I can't personally do this for at least a week, but I'll see if I can get someone else to do a quick check.

This passes on QEMU's ppce500 machine:

spe_lwz_test: stwu 1, -16(1) li 3, 0 stw 3, 8(1) li 3, 0xab li 4, 0xcd evmergelo 5, 3, 4 / r5(0:63) = r3(32:63)r4(32:63) = 0x000000ab000000cd / lwz 5, 8(1) / should load 0 to low part of r5 = 0x000000ab00000000 / evmergehi 6, 5, 5 / r6(0:63) = r5(0:31)r5(0:31) = 0x000000ab0x000000ab / twne 6, 3 / assert(r6(32:63) == r3(32:63)) / addi 1, 1, 16 blr

chmeeedalf commented 3 years ago

My experience with the e500v2 core suggests that touching the lower 32 bits of the GPRs does in fact zero out the upper 32 bits. I needed to commit https://svnweb.freebsd.org/base?view=revision&revision=355045 to work around some bizarreness I saw on FreeBSD with the upper bits getting zeroed out from under a process's knowledge.

Nemanja, if Partick's observations are correct, I can make some of the changes, those to the ABI registers I think. I'm not sure how to only push the sub-registers and not the full SPE registers, if only the GPRs are touched. Is there any example of doing that already, that I can look at?

nemanjai commented 3 years ago

Ah, OK. Now it makes sense. Thanks for tracking that down.

Justin, do you think you can post a fix for this?

pattop commented 3 years ago

I've done some further research here.

PowerISA 2.07B allows for a 32-bit implementation to implement only bits 32:63 of the GPRs (in Appendix C).

My understanding is that cores including the SPE APU (like e200,e500) do this, so the normal powerpc instructions (including lwz) are implemented as 32-bit only.

The ISA then describes the extended SPE GPRs on 32-bit implementations in '8.3.2 GPR Registers' as follows:

"The SPE requires a GPR register file with thirty-two 64-bit registers. For 32-bit implementations, instructions that normally operate on a 32-bit register file access and change only the least significant 32-bits ofthe GPRs leaving the most significant 32-bits unchanged."

pattop commented 3 years ago

I don't think that is the case for Book-E.

See https://www.nxp.com/docs/en/reference-manual/E500CORERM.pdf 1.1.2 Core Complex Summary (page 41).

"Although the GPRs are 64 bits wide, only SPE APU, DPFP (e500v2 only), and embedded vector floating-point instructions operate on the upper word of the GPRs; the upper 32 bits are not affected by other 32-bit instructions."

nemanjai commented 3 years ago

Nemanja, I think the gcc code is valid.

gcc is using the 32-bit r31 to save r4 across the call to bar(). It correctly saves/restores r31 from the stack without touching the extended r31.

From the caller's perspective r31 is unaltered across the call to foo.

Linux switches the SPE register context only if it is used by a thread. Unnecessarily accessing the extended part of a gpr causes extra kernel traps and context management.

The lwz instruction clears the upper 32 bits of the target GPR. So if the ABI specifies that the full 64 bits of the register are callee-saved, this code does not achieve that. Of course, the LLVM code has an lwz after evldd so it suffers from the same problem.

This is why I mentioned that this must mean that the ABI does not specify the full register as callee-saved but just the low 32 bits.

Lets say that foo is called by main and that r31 is used as a full 64 bit register prior to the call in main. The call to foo will save the low word of r31 at address 0x8 and will restore the low half at address 0x20, clearing the upper 32 bits. If main then tries to use the upper half of r31, it'll just get zeros. So the upper half wasn't saved by the call - it is thereby caller saved rather than callee saved.

pattop commented 3 years ago

Nemanja, I think the gcc code is valid.

gcc is using the 32-bit r31 to save r4 across the call to bar(). It correctly saves/restores r31 from the stack without touching the extended r31.

From the caller's perspective r31 is unaltered across the call to foo.

Linux switches the SPE register context only if it is used by a thread. Unnecessarily accessing the extended part of a gpr causes extra kernel traps and context management.

chmeeedalf commented 3 years ago

Justin Hibbits will probably know much more about this than I do so I added him to the CC list. Sorry that I missed this until now.

Justin, I don't really know much about SPE and the ABI it uses, but it is quite clear that what we are doing with these spills is not reasonable:

We spill r30 to two different stack slots using STW and EVSTDD

We reload it from both slots with the LWZ clobbering the EVLDD

I assume that this is because S30 is a super register of R30 so when we clobber the subregiter, we clobber the super register as well. Then we mark both for spilling.

This can probably be fixed up in PPCFrameLowering::determineCalleeSaves() so we only spill one. Or CSR_SVR432_SPE can be changed to only have the one we want to save/restore.

Nemanja, you're right it's extremely suboptimal. Someone (I forget who) had submitted a review (committed?) to elide sub-register saves, only saving the full register, which would eliminate the problems you listed.

Regarding saving/restoring the super-register S30 and S29, I agree at least S30 should be cleaned up and only R30 saved, per the ABI. R29 is a different story, since it's conditional on isPositionIndependent() and isSVR4ABI(), but given those conditions it should be limited.

nemanjai commented 3 years ago

Justin Hibbits will probably know much more about this than I do so I added him to the CC list. Sorry that I missed this until now.

Justin, I don't really know much about SPE and the ABI it uses, but it is quite clear that what we are doing with these spills is not reasonable:

We spill r30 to two different stack slots using STW and EVSTDD
We reload it from both slots with the LWZ clobbering the EVLDD

I assume that this is because S30 is a super register of R30 so when we clobber the subregiter, we clobber the super register as well. Then we mark both for spilling.

This can probably be fixed up in PPCFrameLowering::determineCalleeSaves() so we only spill one. Or CSR_SVR432_SPE can be changed to only have the one we want to save/restore.

However, what I find kind of questionable is how the GCC code listed here is valid. In LLVM, we mark the 64-bit SPE registers as callee-saved which means they need to be spilled/reloaded in the prologue/epilogue. The GCC implementation on the other hand seems to only save/restore the 32-bit subregister.

Ultimately, foo here clobbers callee-saved registers here and it is a question of the ABI as to whether the full 64-bit register is callee-saved or just the 32-bit subregister.

pattop commented 3 years ago

Ping? This is a showstopper for performance critical code.

pattop commented 3 years ago

This causes significant performance regressions for code written to use embedded float scalar operations as Linux traps on the evstdd and marks the process as using SPE.

llvm / llvm-project

LLVM 12 generates poor PowerPC-SPE code #49432

Extended Description