Open llvmbot opened 11 years ago
A fairly simple experiment can be made by creating union register classes containing both GPR and SSE registers for 32-bit and 64-bit values.
Also please note what I wrote in the original ticket: this should be able to go both ways ('stash an SSE vector into two MMX or x64 registers')...
Note that in the MSVC ticket I wrote that I already tried this approach manually and actually got a speedup with MSVC... I then tried enabling the same thing for Clang builds, but it made things worse because Clang generated bad code for it (it did things like sending the MMX registers to the stack)...
Separate ticket for the Clang codegen problem: http://llvm.org/bugs/show_bug.cgi?id=17548.
Forget about MMX, SSE is the future.
What about the present? :D And the amount of time we'll be stuck with 32-bit x86? :/
Again, this is not about MMX; it is about squeezing the most out of an ISA for complex algorithms. Think of this in terms of algorithms like the FFT, where you want to avoid going to memory for as long as you possibly can (you can never have too many registers for the FFT :D)... Obviously there's a significant number of such problems that require large register sets (http://en.wikipedia.org/wiki/AltiVec#VMX128)...
This is related to the 'register bank selection' I talked about in the recent global instruction selector proposal. See http://2pi.dk/llvm/global-isel.html
Wow, so this has already been discussed? Great :)
A fairly simple experiment can be made by creating union register classes containing both GPR and SSE registers for 32-bit and 64-bit values. The register allocator's 'register class inflation' will use these classes when permitted.
By "can be made" do you mean that one can already 'trick' Clang/LLVM into doing this by using unions of builtin scalar and vector types to store such values?
I actually ran this experiment a while back and didn't get impressive results. As Ben mentions, store forwarding is extremely efficient on new Intel chips.
But it too can be exhausted/is limited by the length of the pipeline and probably other things...
Note that in the MSVC ticket I wrote that I already tried this approach manually and actually got a speedup with MSVC... I then tried enabling the same thing for Clang builds, but it made things worse because Clang generated bad code for it (it did things like sending the MMX registers to the stack)...
Spilling to MMX/SSE registers would be an interesting thing to explore
MMX and SSE were only an example (one I personally explored); it applies in general (GP, VFP, NEON...), but yes, it does matter more for poor man's x86...
- If LLVM wants to use MMX registers it has to insert "emms" before any function call and at the end of the function, otherwise x87 code in other functions will trap. emms isn't cheap.
- Moving between GPR and SSE registers comes with a penalty on modern CPUs. Memory spills on the other hand can be handled through store forwarding in some cases. It's difficult to say when this is a win and when not.
At the risk of oversimplifying, wouldn't a cost function (based on target CPU; amount, granularity and alignment of required spillage; etc.) 'solve' this...
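To make the cost-function idea concrete, here is a minimal sketch of what such a decision might look like. It is not LLVM code: `SpillCosts`, `chooseSpill` and every field name are hypothetical, the cycle counts have to be supplied by the caller per target CPU, and granularity and alignment are left out for brevity.

```cpp
// Hypothetical per-target costs in cycles. The values a caller fills in
// are assumptions for illustration, not measured data.
struct SpillCosts {
    unsigned storeCycles;         // one store to a stack slot
    unsigned reloadCycles;        // one reload (ideally served by store forwarding)
    unsigned crossBankMoveCycles; // one move between register banks (e.g. GPR <-> SSE)
    unsigned bankSetupCycles;     // fixed overhead for using the other bank (e.g. emms for MMX)
};

enum class SpillKind { Memory, Register };

// Cost of spilling once to memory and reloading `reloads` times.
inline unsigned memorySpillCost(const SpillCosts& c, unsigned reloads) {
    return c.storeCycles + reloads * c.reloadCycles;
}

// Cost of stashing once in the other register bank and moving back
// `reloads` times, plus the fixed setup overhead.
inline unsigned registerSpillCost(const SpillCosts& c, unsigned reloads) {
    return c.bankSetupCycles + c.crossBankMoveCycles * (1 + reloads);
}

// Pick the cheaper option; fall back to memory when the other bank
// has no free register or when the costs tie.
inline SpillKind chooseSpill(const SpillCosts& c, unsigned reloads,
                             bool freeRegInOtherBank) {
    if (!freeRegInOtherBank) return SpillKind::Memory;
    return registerSpillCost(c, reloads) < memorySpillCost(c, reloads)
               ? SpillKind::Register
               : SpillKind::Memory;
}
```

With a large `bankSetupCycles` (the emms case) the memory spill wins unless there are many reloads, which matches the objection that it is hard to say in general when this is a win.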
Forget about MMX, SSE is the future.
This is related to the 'register bank selection' I talked about in the recent global instruction selector proposal. See http://2pi.dk/llvm/global-isel.html
A fairly simple experiment can be made by creating union register classes containing both GPR and SSE registers for 32-bit and 64-bit values. The register allocator's 'register class inflation' will use these classes when permitted.
I actually ran this experiment a while back and didn't get impressive results. As Ben mentions, store forwarding is extremely efficient on new Intel chips.
Spilling to MMX/SSE registers would be an interesting thing to explore; it's not without problems, though.
If LLVM wants to use MMX registers it has to insert "emms" before any function call and at the end of the function, otherwise x87 code in other functions will trap. emms isn't cheap.
Moving between GPR and SSE registers comes with a penalty on modern CPUs. Memory spills on the other hand can be handled through store forwarding in some cases. It's difficult to say when this is a win and when not.
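For reference, a manual version of the GPR-to-SSE move discussed above could be sketched as follows, assuming x86-64 with SSE2. The helper names `stashInVector` and `restoreFromVector` are made up for this example, and the non-x86-64 branch is only a placeholder so the snippet compiles elsewhere. The intrinsics only express the cross-bank move; the compiler's register allocator is still free to put the vector value on the stack.

```cpp
#include <cstdint>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics

using Stash = __m128i;

// Move a 64-bit GPR value into the low half of an XMM register
// (typically a single movq xmm, r64).
inline Stash stashInVector(std::uint64_t v) {
    return _mm_cvtsi64_si128(static_cast<long long>(v));
}

// Move the low 64 bits of the XMM register back into a GPR
// (typically a single movq r64, xmm).
inline std::uint64_t restoreFromVector(Stash s) {
    return static_cast<std::uint64_t>(_mm_cvtsi128_si64(s));
}
#else
// Placeholder for other targets: no second register bank is used here.
struct Stash { std::uint64_t bits; };

inline Stash stashInVector(std::uint64_t v) { return Stash{v}; }
inline std::uint64_t restoreFromVector(Stash s) { return s.bits; }
#endif
```

Each direction costs one cross-bank move, which is the penalty that has to be weighed against a store plus a store-forwarded load.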
Extended Description
Would something like this be feasible in LLVM: https://connect.microsoft.com/VisualStudio/feedback/details/804679/msvc-implement-register-to-register-spill