p12tic / libsimdpp

Portable header-only C++ low level SIMD library
Boost Software License 1.0

Broken __vectorcall and very bad register allocation using the VC++ compiler. #116

Open ActuallyaDeviloper opened 6 years ago

ActuallyaDeviloper commented 6 years ago

Today I was trying out whether libsimdpp would be a good fit for our project, which currently makes heavy use of performance-critical SSE SIMD instructions. Unfortunately, while doing so I ran into exceptionally bad code generation for the following simple test function:

__declspec(noinline) simdpp::float32x4 __vectorcall do(simdpp::float32x4 a, simdpp::float32x4 b)
{
    a = a + b;
    a = a * b;
    a = a + b;
    a = a * b;
    a = a + b;
    a = a * b;
    a = a + b;
    a = a * b;
    a = a + b;
    a = a * b;
    return a;
}

It generates this machine code in an MSVC 2015 x64 release build with default settings:

do:
00007FF72CBF1090 41 0F 28 00          movaps      xmm0,xmmword ptr [r8]  
00007FF72CBF1094 48 8B C1             mov         rax,rcx  
00007FF72CBF1097 0F 58 02             addps       xmm0,xmmword ptr [rdx]  
00007FF72CBF109A 0F 29 02             movaps      xmmword ptr [rdx],xmm0  
00007FF72CBF109D 41 0F 59 00          mulps       xmm0,xmmword ptr [r8]  
00007FF72CBF10A1 0F 29 02             movaps      xmmword ptr [rdx],xmm0  
00007FF72CBF10A4 41 0F 28 08          movaps      xmm1,xmmword ptr [r8]  
00007FF72CBF10A8 0F 58 C8             addps       xmm1,xmm0  
00007FF72CBF10AB 0F 29 0A             movaps      xmmword ptr [rdx],xmm1  
00007FF72CBF10AE 0F 28 C1             movaps      xmm0,xmm1  
00007FF72CBF10B1 41 0F 59 00          mulps       xmm0,xmmword ptr [r8]  
00007FF72CBF10B5 0F 29 02             movaps      xmmword ptr [rdx],xmm0  
00007FF72CBF10B8 41 0F 58 00          addps       xmm0,xmmword ptr [r8]  
00007FF72CBF10BC 0F 29 02             movaps      xmmword ptr [rdx],xmm0  
00007FF72CBF10BF 41 0F 59 00          mulps       xmm0,xmmword ptr [r8]  
00007FF72CBF10C3 0F 29 02             movaps      xmmword ptr [rdx],xmm0  
00007FF72CBF10C6 41 0F 58 00          addps       xmm0,xmmword ptr [r8]  
00007FF72CBF10CA 0F 29 02             movaps      xmmword ptr [rdx],xmm0  
00007FF72CBF10CD 41 0F 59 00          mulps       xmm0,xmmword ptr [r8]  
00007FF72CBF10D1 0F 29 02             movaps      xmmword ptr [rdx],xmm0  
00007FF72CBF10D4 41 0F 58 00          addps       xmm0,xmmword ptr [r8]  
00007FF72CBF10D8 0F 29 02             movaps      xmmword ptr [rdx],xmm0  
00007FF72CBF10DB 41 0F 59 00          mulps       xmm0,xmmword ptr [r8]  
00007FF72CBF10DF 0F 29 02             movaps      xmmword ptr [rdx],xmm0  
00007FF72CBF10E2 0F 11 01             movups      xmmword ptr [rcx],xmm0  
00007FF72CBF10E5 C3                   ret

The value is repeatedly written to and read back from the stack for no apparent reason. Note that perfect code would consist of just a series of addps and mulps instructions. Perfect code is generated if ordinary SSE intrinsics are used. The end result is similar with MSVC 2017.
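For comparison, the intrinsics version referred to above can be sketched roughly as follows. This is a minimal reconstruction, not the exact code that was measured: the function name `do_intrinsics` and the `TEST_NOINLINE` / `TEST_VECTORCALL` macros are assumptions added here so that the snippet also builds with non-MSVC compilers.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps, _mm_mul_ps

// Portability shims (assumptions, not part of the original report).
// __vectorcall only exists on MSVC-compatible compilers.
#if defined(_MSC_VER)
#  define TEST_NOINLINE   __declspec(noinline)
#  define TEST_VECTORCALL __vectorcall
#else
#  define TEST_NOINLINE   __attribute__((noinline))
#  define TEST_VECTORCALL
#endif

// Hypothetical name; same add/mul chain as the libsimdpp test function,
// but operating directly on the native __m128 type.
TEST_NOINLINE __m128 TEST_VECTORCALL do_intrinsics(__m128 a, __m128 b)
{
    a = _mm_add_ps(a, b);
    a = _mm_mul_ps(a, b);
    a = _mm_add_ps(a, b);
    a = _mm_mul_ps(a, b);
    a = _mm_add_ps(a, b);
    a = _mm_mul_ps(a, b);
    a = _mm_add_ps(a, b);
    a = _mm_mul_ps(a, b);
    a = _mm_add_ps(a, b);
    a = _mm_mul_ps(a, b);
    return a;
}
```

With this form both arguments are plain `__m128` values, so there is no wrapper object for the compiler to keep in memory between operations.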

I believe that the problem is twofold:

I have also considered filing a bug report with the Microsoft compiler team, but because of my past experience with them, and because the fix would break their ABI (which I believe has been stable since 2015), I have decided against it. A fix seems unlikely.

It would be great if simdpp could make less use of inheritance, or find another way to mitigate the problem (i.e. make __vectorcall work), in a future release.
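Until then, one possible mitigation is to keep the wrapper type out of the function signature and pass only the native register type across the call boundary. The sketch below illustrates the idea under stated assumptions: `VecBase`, `Vec4`, `chain_native` and the `SKETCH_*` macros are hypothetical stand-ins written for this example, not libsimdpp types or API, and whether MSVC then keeps everything in registers has not been verified here.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps, _mm_mul_ps

// Portability shims (assumptions); __vectorcall is MSVC-specific.
#if defined(_MSC_VER)
#  define SKETCH_NOINLINE   __declspec(noinline)
#  define SKETCH_VECTORCALL __vectorcall
#else
#  define SKETCH_NOINLINE   __attribute__((noinline))
#  define SKETCH_VECTORCALL
#endif

// Hypothetical CRTP base, standing in for a wrapper built on inheritance.
template <class Derived>
struct VecBase {};

// Hypothetical wrapper class; NOT libsimdpp's float32x4.
struct Vec4 : VecBase<Vec4> {
    __m128 v;
    explicit Vec4(__m128 x) : v(x) {}
    __m128 native() const { return v; }
};

inline Vec4 operator+(Vec4 a, Vec4 b)
{
    return Vec4(_mm_add_ps(a.native(), b.native()));
}

inline Vec4 operator*(Vec4 a, Vec4 b)
{
    return Vec4(_mm_mul_ps(a.native(), b.native()));
}

// Only raw __m128 values cross the call boundary; the wrapper is
// constructed and unwrapped inside the function body.
SKETCH_NOINLINE __m128 SKETCH_VECTORCALL chain_native(__m128 a_in, __m128 b_in)
{
    Vec4 a(a_in);
    Vec4 b(b_in);
    a = a + b;
    a = a * b;
    return a.native();
}
```

The trade-off is that every non-inlined function needs this wrap/unwrap step at its edges, so it only helps where the call boundary itself is the problem.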

Cazadorro commented 6 years ago

File a bug report anyway?

Horus86 commented 6 years ago

After checking the assembly on my machine, I noticed the same problem.

p12tic commented 6 years ago

A fix for this issue is in progress. A test suite that inspects generated instruction counts is being developed. When finished, it should help identify all performance issues in the generated code across all the compilers that the library supports.

peabody-korg commented 4 years ago

Any progress on this, or advice on how to work around this problem?