simd-everywhere / simde

Implementations of SIMD instruction sets for systems which don't natively support them.
https://simd-everywhere.github.io/blog/
MIT License
2.37k stars 247 forks source link

enable SVML for VS2019 #906

Closed xinsun4atlas closed 1 year ago

xinsun4atlas commented 3 years ago

Hello, SIMDe only enables SVML when using intel compiler. Recent VS2019 comes with SVML as well, so it will be good to enable on that too. Thanks!

xinsun4atlas commented 3 years ago

Although I can enable SVML on VS2019 by predefine SIMDE_ARCH_X86_SVML, however, it will automatically enable AVX512 intrinsics in svml.h. The compile will fail if AVX512 are not available. It will be great if SVML intrinsics can be partially enabled according to the SSE/AVX settings, that only supported intrinsics are used.

tycho commented 3 years ago

I ran into this too. I don't know why SIMDE_X86_SVML_NATIVE is hardcoded to imply SIMDE_X86_AVX512F_NATIVE, since there are plenty of library functions in SVML that don't depend on any AVX code.

Additionally, even if you comment out the implicit AVX512F_NATIVE and you are targeting something without AVX, then you'll need to do this as well, because the _NATIVE feature tests for these two functions are wrong:

index 81509e96..d10dcf3f 100644
--- a/simde/x86/svml.h
+++ b/simde/x86/svml.h
@@ -8825,7 +8825,7 @@ simde_mm_clog_ps (simde__m128 a) {
 SIMDE_FUNCTION_ATTRIBUTES
 simde__m256
 simde_mm256_clog_ps (simde__m256 a) {
-  #if defined(SIMDE_X86_SVML_NATIVE) && defined(SIMDE_X86_SSE_NATIVE)
+  #if defined(SIMDE_X86_SVML_NATIVE) && defined(SIMDE_X86_AVX_NATIVE)
     return _mm256_clog_ps(a);
   #else
     simde__m256_private
@@ -8880,7 +8880,7 @@ simde_mm_csqrt_ps (simde__m128 a) {
 SIMDE_FUNCTION_ATTRIBUTES
 simde__m256
 simde_mm256_csqrt_ps (simde__m256 a) {
-  #if defined(SIMDE_X86_SVML_NATIVE) && defined(SIMDE_X86_SSE_NATIVE)
+  #if defined(SIMDE_X86_SVML_NATIVE) && defined(SIMDE_X86_AVX_NATIVE)
     return _mm256_csqrt_ps(a);
   #else
     simde__m256_private
mr-c commented 1 year ago

I would happily receive a PR to fix this

mr-c commented 1 year ago

FYI, I tried this in https://github.com/simd-everywhere/simde/commit/7720b1de544c2900c4266027bb3b927ec177d4f6 and 208 tests failed: https://ci.appveyor.com/project/nemequ/simde/builds/46916083/job/82oh98b4w8ae98fm#L6413

mr-c commented 1 year ago

FYI, I tried this in 7720b1d and 208 tests failed: https://ci.appveyor.com/project/nemequ/simde/builds/46916083/job/82oh98b4w8ae98fm#L6413

@tycho @xinsun4atlas It feels like the reasons the tests fail is due to implicit enabling of AVX512 (which is not supported by the appveyor runners); can you test my branch https://github.com/simd-everywhere/simde/tree/ci/appveyor/svml on your systems?

tycho commented 1 year ago

Surely there must be a way to mark these as "skipped" tests on hardware that doesn't support AVX512? The test runner does have counters for those types of results, after all:

Ok:                 1552
Expected Fail:      0
Fail:               222
Unexpected Pass:    0
Skipped:            0
Timeout:            0

Anyway, I don't have any desktop machines with AVX512 support. I had to resort to a recent AMD Zen 4-based laptop which did support it instead:

Ok:                 1774
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            0
Timeout:            0

That felt weird, even a recent 13th Gen Intel laptop doesn't have AVX512...

mr-c commented 1 year ago

@tycho Which /arch: flags did you build with in each scenario?

If you didn't use any, then that would confirm that there seems to be AVX512 opcodes emitted by the MSVC compiler when SVML intrinsics are being used. (Or the SVML library it links against is using AVX512, see the discussion in the comments at https://stackoverflow.com/a/69418832 )

tycho commented 1 year ago

MSVC is a bit dumb and allows you to use intrinsics from any x86 feature set without any special /arch: flags.

mr-c commented 1 year ago

Surely there must be a way to mark these as "skipped" tests on hardware that doesn't support AVX512?

Correct, we skip running the tests completely in CI when we use /arch:AVX or higher:

https://github.com/simd-everywhere/simde/blob/6ce6030ae3b8decadac030b88a8a1868e809c9e0/.appveyor.yml#L98-L105

But if a user of SIMDe is not setting /arch:AVX512 then SIMDe shouldn't be inadvertently causing AVX512 opcodes to be used

mr-c commented 1 year ago

It will be great if SVML intrinsics can be partially enabled according to the SSE/AVX settings, that only supported intrinsics are used.

It looks like this is not possible for MSVC

mr-c commented 1 year ago

Dear @xinsun4atlas

Hello, SIMDe only enables SVML when using intel compiler. Recent VS2019 comes with SVML as well, so it will be good to enable on that too. Thanks!

Well, we enable it unconditionally for the Intel Compiler, as it used to be unique to them

https://github.com/simd-everywhere/simde/blob/6ce6030ae3b8decadac030b88a8a1868e809c9e0/simde/simde-features.h#L298-L302

If your system can support the MSVC bundled SVML library, you can define SIMDE_ARCH_X86_SVML to indicate this. However it seems to use some advanced AVX intrinsics, so the resulting application/library might not be very portable.

I can't find any documentation and as you can see from the conversation here we aren't having much luck with it.

mr-c commented 1 year ago

MSVC is a bit dumb and allows you to use intrinsics from any x86 feature set without any special /arch: flags.

True, but SIMDe won't use native intrinsics just because immintrin.h was included. The test errors come from outside the SVML code, like x86/avx512/shuffle even though we don't set /arch:AVX512 so something strange is really going on

tycho commented 1 year ago

The faulting instruction in e.g. shuffle-native-c.exe is:

SIMDE_FUNCTION_ATTRIBUTES
simde__m512i
simde__m512i_from_private(simde__m512i_private v) {
00007FF76A1224F0  mov         qword ptr [rsp+8],rcx  
00007FF76A1224F5  push        rbp  
00007FF76A1224F6  sub         rsp,0A0h  
00007FF76A1224FD  lea         rbp,[rsp+60h]  
00007FF76A122502  and         rbp,0FFFFFFFFFFFFFFC0h  
  simde__m512i r;
  simde_memcpy(&r, &v, sizeof(r));
00007FF76A122506  mov         r8d,40h  
00007FF76A12250C  mov         rdx,qword ptr [&v]  
00007FF76A122514  lea         rcx,[rbp]  
00007FF76A122518  call        memcpy (07FF76A16EE56h)  
  return r;
00007FF76A12251D  vmovdqu32   zmm0,zmmword ptr [rbp]  <----
}
00007FF76A122524  add         rsp,0A0h  
00007FF76A12252B  pop         rbp  
00007FF76A12252C  ret  

So it looks like it's generating AVX512 instructions because of the simde__m512i return type. Kind of annoying.

tycho commented 1 year ago

The last comment before this typedef is informative: https://github.com/simd-everywhere/simde/blob/6ce6030ae3b8decadac030b88a8a1868e809c9e0/simde/x86/avx512/types.h#L506-L537

It seems like the !defined(HEDLEY_INTEL_VERSION) check is rather suspect. Since MSVC != Intel compiler, it enables it regardless of SIMDE_X86_AVX512F_NATIVE. I think that condition should just be:

#if (defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT)) && defined(SIMDE_X86_AVX512F_NATIVE)

Or something like that?

tycho commented 1 year ago

To add on to that, this change does indeed make the tests pass on non-AVX512 hardware (and without additional /arch: flags) with MSVC 2019:

diff --git a/simde/x86/avx512/types.h b/simde/x86/avx512/types.h
index d4e5618a..09e8b21e 100644
--- a/simde/x86/avx512/types.h
+++ b/simde/x86/avx512/types.h
@@ -527,7 +527,7 @@ typedef union {
  *
  * As for the ICC check, unlike other compilers, merely using the
  * AVX-512 types causes ICC to generate AVX-512 instructions. */
-#if (defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT)) && (defined(SIMDE_X86_AVX512F_NATIVE) || !defined(HEDLEY_INTEL_VERSION))
+#if (defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT)) && defined(SIMDE_X86_AVX512F_NATIVE)
   typedef __m512 simde__m512;
   typedef __m512i simde__m512i;
   typedef __m512d simde__m512d;
mr-c commented 1 year ago

@tycho Can you write a test program to use `#include "simde/x86/avx512/types.h" from https://github.com/simd-everywhere/simde/tree/ci/appveyor/svml on MSVC without /arch and print the following information?

value of _MM_CMPINT_GE value of _MM_CMPINT_NLT value of SIMDE_X86_AVX512F_NATIVE value of HEDLEY_INTEL_VERSION

type of simde__m512i

value of _AVX512BF16INTRIN_H_INCLUDED value of __AVX512BF16INTRIN_H value of SIMDE_X86_AVX512BF16_NATIVE

type of simde__m256bh

value of __AVX__ value of __AVX2__ value of _MSC_VER value of __AVX512F__ value of SIMDE_ARCH_X86_AVX512F value of SIMDE_X86_AVX512F_NATIVE

mr-c commented 1 year ago

The last comment before this typedef is informative:

https://github.com/simd-everywhere/simde/blob/6ce6030ae3b8decadac030b88a8a1868e809c9e0/simde/x86/avx512/types.h#L506-L537

It seems like the !defined(HEDLEY_INTEL_VERSION) check is rather suspect. Since MSVC != Intel compiler, it enables it regardless of SIMDE_X86_AVX512F_NATIVE. I think that condition should just be:

#if (defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT)) && defined(SIMDE_X86_AVX512F_NATIVE)

Or something like that?

Ah, I think it was meant to be && !defined(HEDLEY_INTEL_VERSION)) , AND not OR.

mr-c commented 1 year ago

hmm.. maybe not; then ICC would never get the AVX512 types..

tycho commented 1 year ago

Yeah I don't know what the condition was trying to do. I think it should just be the condition I put in the diff, because the Windows version of the Intel compiler matches MSVC behavior.

mr-c commented 1 year ago

Here's the original commit: https://github.com/simd-everywhere/simde/commit/4637a303d

mr-c commented 1 year ago

So the original thinking was that checking for defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT) was enough to detect AVX512F being available.

But Intel would still define those and emit AVX512 opcodes when any AVX512 types were used. It seems that MSVC compilers have the same problem. So maybe the answer is

diff --git a/simde/x86/avx512/types.h b/simde/x86/avx512/types.h
index d4e5618a..2df43909 100644
--- a/simde/x86/avx512/types.h
+++ b/simde/x86/avx512/types.h
@@ -527,7 +527,7 @@ typedef union {
  *
  * As for the ICC check, unlike other compilers, merely using the
  * AVX-512 types causes ICC to generate AVX-512 instructions. */
-#if (defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT)) && (defined(SIMDE_X86_AVX512F_NATIVE) || !defined(HEDLEY_INTEL_VERSION))
+#if (defined(_MM_CMPINT_GE) || defined(_MM_CMPINT_NLT)) && (defined(SIMDE_X86_AVX512F_NATIVE) || !(defined(HEDLEY_INTEL_VERSION) || defined(HEDLEY_MSVC_VERSION)))
   typedef __m512 simde__m512;
   typedef __m512i simde__m512i;
   typedef __m512d simde__m512d;
tycho commented 1 year ago

I think that would be wrong with clang-cl though, because that one behaves more sanely with regards to code generation for instruction sets not enabled in compiler flags.

(clang-cl calls itself MSVC, so you'd need something like !(defined(INTEL) || ((defined(MSVC) && !defined(CLANG)) or whatever it would be with the HEDLEY macros.)

mr-c commented 1 year ago

Thanks for the tip, @tycho !

Fingers crossed: https://ci.appveyor.com/project/nemequ/simde/builds/48327075/job/3lnyo84d655gnsd8 https://ci.appveyor.com/project/nemequ/simde/build/job/oxi8lp9bcy2l3n5u https://ci.appveyor.com/project/nemequ/simde/build/job/p4w61kdgj3n9mrc3

mr-c commented 1 year ago

Success!

tycho commented 1 year ago

@mr-c By the way, if you're curious, here's what clang (and clang-cl) would do on Windows with __m512 arguments and return type:

https://gcc.godbolt.org/z/qv4eMqWWa

Trying to use an AVX512 intrinsic is only allowed by clang/clang-cl if the appropriate feature is enabled too:

https://gcc.godbolt.org/z/a37vzjPa3

If only this behavior was standard for MSVC as well...

mr-c commented 1 year ago

Great to know, thank you!

If you have a bit of time, it would be helpful to add ClangCL to our testing setup. Your choice of CI provider 😊🙏

xinsun4atlas commented 1 year ago

Thanks for the fix! For some reasons we deprecated the project which was using this feature that I cannot test on my side. Appreciate the work and definitely will take a try if we get a chance in our new projects.

mr-c commented 3 months ago

@xinsun4atlas @tycho Due to issues with the SVML pow functions on 32-bit systems, we are thinking about not automatically enabling SVML on MSVC for x86 (but still enabling native SVML automatically on all x64 systems)

https://github.com/simd-everywhere/simde/issues/1111#issuecomment-2227349342

Do you have any comments on this proposal?

tycho commented 3 months ago

Seems reasonable. Has the issue been reported on devcom? Could just disable it until a version is released with the fix and limit it to specific compiler versions (or if Microsoft doesn't fix it, just leave it off for 32-bit x86).

mr-c commented 3 months ago

Seems reasonable. Has the issue been reported on devcom?

Ping @Epixu :-)

Could just disable it until a version is released with the fix and limit it to specific compiler versions (or if Microsoft doesn't fix it, just leave it off for 32-bit x86).

I agree with this plan. I don't know when I'll get to it personally, but a PR from anyone would be welcome!

Epixu commented 3 months ago

I will make a minimal repro and post an issue on devcom in a couple of days