simd-everywhere / simde

Implementations of SIMD instruction sets for systems which don't natively support them.
https://simd-everywhere.github.io/blog/
MIT License
2.32k stars 239 forks source link

Couple of build issues on 32bit MSVC v143 build with /arch:SSE #1111

Open Epixu opened 9 months ago

Epixu commented 9 months ago

Hi again, I'm experiencing build errors with the current master (e651ec360f22d5f9d7fb2c75c5025db614f78b8c) Many issues have been fixed regarding my prior issue, including many of the failing tests involving precision discrepancies. Here's the summary:

All my issues apply to MSVC v143 x86 builds - I have 5 such builds, each with a different /arch:XXX option:

1. The two build issues apply only to MSVC v143 x86 with /arch:SSE enabled

a) The first build issue involves reaching line 4774 here: https://github.com/simd-everywhere/simde/blob/e651ec360f22d5f9d7fb2c75c5025db614f78b8c/simde/x86/sse2.h#L4770-L4774

The line involves GCC syntax, it should be guarded. I can fix this build issue by doing:

simde_mm_pause (void) {
  #if defined(SIMDE_X86_SSE2_NATIVE)
    _mm_pause();
  #elif defined(SIMDE_ARCH_X86)
    #if defined(_MSC_VER)
      _mm_pause();
    #else
      __asm__ __volatile__("pause");
    #endif
  #elif defined(SIMDE_ARCH_ARM_NEON)

But not sure if _mm_pause() won't cause invalid instruction at runtime, if the SIMDE_X86_SSE2_NATIVE isn't available. Is it available on SIMDE_X86_SSE_NATIVE?

b) The second build issue involves SVML

I can circumvent it by simply guard the include of the svml.h - not sure if that is intended though. There should be fallback alternatives in SIMDe? These errors seemingly involve only 'simde__m128d' to '__m128d' conversion issues.

2. The test issues apply to all /arch options on MSVC v143 x86 builds:

They seemingly affect only pow functions precision, but my tests aren't exhaustive, so not sure.

Epixu commented 9 months ago

I was able to fix error 1.b by changing this check, which reenables native SVML for MSVC, while ignoring potential /arch:SSE: https://github.com/simd-everywhere/simde/blob/e651ec360f22d5f9d7fb2c75c5025db614f78b8c/simde/simde-features.h#L295-L299 to

#if !defined(SIMDE_X86_SVML_NATIVE) && !defined(SIMDE_X86_SVML_NO_NATIVE) && !defined(SIMDE_NO_NATIVE)
  #if defined(SIMDE_ARCH_X86) && (defined(__INTEL_COMPILER) || (HEDLEY_MSVC_VERSION_CHECK(14, 20, 0) && !defined(__clang__) && defined(SIMDE_X86_SSE2_NATIVE)))
    #define SIMDE_X86_SVML_NATIVE
  #endif
#endif

Only the __m128d (because it comes with SSE2?) is troublesome though. It probably should provide the single precision SVML functions either way, so I suppose this is not the optimal solution

mr-c commented 5 months ago

Hello @Epixu ; do you still have issues with 32bit MSVC builds using the latest SIMDe: v0.8.0?

Epixu commented 5 months ago

Hi, I will bump the version and check it in a couple of days, thanks for the notification!

Epixu commented 5 months ago

No change for this issue with the current master (517da845ac83098d06c5b0c9d8705ac8eb32a5ba) Both the build and test issues remain the same

mr-c commented 5 months ago

But not sure if _mm_pause() won't cause invalid instruction at runtime, if the SIMDE_X86_SSE2_NATIVE isn't available. Is it available on SIMDE_X86_SSE_NATIVE?

According to Intel, _mm_pause is SSE2 only: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_paus&ig_expand=4897 and Microsoft agrees: https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list?view=msvc-170

So a guard can be added to not use the GCC asm syntax on MSVC

mr-c commented 5 months ago

@Epixu can you test __asm pause; as an MSVC-friendly alternative to __asm__ __volatile__("pause");?

mr-c commented 5 months ago

With https://github.com/simd-everywhere/simde/commit/2d498c864512605cef62e2d21bec3e24091ef427

I'm seeing the following SIMDe x86 test errors on 32-bit MSVC:

 473/2002 x86/sse/native/c                        TIMEOUT        30.24s   exit status 1
../test/x86/sse.c:2094: assertion failed: r[0] ~= test_vec[i].r[0] (-836.000000 ~= -836.000000)
../test/x86/sse.c:2167: assertion failed: r[0] ~= test_vec[i].r[0] (-172.000000 ~= -172.000000)
../test/x86/sse.c:2246: assertion failed: r[0] ~= test_vec[i].r[0] (25075.000000 ~= 25075.000000)
../test/x86/sse.c:2287: assertion failed: r[0] ~= test_vec[i].r[0] (-200.000000 ~= -200.000000)
../test/x86/sse.c:2333: assertion failed: r[0] ~= e[0] (46384.000000 ~= 46384.000000)
../test/x86/sse.c:2375: assertion failed: r[0] ~= test_vec[i].r[0] (113.000000 ~= 113.000000)
../test/x86/sse.c:2522: assertion failed: r[0] ~= simde_mm_loadu_ps(test_vec[i].r)[0] (50326.000000 ~= 50326.000000)
../test/x86/sse.c:2555: assertion failed: r[0] ~= simde_mm_loadu_ps(test_vec[i].r)[0] (101.000000 ~= 101.000000)
../test/x86/sse.c:2899: assertion failed: r[0] ~= test_vec[i].r[0] (-3.607422 ~= -3.610000)
../test/x86/sse.c:4429: assertion failed: r[0] ~= simde_mm_loadu_ps(test_vec[i].r)[0] (-609.119995 ~= -609.119995)

Which is pretty weird, the numbers look exactly the same to me except for line 2899; something about the float comparator isn't working?

SSE2, no timeout, so we get a list of the failing function tests:

|  475/2002 sse2/mm_setr_pd                       FAIL          
|  475/2002 sse2/x_mm_abs_pd                      FAIL          
|  475/2002 sse2/mm_store_pd                      FAIL          
|  475/2002 sse2/mm_min_pd                        FAIL          
|  475/2002 sse2/mm_cmpord_pd                     FAIL          
|  475/2002 sse2/mm_cvtpd_ps                      FAIL          
|  475/2002 sse2/mm_cvtpi32_pd                    FAIL          
|  475/2002 sse2/mm_div_pd                        FAIL          
|  475/2002 sse2/mm_or_pd                         FAIL   
../test/x86/sse2.c:7132: assertion failed: r[0] ~= test_vec[i].r[0] (0.740000 ~= 0.740000)
../test/x86/sse2.c:59: assertion failed: r[0] ~= simde_mm_loadu_pd(test_vec[i].r)[0] (147.280000 ~= 147.280000)
../test/x86/sse2.c:8244: assertion failed: r[0] ~= test_vec[i].r[0] (176.750000 ~= 176.750000)
../test/x86/sse2.c:5020: assertion failed: r[0] ~= test_vec[i].r[0] (-514.320000 ~= -514.320000)
../test/x86/sse2.c:2558: assertion failed: r[1] ~= simde_mm_loadu_pd(test_vec[i].r)[1] (0.000000 ~= 0.000000)
../test/x86/sse2.c:3362: assertion failed: r[0] ~= test_vec[i].r[0] (689.409973 ~= 689.409973)
../test/x86/sse2.c:3394: assertion failed: r[0] ~= test_vec[i].r[0] (-579.000000 ~= -579.000000)
../test/x86/sse2.c:4152: assertion failed: r[0] ~= test_vec[i].r[0] (1.158700 ~= 1.160000)
../test/x86/sse2.c:5919: assertion failed: r[0] ~= test_vec[i].r[0] (128062.238280 ~= 128062.240000)

Again, all visually identical, except for one (line 4152); yet still assertion failed

Epixu commented 5 months ago

Does ~= stand for some approximate comparison based on ULPs? Could these look the same due to bad print? Maybe if printed in scientific notation would be better?

I will try __asm pause in a couple of days

mr-c commented 5 months ago

Does ~= stand for some approximate comparison based on ULPs? Could these look the same due to bad print? Maybe if printed in scientific notation would be better?

Could be, I didn't write that code and it feels well outside my area of expertise: https://github.com/simd-everywhere/simde/blob/8639feff3843dc02c8f5e2e34a2ba865d3acb546/test/test.h#L729-L748 https://github.com/simd-everywhere/simde/blob/8639feff3843dc02c8f5e2e34a2ba865d3acb546/test/test.h#L866-L867

I will try __asm pause in a couple of days

Thanks!

Epixu commented 5 months ago

The single precision comparison function is not ULPs based, it uses a precision range, which if zero - does a perfect compare. The failing functions however seem to involve double precision, so some points of failure come to mind: is the proper double-precision compare used, and is precision specified properly to not be zero, because internal CPU float representation is always assumed to be a black box (while residing in the registers, they're usually with additional bits of precision, which get truncated later) - can be anything on different CPU architectures.

Also, it would probably be better to use (%e ~= %e) to see if something changes at the back of the float

mr-c commented 5 months ago

Thanks for the explanation @Epixu , here are the same test failures using hexadecimal exponent notation. All affected tests do the float comparison using a precision of 1 == a slop of 10^-1 (which is rather generous, if you ask me).

While all of these affected tests use simde_test_x86_assert_equal_f32x{2,4} or simde_assert_equal_vf32 there are other tests that use those comparators which don't fail.

../test/x86/sse.c:2094: assertion failed: r[0] ~= test_vec[i].r[0] (-0x1.a200000000000p+9 ~= -0x1.a200000000000p+9)
../test/x86/sse.c:2167: assertion failed: r[0] ~= test_vec[i].r[0] (-0x1.5800000000000p+7 ~= -0x1.5800000000000p+7)
../test/x86/sse.c:2246: assertion failed: r[0] ~= test_vec[i].r[0] (0x1.87cc000000000p+14 ~= 0x1.87cc000000000p+14)
../test/x86/sse.c:2287: assertion failed: r[0] ~= test_vec[i].r[0] (-0x1.9000000000000p+7 ~= -0x1.9000000000000p+7)
../test/x86/sse.c:2333: assertion failed: r[0] ~= e[0] (0x1.6a60000000000p+15 ~= 0x1.6a60000000000p+15)
../test/x86/sse.c:2375: assertion failed: r[0] ~= test_vec[i].r[0] (0x1.c400000000000p+6 ~= 0x1.c400000000000p+6)
../test/x86/sse.c:2522: assertion failed: r[0] ~= simde_mm_loadu_ps(test_vec[i].r)[0] (0x1.892c000000000p+15 ~= 0x1.892c000000000p+15)
../test/x86/sse.c:2555: assertion failed: r[0] ~= simde_mm_loadu_ps(test_vec[i].r)[0] (0x1.9400000000000p+6 ~= 0x1.9400000000000p+6)
../test/x86/sse.c:2899: assertion failed: r[0] ~= test_vec[i].r[0] (-0x1.cdc0020000000p+1 ~= -0x1.ce147a0000000p+1)
../test/x86/sse.c:4429: assertion failed: r[0] ~= simde_mm_loadu_ps(test_vec[i].r)[0] (-0x1.308f5c0000000p+9 ~= -0x1.308f5c0000000p+9)
../test/x86/sse.c:4982: assertion failed: r[0] ~= test_vec[i].r[0] (0x1.b80f5c0000000p+7 ~= 0x1.b80f5c0000000p+7)
../test/x86/sse.c:5015: assertion failed: r[0] ~= test_vec[i].r[0] (0x1.8c00000000000p+9 ~= 0x1.8c00000000000p+9)
../test/x86/sse.c:5048: assertion failed: r[0] ~= test_vec[i].r[0] (0x1.62fc280000000p+9 ~= 0x1.62fc280000000p+9)

https://ci.appveyor.com/project/nemequ/simde/build/job/uhkyqxm6fogyagv1?fullLog=true

Epixu commented 5 months ago

This is super weird. Is it possible that the error somehow happens at the compare, but is fine by the time data is printed?

Also, just to note, this wouldn't be the first time I've encountered nasty generated assembly bugs from MSVC. To give an example, I've even documented one involving AVX here: https://github.com/Epixu/msvc_but_repro

I guess our last beacon of hope is to compare some healthy ASM with what the compiler generates...

mr-c commented 5 months ago

I guess our last beacon of hope is to compare some healthy ASM with what the compiler generates...

Agreed, but that it outside of my abilities and time availability :-(

Epixu commented 5 months ago

I will investigate more when I too get the time, Still, thanks a lot for narrowing it down

mr-c commented 5 months ago

This is super weird. Is it possible that the error somehow happens at the compare, but is fine by the time data is printed?

In https://github.com/simd-everywhere/simde/commit/fc52a85544d005b4c2973fdeba1e23b932aea343 I tried crafting the test data for a single test differently, but I got the same error: ./test/x86/sse.c:2095: assertion failed: r[0] ~= simde_mm_loadu_ps(test_vec[i].r)[0] (-0x1.a200000000000p+9 ~= -0x1.a200000000000p+9)

Interestingly, the test SSE failures only are with the "native" code paths, but not the emulated one. The SSE2 failing tests are for both (this is for a build without /arch:SSE or similar)

https://ci.appveyor.com/project/nemequ/simde/build/job/u3cn62vsdvvua23o?fullLog=true

I will investigate more when I too get the time

Thanks!

Still, thanks a lot for narrowing it down

You are welcome! I think there are several fixes from this branch worth merging already, so I'm glad for that.

mr-c commented 5 months ago

One more question for you: are these issues a regression from a previous SIMDe release / commit? Or have they always been present from what you have seen?

Epixu commented 5 months ago

Honestly, I have no idea. It seems that the issues emerged in later version after introduction of more features, and the only reason the older version still works with me, is that the features are simply not available. So probably a regression. I can't really be sure, though.

I still stick to one particular commit (5e7c4d4ec10855ebe8f19af8a80fdfa26caad5e9), because it passes all my tests and lets me continue work on more important stuff

Epixu commented 1 month ago

A bit of an update:

  1. I dropped SSE1 support in my project, so half of this issue is irrelevant now (for me at least)
  2. I investigated the pow precision problem. It seems that the MSVC /fp option has huge effect in exactly those places. /fp:precise gives the best results (still wrong, but best in terms of least errors), while any other option screws completely with the tests in those exact places. Considering, that pow functions always involve loops and accumulation of errors, I think we can narrow it down to something involving the /fp option. MSVC has float_control pragmas, that could be used to experiment with on MSVC compilers to work around the issue, probably.
mr-c commented 1 month ago

Thank you for the update @Epixu !

Since the old logs are missing, can you remind me which pow related test are failing on 32-bit MSVC in your setup?

Just tests that use simde_math_pow{,f} or something more complicated?

I still stick to one particular commit (5e7c4d4), because it passes all my tests and lets me continue work on more important stuff

One thing that jumps out at me from looking at the changes since this commit: The use of pow() instead of simde_math_pow() in simde/arm/neon/cvt_n.h: https://github.com/simd-everywhere/simde/commit/2eedece7a#diff-53bfc402d4ade1eaf3147f23394e47554f52396a953fc086ffc731fa3b2d481aR91

mr-c commented 1 month ago

@Epixu Maybe https://github.com/simd-everywhere/simde/pull/1192 will help ?

Epixu commented 1 month ago

Here are the failing tests, built with the current SIMDe master, minus the SSE1 builds: https://github.com/Langulus/SIMD/actions/runs/9920771659

The failing builds are:

  1. windows-latest-v143-Release-x86-NoSIMD (uses just std::pow)
  2. windows-latest-v143-Release-x86-AVX
  3. windows-latest-v143-Release-x86-AVX2

The AVX2 build ends up executing svml_dpo4_avx2.asm, which looks like this:

--- D:\a\_work\1\s\src\vctools\crt\helperlibs\vmath\src\i386\svml_dpow4_avx2.asm 
00F8C1E0  push        ebp  
00F8C1E1  mov         ebp,esp  
00F8C1E3  and         esp,0FFFFFFC0h  
00F8C1E6  sub         esp,100h  
00F8C1EC  vmovdqa     ymm3,ymm0  
00F8C1F0  vandpd      ymm4,ymm3,ymmword ptr ds:[12F1A80h]  
00F8C1F8  vorpd       ymm2,ymm4,ymmword ptr ds:[12F1AC0h]  
00F8C200  vcvtpd2ps   xmm5,ymm2  
00F8C204  vmovupd     ymm7,ymmword ptr ds:[12F1B00h]  
00F8C20C  vmovupd     ymmword ptr [esp+20h],ymm1  
00F8C212  vmovupd     ymmword ptr [esp],ymm3  
00F8C217  vrcpps      xmm6,xmm5  
00F8C21B  vpsrlq      ymm1,ymm3,34h  
00F8C220  vcvtps2pd   ymm0,xmm6  
00F8C224  vpaddq      ymm0,ymm0,ymmword ptr ds:[12F1BC0h]  
00F8C22C  vorpd       ymm4,ymm1,ymm7  
00F8C230  vsubpd      ymm5,ymm4,ymm7  
00F8C234  vcmplt_oqpd ymm4,ymm3,ymmword ptr ds:[12F1B40h]  
00F8C23D  vpand       ymm7,ymm0,ymmword ptr ds:[12F1C00h]  
00F8C245  vmovupd     ymmword ptr [esp+40h],ymm5  
00F8C24B  vcmpnle_uqpd ymm5,ymm3,ymmword ptr ds:[12F1B80h]  
00F8C254  vpsrlq      ymm3,ymm7,27h  
00F8C259  vfmsub213pd ymm2,ymm7,ymmword ptr ds:[12F1AC0h]  
00F8C262  vmovupd     ymm7,ymmword ptr ds:[12F1C40h]  
00F8C26A  vorpd       ymm4,ymm4,ymm5  
00F8C26E  vmovupd     ymm5,ymmword ptr ds:[12F1CC0h]  
00F8C276  vfmadd213pd ymm7,ymm2,ymmword ptr ds:[12F1C80h]  
00F8C27F  vmovupd     ymmword ptr [esp+60h],ymm4  
00F8C285  vfmadd213pd ymm5,ymm2,ymmword ptr ds:[12F1D00h]  
00F8C28E  vextracti128 xmm6,ymm3,1  
00F8C294  vmovd       eax,xmm3  
00F8C298  vmovd       edx,xmm6  
00F8C29C  vmovupd     xmm1,xmmword ptr [eax+0AF3240h]  
00F8C2A4  vpextrd     ecx,xmm3,2  
00F8C2AA  vpextrd     eax,xmm6,2  
00F8C2B0  vmovupd     xmm0,xmmword ptr [edx+0AF3240h]  
00F8C2B8  vmovhpd     xmm3,xmm1,qword ptr [ecx+0AF3240h]  
00F8C2C0  vunpckhpd   xmm1,xmm1,xmmword ptr [ecx+0AF3240h]  
00F8C2C8  vmovhpd     xmm6,xmm0,qword ptr [eax+0AF3240h]  
00F8C2D0  vunpckhpd   xmm0,xmm0,xmmword ptr [eax+0AF3240h]  
00F8C2D8  vinsertf128 ymm6,ymm3,xmm6,1  
00F8C2DE  vinsertf128 ymm3,ymm1,xmm0,1  
00F8C2E4  vmulpd      ymm1,ymm2,ymm2  
00F8C2E8  vfmadd231pd ymm3,ymm2,ymmword ptr ds:[12F1D40h]  
00F8C2F1  vaddpd      ymm4,ymm6,ymmword ptr [esp+40h]  
00F8C2F7  vfmadd213pd ymm7,ymm1,ymm5  
00F8C2FC  vmovupd     ymm5,ymmword ptr ds:[12F1D80h]  
00F8C304  vmovupd     ymm6,ymmword ptr [esp+20h]  
00F8C30A  vfmadd213pd ymm7,ymm2,ymmword ptr ds:[12F1DC0h]  
00F8C313  vmovdqa     ymm0,ymm5  
00F8C317  vfmadd213pd ymm0,ymm2,ymm4  
00F8C31C  vmovupd     ymmword ptr [esp+80h],ymm7  
00F8C325  vsubpd      ymm4,ymm0,ymm4  
00F8C329  vmulpd      ymm7,ymm6,ymm0  
00F8C32D  vfmsub231pd ymm4,ymm2,ymm5  
00F8C332  vmovupd     ymm5,ymmword ptr ds:[12F1E80h]  
00F8C33A  vfmsub213pd ymm0,ymm6,ymm7  
00F8C33F  vaddpd      ymm4,ymm3,ymm4  
00F8C343  vandpd      ymm2,ymm7,ymmword ptr ds:[12F1E00h]  
00F8C34B  vcmpnle_uqpd ymm2,ymm2,ymmword ptr ds:[12F1E40h]  
00F8C354  vfmadd132pd ymm1,ymm4,ymmword ptr [esp+80h]  
00F8C35E  vorpd       ymm3,ymm2,ymmword ptr [esp+60h]  
00F8C364  vmovmskpd   eax,ymm3  
00F8C368  vmovapd     ymm3,ymm6  
00F8C36C  test        eax,eax  
00F8C36E  vfmadd213pd ymm3,ymm1,ymm7  
00F8C373  vsubpd      ymm4,ymm3,ymm7  
00F8C377  vaddpd      ymm2,ymm5,ymm3  
00F8C37B  vfmsub213pd ymm1,ymm6,ymm4  
00F8C380  vsubpd      ymm7,ymm2,ymm5  
00F8C384  vpsllq      ymm2,ymm2,3  
00F8C389  vaddpd      ymm0,ymm0,ymm1  
00F8C38D  vsubpd      ymm7,ymm3,ymm7  
00F8C391  vmovupd     ymm1,ymmword ptr ds:[12F1F00h]  
00F8C399  vaddpd      ymm3,ymm7,ymm0  
00F8C39D  vfmadd213pd ymm1,ymm7,ymmword ptr ds:[12F1F40h]  
00F8C3A6  vandpd      ymm4,ymm2,ymmword ptr ds:[12F1EC0h]  
00F8C3AE  vmovupd     ymmword ptr [esp+0A0h],ymm4  
00F8C3B7  vfmadd213pd ymm1,ymm3,ymmword ptr ds:[12F1F80h]  
00F8C3C0  vfmadd213pd ymm1,ymm3,ymmword ptr ds:[12F1FC0h]  
00F8C3C9  vxorpd      ymm2,ymm2,ymmword ptr [esp+0A0h]  
00F8C3D2  vmovd       edx,xmm4  
00F8C3D6  vpextrd     ecx,xmm4,2  
00F8C3DC  vextracti128 xmm4,ymm4,1  
00F8C3E2  vmovsd      xmm6,qword ptr [edx+12F1280h]  
00F8C3EA  vmovd       edx,xmm4  
00F8C3EE  vmovhpd     xmm5,xmm6,qword ptr [ecx+12F1280h]  
00F8C3F6  vpextrd     ecx,xmm4,2  
00F8C3FC  vmovsd      xmm6,qword ptr [edx+12F1280h]  
00F8C404  vmovhpd     xmm4,xmm6,qword ptr [ecx+12F1280h]  
00F8C40C  vinsertf128 ymm4,ymm5,xmm4,1  
00F8C412  vmulpd      ymm0,ymm4,ymm3  
00F8C416  vpsllq      ymm5,ymm2,29h  
00F8C41B  vfmadd213pd ymm1,ymm0,ymm4  
00F8C420  vmulpd      ymm0,ymm5,ymm1  
00F8C424  jne         ___avx2_pow4@@64+24Eh (0F8C42Eh)  
00F8C42A  mov         esp,ebp  
00F8C42C  pop         ebp  
00F8C42D  ret  
mr-c commented 1 month ago

Okay, so the function that is failing on 32-bit MSVC non-sse2 is https://github.com/Langulus/SIMD/blob/a7cf2a9799dbc55b5aec6d9b9f20b1bfbaa187e3/source/Pow.hpp#L33 , yes?

Epixu commented 1 month ago

The wip branch file is more up-to-date - I forgot to mention, that the CI ran on that branch: https://github.com/Langulus/SIMD/blob/wip/source/binary/Pow.hpp

The non-simd version just falls back to std::pow for real numbers.

The failing AVX2 case should be that one.

mr-c commented 1 month ago

The failing AVX2 case should be that one.

Ah, a SVML function. Okay, then this is probably the source of your problem:

https://github.com/simd-everywhere/simde/commit/593af95cd9f43a96f66a582a9ae32dfe5dfd8eaa#diff-b2776e35f1b43a42e8f31d4d30815dd8d2fd869b0c6dc2ef85f1de31ab927e4dR296-R297

I bet disabling SVML on 32-bit MSVC x86 fixes things for you. Try adding -DSIMDE_X86_SVML_NO_NATIVE on your builds to confirm?

Epixu commented 1 month ago

I bet disabling SVML on 32-bit MSVC x86 fixes things for you. Try adding -DSIMDE_X86_SVML_NO_NATIVE on your builds to confirm?

Will try it now

mr-c commented 1 month ago

@Epixu If that works, we will need to decide between not automatically enabling native SVML with 32-bit MSVC, or only selectively skipping it for certain functions. It would be good if you reported this to MS and we got a bug number so that we can disable the workaround for unaffected (future) versions of their compiler

Epixu commented 1 month ago

All tests passed 🎉

@Epixu If that works, we will need to decide between not automatically enabling native SVML with 32-bit MSVC, or only selectively skipping it for certain functions. It would be good if you reported this to MS and we got a bug number so that we can disable the workaround for unaffected (future) versions of their compiler

What should I write in the bug? That their SVML implementation is giving wrong results, probably a rounding error somewhere?

Also, why is my NoSIMD build affected and test passing, too? That one seems to be a problem on my side. I've probably not disabled SIMD entirely on that build, so it gets compiled down to svml_dpo4_avx2.asm again?

mr-c commented 1 month ago

What should I write in the bug? That their SVML implementation is giving wrong results, probably a rounding error somewhere?

Yes. Needs fixing or they can clarify in their documentation the expected errors/variation based upon architecture / CPU feature

Also, why is my NoSIMD build affected and test passing, too? That one seems to be a problem on my side. I've probably not disabled SIMD entirely on that build, so it gets compiled down to svml_dpo4_avx2.asm again?

Looking at https://github.com/Langulus/SIMD/blob/a7cf2a9799dbc55b5aec6d9b9f20b1bfbaa187e3/.github/workflows/ci.yml#L17 and https://learn.microsoft.com/en-us/cpp/build/reference/arch-x86?view=msvc-170

Then we see that no /arch means /arch:SSE2 on x86 systems, which is confusing as your explicit /arch:SSE2 build passed! Maybe try /arch:IA32 for your NoSIMD build?

mr-c commented 1 month ago

All tests passed 🎉

I missed that on my first read-through. Huzzah!

So, do you think we should:

A. disable SVML auto-on for MSVC 32-bit entirely (but still can be forced on with -DSIMDE_X86_SVML_NATIVE)

or

B. disable SVML native + MSVC 32-bit only for certain functions (with no way to force it on, unless we invent some new convention)?

Epixu commented 1 month ago

A. disable SVML auto-on for MSVC 32-bit entirely (but still can be forced on with -DSIMDE_X86_SVML_NATIVE)

Can't really say for sure - my project's scope is considerably smaller than yours. But considering that probably a niche selection of people actually attempt these builds these days, I suppose disabling SVML native for x86 MSVC builds wouldn't be that bad in the grand scheme of things. It is more likely that it will save you a couple of issue reports in the long run, and unlikely to cause any real harm, but that's just my subjective opinion.

B. disable SVML native + MSVC 32-bit only for certain functions (with no way to force it on, unless we invent some new convention)?

That seems to be an overkill and would probably involve a lot of investigation of which functions are actually botched. My tests aren't exhaustive. Don't know about yours. The first option seems to save you the trouble outright. At least for now...

Epixu commented 1 month ago

Then we see that no /arch means /arch:SSE2 on x86 systems, which is confusing as your explicit /arch:SSE2 build passed! Maybe try /arch:IA32 for your NoSIMD build?

I followed your advice and something weird happens. minwindef.h gets included for some reason, and it is full of crappy microsoft-style macro definitions that entirely break my code. Is this file included by SIMDe, or is it a MSVC compiler thing?

This happens on MSVC and clang-cl builds. Nothing changed, except /arch:IA32

Edit: It seems to be included here: https://github.com/simd-everywhere/simde/blob/32c959c1039a15cec465bfa1103b4bcf34b3ac3a/simde/x86/sse.h#L36-L39

Probably required only for MemoryBarrier functionality? Is there any way this to be moved to a cpp file instead (or left to the user to take care of), so that the namespace never gets polluted with Windows macros?

Epixu commented 1 month ago

When I remove the windows include and drop MemoryBarrier() from simde_mm_sfence, it seems to work just fine? Is this include really necessary - maybe for pre c++11 compilers? If it can be avoided, I highly recommend never do any windows includes, ever! Your users will thank you forever. :) It would be best to simply extract the required functionality from the windows libraries, instead of adding the entire emballage with it.

But I have the nagging feeling we're getting sidetracked with that thing, should I open a new issue and clean this one up?

mr-c commented 1 month ago

If it can be avoided, I highly recommend never do any windows includes, ever! Your users will thank you forever. :)

Interesting!

Here's the commit that originally added that include: https://github.com/simd-everywhere/simde/commit/398dc699c5888e925a85ccef5735dc0cf4e0fb81

mr-c commented 1 month ago

I made an issue for that, yes: https://github.com/simd-everywhere/simde/issues/1193

Epixu commented 1 month ago

I have isolated the error here with the infrastructure for experimenting with it. Turns out it happens only with large numbers - my original tests used randomized numbers, so I cherry-picked one value that reproduces the error. Here's the test code, no longer using SIMDe layer (it is used only for detection of available native intrinsics).

I'll soon write the issue on devcom

Epixu commented 3 weeks ago

All the mentioned issues have been fixed on my side, and I'm happy to announce that I'm currently using SIMDe's latest master. I'll leave you to decide whether to close this issue or not, since I still have the SVML bug to report (which will happen soon).

Thanks for the support!