mwilsnd / SkyrimSE-SmoothCam

A thirdperson camera mod for Skyrim Special Edition
78 stars 19 forks source link

AVX Fixes #59

Closed erri120 closed 2 years ago

erri120 commented 2 years ago

This PR addresses the AVX issues:

The first change was made to the imports: https://github.com/mwilsnd/SkyrimSE-SmoothCam/blob/c6f4ffcc6e8d9258103255c6b77d32669b244c49/SmoothCam/include/pch.h#L9 There are multiple different headers available for working with SSE and AVX, a list can be found here and is discussed in this stackoverflow post:

+----------------+------------------------------------------------------------------------------------------+
|     Header     |                                         Purpose                                          |
+----------------+------------------------------------------------------------------------------------------+
| x86intrin.h    | Everything, including non-vector x86 instructions like _rdtsc().                         |
| mmintrin.h     | MMX (Pentium MMX!)                                                                       |
| mm3dnow.h      | 3dnow! (K6-2) (deprecated)                                                               |
| xmmintrin.h    | SSE + MMX (Pentium 3, Athlon XP)                                                         |
| emmintrin.h    | SSE2 + SSE + MMX (Pentium 4, Athlon 64)                                                  |
| pmmintrin.h    | SSE3 + SSE2 + SSE + MMX (Pentium 4 Prescott, Athlon 64 San Diego)                        |
| tmmintrin.h    | SSSE3 + SSE3 + SSE2 + SSE + MMX (Core 2, Bulldozer)                                      |
| popcntintrin.h | POPCNT (Nehalem (Core i7), Phenom)                                                       |
| ammintrin.h    | SSE4A + SSE3 + SSE2 + SSE + MMX (AMD-only, starting with Phenom)                         |
| smmintrin.h    | SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Penryn, Bulldozer)                             |
| nmmintrin.h    | SSE4_2 + SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Nehalem (aka Core i7), Bulldozer)     |
| wmmintrin.h    | AES (Core i7 Westmere, Bulldozer)                                                        |
| immintrin.h    | AVX, AVX2, AVX512, all SSE+MMX (except SSE4A and XOP), popcnt, BMI/BMI2, FMA             |
+----------------+------------------------------------------------------------------------------------------+

This PR changes the xmmintrin.h include to immintrin.h and uses the AVX __m256 functions instead of the SSE ones. I also updated the build scripts to use vectorextensions "AVX" and removed the _SIMD_MODE option.

AVX is supported by almost every CPU that has been released in the past 10 years (almost as old as Skyrim LE) so anyone playing modded Skyrim SE should have a compatible CPU which is why this PR goes fully AVX.

Here is a debug build with all the changes: SmoothCam.zip

mwilsnd commented 2 years ago

While AVX support is fairly common these days, early on I was releasing builds and found there are a non-insignificant number of people playing SSE that didn't have AVX compatible hardware. Wanting to support players on older hardware, I added the SIMD switch to the build scripts and released the two different versions. This has caused some confusion however as some people just don't understand what SSE and AVX are (nor can I reasonably expect them to).

Moving forward then, I've been working on a port of the whole project to CommonLibSSE alongside usual new features and fixes. This newer version will just use SSE - it reduces end user confusion and is the most compatible while still being performant enough. As for use of xmmintrin.h, within my own code I only use intrinsics in a single math function using Intel's SVML. I only require load(u)_ps and __m128 for this. The real vectorized stuff is coming from the GLM dependency.

Currently development is on hold until after AE releases, at which point I'll need to reverse-engineer a lot of things over again based on the available information about Bethesda upgrading compiler versions. If they happen to enable AVX code generation then that would be a good reason to merge this at that time, though I imagine such an action by Bethesda would cause a lot of drama. Feel free to keep this open until after the anniversary edition update is released, at which point we'll know for sure what has and hasn't changed.

erri120 commented 2 years ago

@mwilsnd fixed the crash while dismounting problem (#54) in the last commit. Some testers of my AVX fork reported this problem and only found out later that this was an issue on the master branch itself.

mwilsnd commented 2 years ago

Feel free to publish a github release under your fork for those encountering this issue - as mentioned in the referenced issue I won't be able to push an official fix yet for the reasons mentioned in my prior reply.

Reviewing changes made to mmath I must mention modifications made to DecomposeToBasis are a performance regression. SIMD aligned vec3s have 4 floats of storage - using SSE register sizes the SVML sincos_ps function will run sincos on 4 packed floats, with 1 float of waste computation for vec3. Moving to AVX registers adds an extra 4 floats of waste computation and also loads 4 floats of data past the end of the given vec3. In practice this will load garbage data from the stack into the second half of the AVX register. As AVX instructions tend to have a higher latency and op-count compared to SSE instructions, there is a negative benefit from this added waste computation.

Really, this is all a micro-optimization on my part and I'm sure the compiler would've generated comparable code. But with the explicit use of intrinsics attention must be paid to cases like this.

erri120 commented 2 years ago

closed with 1.6 release