Closed erri120 closed 2 years ago
While AVX support is fairly common these days, early on I was releasing builds and found there are a non-insignificant number of people playing SSE that didn't have AVX compatible hardware. Wanting to support players on older hardware, I added the SIMD switch to the build scripts and released the two different versions. This has caused some confusion however as some people just don't understand what SSE and AVX are (nor can I reasonably expect them to).
Moving forward then, I've been working on a port of the whole project to CommonLibSSE alongside usual new features and fixes. This newer version will just use SSE - it reduces end user confusion and is the most compatible while still being performant enough. As for use of xmmintrin.h, within my own code I only use intrinsics in a single math function using Intel's SVML. I only require load(u)_ps and __m128 for this. The real vectorized stuff is coming from the GLM dependency.
Currently development is on hold until after AE releases, at which point I'll need to reverse-engineer a lot of things over again based on the available information about Bethesda upgrading compiler versions. If they happen to enable AVX code generation then that would be a good reason to merge this at that time, though I imagine such an action by Bethesda would cause a lot of drama. Feel free to keep this open until after the anniversary edition update is released, at which point we'll know for sure what has and hasn't changed.
@mwilsnd fixed the crash while dismounting problem (#54) in the last commit. Some testers of my AVX fork reported this problem and only found out later that this was an issue on the master branch itself.
Feel free to publish a github release under your fork for those encountering this issue - as mentioned in the referenced issue I won't be able to push an official fix yet for the reasons mentioned in my prior reply.
Reviewing changes made to mmath I must mention modifications made to DecomposeToBasis are a performance regression. SIMD aligned vec3s have 4 floats of storage - using SSE register sizes the SVML sincos_ps function will run sincos on 4 packed floats, with 1 float of waste computation for vec3. Moving to AVX registers adds an extra 4 floats of waste computation and also loads 4 floats of data past the end of the given vec3. In practice this will load garbage data from the stack into the second half of the AVX register. As AVX instructions tend to have a higher latency and op-count compared to SSE instructions, there is a negative benefit from this added waste computation.
Really, this is all a micro-optimization on my part and I'm sure the compiler would've generated comparable code. But with the explicit use of intrinsics attention must be paid to cases like this.
closed with 1.6 release
This PR addresses the AVX issues:
The first change was made to the imports: https://github.com/mwilsnd/SkyrimSE-SmoothCam/blob/c6f4ffcc6e8d9258103255c6b77d32669b244c49/SmoothCam/include/pch.h#L9 There are multiple different headers available for working with SSE and AVX, a list can be found here and is discussed in this stackoverflow post:
This PR changes the
xmmintrin.h
include toimmintrin.h
and uses the AVX__m256
functions instead of the SSE ones. I also updated the build scripts to usevectorextensions "AVX"
and removed the_SIMD_MODE
option.AVX is supported by almost every CPU that has been released in the past 10 years (almost as old as Skyrim LE) so anyone playing modded Skyrim SE should have a compatible CPU which is why this PR goes fully AVX.
Here is a debug build with all the changes: SmoothCam.zip