rikusalminen / threedee-simd

3d math with C and SIMD intrinsics
zlib License

alternate vunit #7

Open aktau opened 10 years ago

aktau commented 10 years ago

While working a bit on my toy language and searching for SIMD tips, I encountered this article: http://webcache.googleusercontent.com/search?q=cache:cMDSJGbFY-MJ:www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/+&cd=3&hl=en&ct=clnk&gl=be

In which it is stated:

__m128 normalize(__m128 m)
{
    __m128 l = _mm_mul_ps(m, m);
    l = _mm_add_ps(l, _mm_shuffle_ps(l, l, 0x4E));
    return _mm_div_ps(m, _mm_sqrt_ps(_mm_add_ps(l,
                                       _mm_shuffle_ps(l, l, 0x11))));
}

The function is really well optimized. It gives the compiler hints about what should be a temporary variable and what can be reused, and takes a total of 7 operations.

The results we expect are a perfect projection of the SSE intrinsics to assembly, using only 3 vectors (original, length and square):

It seems to be an inlined variant of the SSE2 vdot code folded into vunit, with higher accuracy (no rsqrt). Just leaving it here to verify in the future. It would be interesting to compare: I'm reasonably sure the divps would adversely affect the performance of the article's code, but I'd be interested to find out.

Of course, the very best perf could be obtained by processing multiple vectors at once after transforming to SoA form (either on the fly or not): https://software.intel.com/en-us/articles/3d-vector-normalization-using-256-bit-intel-advanced-vector-extensions-intel-avx
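The SoA idea can be sketched as follows (my own illustration of the Intel article's approach, using SSE rather than AVX): transpose four xyzw vectors into SoA form with `_MM_TRANSPOSE4_PS`, normalize all four at once, and transpose back. The horizontal sum disappears entirely, and one `sqrtps`/`divps` serves four vectors:

```c
#include <xmmintrin.h>

/* Normalize four 3D vectors stored as rows v[i] = (x, y, z, w).
 * The w components are carried through unchanged. */
static void normalize4(__m128 v[4])
{
    __m128 x = v[0], y = v[1], z = v[2], w = v[3];
    _MM_TRANSPOSE4_PS(x, y, z, w);     /* AoS -> SoA: x holds the four x's, etc. */
    __m128 len2 = _mm_add_ps(_mm_mul_ps(x, x),
                  _mm_add_ps(_mm_mul_ps(y, y), _mm_mul_ps(z, z)));
    __m128 inv = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(len2));
    x = _mm_mul_ps(x, inv);
    y = _mm_mul_ps(y, inv);
    z = _mm_mul_ps(z, inv);
    _MM_TRANSPOSE4_PS(x, y, z, w);     /* SoA -> AoS */
    v[0] = x; v[1] = y; v[2] = z; v[3] = w;
}
```

The transposes cost something, of course, which is why keeping data in SoA layout permanently (rather than converting on the fly) is the best case.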

rikusalminen commented 10 years ago

Thanks for this suggestion and the links. I have not been working on this project in a long time, but I'll reconsider this when I get back to some SIMD 3d stuff.

I guess some kind of benchmarking could be useful: normalization is not the only function with multiple implementations, and I can't tell which ones are fast and which ones are slow.