rikusalminen / threedee-simd

3d math with C and SIMD intrinsics
zlib License

alternate vunit #7

Open aktau opened 10 years ago

aktau commented 10 years ago

While working a bit on my toy language and searching for SIMD tips, I encountered this article: http://webcache.googleusercontent.com/search?q=cache:cMDSJGbFY-MJ:www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/+&cd=3&hl=en&ct=clnk&gl=be

In which it is stated:

__m128 normalize(__m128 m)
{
    __m128 l = _mm_mul_ps(m, m);
    l = _mm_add_ps(l, _mm_shuffle_ps(l, l, 0x4E));
    return _mm_div_ps(m, _mm_sqrt_ps(_mm_add_ps(l,
                                       _mm_shuffle_ps(l, l, 0x11))));
}

The function is really well optimized. It gives the compiler hints about what should be a temporary variable and what can be reused, and takes a total of 7 operations.

The results we expect are a perfect projection of the SSE intrinsics to assembly, using only 3 vectors (original, length and square):

It seems to be an inlined variant of the SSE2 vdot code folded into vunit, with higher accuracy (no rsqrt). Just leaving it here to verify in the future. It would be interesting to compare: I'm reasonably sure the divps would adversely affect the performance of the article's code, but I'd be interested to find out.

Of course, the very best perf could be obtained by processing multiple vectors at once after transforming to SoA form (either on the fly or not): https://software.intel.com/en-us/articles/3d-vector-normalization-using-256-bit-intel-advanced-vector-extensions-intel-avx
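The SoA idea can be sketched as follows (my own illustration of the Intel article's approach, using SSE rather than AVX): transpose four xyzw vectors into SoA form with `_MM_TRANSPOSE4_PS`, normalize all four at once, and transpose back. The horizontal sum disappears entirely, and one `sqrtps`/`divps` serves four vectors:

```c
#include <xmmintrin.h>

/* Normalize four 3D vectors stored as rows v[i] = (x, y, z, w).
 * The w components are carried through unchanged. */
static void normalize4(__m128 v[4])
{
    __m128 x = v[0], y = v[1], z = v[2], w = v[3];
    _MM_TRANSPOSE4_PS(x, y, z, w);     /* AoS -> SoA: x holds the four x's, etc. */
    __m128 len2 = _mm_add_ps(_mm_mul_ps(x, x),
                  _mm_add_ps(_mm_mul_ps(y, y), _mm_mul_ps(z, z)));
    __m128 inv = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(len2));
    x = _mm_mul_ps(x, inv);
    y = _mm_mul_ps(y, inv);
    z = _mm_mul_ps(z, inv);
    _MM_TRANSPOSE4_PS(x, y, z, w);     /* SoA -> AoS */
    v[0] = x; v[1] = y; v[2] = z; v[3] = w;
}
```

The transposes cost something, of course, which is why keeping data in SoA layout permanently (rather than converting on the fly) is the best case.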

rikusalminen commented 10 years ago

Thanks for this suggestion and the links. I have not been working on this project in a long time, but I'll reconsider this when I get back to some SIMD 3d stuff.

I guess some kind of benchmarking could be useful: normalization is not the only function with multiple implementations, and I can't tell which ones are fast and which ones are slow.