[WIP] improve precision of vrsqrte_u32 and vrsqrteq_u32

intel / ARM_NEON_2_x86_SSE

A platform-independent header that allows compiling any C/C++ code containing ARM NEON intrinsic functions for x86 target systems, using SIMD intrinsic functions up to AVX2

Other

431 stars, 150 forks

[WIP] improve precision of vrsqrte_u32 and vrsqrteq_u32 #31

Closed beru closed 4 years ago

beru commented 5 years ago

Hi, this PR improves the accuracy of the vrsqrte_u32 and vrsqrteq_u32 intrinsic functions by using double-precision floating-point SSE2 instructions.

Since there is neither an _mm_rsqrt_pd intrinsic function nor an RSQRTPD instruction (and legacy processors do not support the VRSQRT28PD instruction), I used a combination of the _mm_sqrt_pd and _mm_div_pd intrinsic functions, which might slow things down a little bit. But something tells me, through my tinfoil hat, that the slowdown is negligible and an acceptable trade-off.
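For illustration only, the sqrt-plus-divide combination described above could look roughly like the sketch below. This is not the code from the PR: the helper names `rsqrt_pd_sketch` and `rsqrt2` are hypothetical, and the sketch only shows a double-precision reciprocal square root on two lanes, not the full vrsqrte_u32 fixed-point estimate.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_sqrt_pd, _mm_div_pd */

/* Hypothetical helper (not from the PR): computes 1/sqrt(x) for two
 * doubles using a full-precision square root followed by a division,
 * because SSE2 has no packed-double reciprocal-sqrt estimate. */
static inline __m128d rsqrt_pd_sketch(__m128d x)
{
    const __m128d one = _mm_set1_pd(1.0);
    return _mm_div_pd(one, _mm_sqrt_pd(x));
}

/* Hypothetical array wrapper: unaligned load, compute, unaligned store. */
static void rsqrt2(const double in[2], double out[2])
{
    _mm_storeu_pd(out, rsqrt_pd_sketch(_mm_loadu_pd(in)));
}
```

Both `_mm_sqrt_pd` and `_mm_div_pd` are long-latency operations compared with the single-precision `_mm_rsqrt_ps` estimate, which is where the potential slowdown comes from.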

Zvictoria commented 5 years ago

Beru, thanks again. But... I can't accept it, sorry. Square root and division (and, to some extent, even floating-point rounding) are very expensive (slow) CPU operations. I checked your patch's performance in a simple timing test and found that even for 128 bits it is slower than the serial version from the ARM manual that you posted there initially, and for 64 bits it is significantly slower. So you have the following option: explain or prove to me why you need the exact ARM precision. What kind of algorithm or technology is it used in? If you succeed, I will replace my current implementation in master with one based on your serial code. Or you could live with the existing implementation, which honestly warns about its imprecision. Your move? :)