Closed beru closed 4 years ago
Beru, thanks again. But... I can't accept it, sorry. sqrt and division (and, to some extent, even floating-point rounding) are very expensive (slow) CPU operations. I ran your patch through a simple timing test and found that even the 128-bit version is slower than the serial version from the ARM manual that you originally posted, and the 64-bit version is significantly slower. So you have the following options: explain or prove to me why you need exact ARM precision. What kind of algorithm or technology is it used in? If you succeed, I will replace my current implementation in master with one based on your serial code. Or you could live with the existing implementation, which honestly warns about its imprecision. Your move? :)
Hi, this PR improves the accuracy of the `vrsqrte_u32` and `vrsqrteq_u32` intrinsic functions by using double-precision floating-point SSE2 instructions. Since there is no `_mm_rsqrt_pd` intrinsic function or `RSQRTPD` instruction (and legacy processors do not support the `VRSQRT28PD` instruction), I used a combination of the `_mm_sqrt_pd` and `_mm_div_pd` intrinsic functions, which might slow things down a little. But something tells me through my tinfoil hat that the slowdown is negligible and an acceptable compromise.
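For readers who want to see the sqrt-plus-divide combination, here is a minimal sketch. It is not the code from this PR, and it does not reproduce the fixed-point input and output handling of `vrsqrte_u32`. It only shows a double-precision reciprocal square root built from `_mm_sqrt_pd` and `_mm_div_pd`. The helper names `rsqrt_pd_sketch` and `rsqrt2` are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper (not from the PR): computes 1/sqrt(x) for two
 * packed doubles. SSE2 has no packed-double reciprocal square root,
 * so this uses a full-precision square root followed by a division. */
static inline __m128d rsqrt_pd_sketch(__m128d x)
{
    const __m128d one = _mm_set1_pd(1.0);
    return _mm_div_pd(one, _mm_sqrt_pd(x));
}

/* Hypothetical convenience wrapper for plain arrays of two doubles.
 * Unaligned loads and stores are used so callers need no alignment. */
static void rsqrt2(const double in[2], double out[2])
{
    _mm_storeu_pd(out, rsqrt_pd_sketch(_mm_loadu_pd(in)));
}
```

Both `_mm_sqrt_pd` and `_mm_div_pd` are correctly rounded but have long latencies compared with the single-precision `_mm_rsqrt_ps` estimate, and that is the trade-off under discussion here.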