Status: Closed (DmitryYudin closed this issue 4 years ago)
Hi @DmitryYudin, thanks for your input, but unfortunately it doesn't match the test results I have for vraddhn, and it gives the same results as my version for the other functions. While in theory I could look up the particular input data that causes the failure for your version, considering my version works fine, it would be better to get the reasons behind your changes first - say, an ARM docs quotation, etc.
> ... get the reasons behind your changes first
The rounding definition is the only reason for this commit. This routine is a special case of such rounding, with a shift value equal to half the data size. Please see the example in my first comment. I'll try to simplify it here:
```
s16 -> s8  <=>  (s16 + 0x80) >> 8
```

so

```
vraddhn_s16(0x80, 0) = (0x80 + 0x80) >> 8 = 0x100 >> 8 = 1
```

You can check that without this fix the result is wrong:

```
vraddhn_s16(0x80, 0) = 0
```
Hi @DmitryYudin. Thanks again for drawing my attention to the vraddhn problem, and for your PR. Also, sorry for the looooong time it took me to get to it. So, right you are: it is my misprint/bug in this family of functions, and your patch is totally right. However, there is one more bug there - one hidden by the initial bug and revealed by your PR (that's why the vraddhn tests failed with it). It is in the vraddhn_s64 function. Could you please do me a favor and fix it as well, so as not to create a separate commit for it? Namely, line 3478 needs to be replaced as below:

```diff
-sum = _mm_add_epi64 (sum, mask1); //actual high half rounding
+sum = _mm_add_epi32 (sum, mask1); //actual high half rounding
```
Then I will merge your updated PR with pleasure.
I amended your PR as described above and merged it, to save your time. Thanks again!
I was just waiting for the weekend to satisfy your request.
Thanks for merging.
ut:

```c
#include <assert.h>
#include <stdint.h>
#include <arm_neon.h> /* or "NEON_2_SSE.h" when targeting x86 */

int main(void) {
    int64_t r[12];
    int8x8_t   a0 = vraddhn_s16(vdupq_n_s16(1UL <<  7), vdupq_n_s16(0)); r[ 0] = vget_lane_s8 (a0, 0);
    int16x4_t  a1 = vraddhn_s32(vdupq_n_s32(1UL << 15), vdupq_n_s32(0)); r[ 1] = vget_lane_s16(a1, 0);
    int32x2_t  a2 = vraddhn_s64(vdupq_n_s64(1UL << 31), vdupq_n_s64(0)); r[ 2] = vget_lane_s32(a2, 0);
    uint8x8_t  a3 = vraddhn_u16(vdupq_n_u16(1UL <<  7), vdupq_n_u16(0)); r[ 3] = vget_lane_u8 (a3, 0);
    uint16x4_t a4 = vraddhn_u32(vdupq_n_u32(1UL << 15), vdupq_n_u32(0)); r[ 4] = vget_lane_u16(a4, 0);
    uint32x2_t a5 = vraddhn_u64(vdupq_n_u64(1UL << 31), vdupq_n_u64(0)); r[ 5] = vget_lane_u32(a5, 0);
    int8x8_t   s0 = vrsubhn_s16(vdupq_n_s16(1UL <<  7), vdupq_n_s16(0)); r[ 6] = vget_lane_s8 (s0, 0);
    int16x4_t  s1 = vrsubhn_s32(vdupq_n_s32(1UL << 15), vdupq_n_s32(0)); r[ 7] = vget_lane_s16(s1, 0);
    int32x2_t  s2 = vrsubhn_s64(vdupq_n_s64(1UL << 31), vdupq_n_s64(0)); r[ 8] = vget_lane_s32(s2, 0);
    uint8x8_t  s3 = vrsubhn_u16(vdupq_n_u16(1UL <<  7), vdupq_n_u16(0)); r[ 9] = vget_lane_u8 (s3, 0);
    uint16x4_t s4 = vrsubhn_u32(vdupq_n_u32(1UL << 15), vdupq_n_u32(0)); r[10] = vget_lane_u16(s4, 0);
    uint32x2_t s5 = vrsubhn_u64(vdupq_n_u64(1UL << 31), vdupq_n_u64(0)); r[11] = vget_lane_u32(s5, 0);
    for (unsigned i = 0; i < 12; i++) {
        assert(r[i] == 1);
    }
    return 0;
}
```