Status: Closed (DmitryYudin closed this issue 4 years ago)
Hi @DmitryYudin, thanks for your input, but unfortunately it doesn't match the test results I have for vraddhn, and it gives the same results as my version for the other functions. While in theory I could look up the particular input data that causes the failure for your version, considering my version works fine, it would be better to get the reasons behind your changes first - say, an ARM docs quotation, etc.
> ... get the reasons behind your changes first
The rounding definition is the only reason for this commit. This routine is a special case of such rounding, with a shift value equal to half the data size. Please see the example in my first comment. I'll try to simplify it here:
```
s16 -> s8  <=>  (s16 + 0x80) >> 8
```

so

```
vraddhn_s16(0x80, 0) = (0x80 + 0x80) >> 8 = 0x100 >> 8 = 1
```

You can check that without this fix the result is wrong:

```
vraddhn_s16(0x80, 0) = 0
```
Hi @DmitryYudin. Thanks again for drawing my attention to the vraddhn problem, and for your PR. Also, sorry for the looooong time it took me to get to it. So, right you are: it is my misprint/bug in this family of functions, and your patch is totally right. However, there is one more bug there - one hidden by the initial bug and revealed by your PR (that's why the vraddhn tests failed with it). It is in the vraddhn_s64 function. Could you please do me a favor and fix it as well, so as not to create a separate commit for it? Namely, line 3478 needs to be replaced as below:

```diff
-sum = _mm_add_epi64 (sum, mask1); //actual high half rounding
+sum = _mm_add_epi32 (sum, mask1); //actual high half rounding
```
Then I will merge your updated PR with pleasure.
I amended your PR as described above and merged it, to save your time. Thanks again!
I was just waiting for the weekend to satisfy your request.
Thanks for merging.
ut:

```c
#include <assert.h>
#include <stdint.h>
#include <arm_neon.h> /* or "NEON_2_SSE.h" when targeting x86 */

int main(void) {
    int64_t r[12];
    int8x8_t   a0 = vraddhn_s16(vdupq_n_s16(1UL <<  7), vdupq_n_s16(0)); r[ 0] = vget_lane_s8 (a0, 0);
    int16x4_t  a1 = vraddhn_s32(vdupq_n_s32(1UL << 15), vdupq_n_s32(0)); r[ 1] = vget_lane_s16(a1, 0);
    int32x2_t  a2 = vraddhn_s64(vdupq_n_s64(1UL << 31), vdupq_n_s64(0)); r[ 2] = vget_lane_s32(a2, 0);
    uint8x8_t  a3 = vraddhn_u16(vdupq_n_u16(1UL <<  7), vdupq_n_u16(0)); r[ 3] = vget_lane_u8 (a3, 0);
    uint16x4_t a4 = vraddhn_u32(vdupq_n_u32(1UL << 15), vdupq_n_u32(0)); r[ 4] = vget_lane_u16(a4, 0);
    uint32x2_t a5 = vraddhn_u64(vdupq_n_u64(1UL << 31), vdupq_n_u64(0)); r[ 5] = vget_lane_u32(a5, 0);
    int8x8_t   s0 = vrsubhn_s16(vdupq_n_s16(1UL <<  7), vdupq_n_s16(0)); r[ 6] = vget_lane_s8 (s0, 0);
    int16x4_t  s1 = vrsubhn_s32(vdupq_n_s32(1UL << 15), vdupq_n_s32(0)); r[ 7] = vget_lane_s16(s1, 0);
    int32x2_t  s2 = vrsubhn_s64(vdupq_n_s64(1UL << 31), vdupq_n_s64(0)); r[ 8] = vget_lane_s32(s2, 0);
    uint8x8_t  s3 = vrsubhn_u16(vdupq_n_u16(1UL <<  7), vdupq_n_u16(0)); r[ 9] = vget_lane_u8 (s3, 0);
    uint16x4_t s4 = vrsubhn_u32(vdupq_n_u32(1UL << 15), vdupq_n_u32(0)); r[10] = vget_lane_u16(s4, 0);
    uint32x2_t s5 = vrsubhn_u64(vdupq_n_u64(1UL << 31), vdupq_n_u64(0)); r[11] = vget_lane_u32(s5, 0);
    for (unsigned i = 0; i < 12; i++) {
        assert(r[i] == 1);
    }
    return 0;
}
```