intel / ARM_NEON_2_x86_SSE

The platform independent header allowing to compile any C/C++ code containing ARM NEON intrinsic functions for x86 target systems using SIMD up to AVX2 intrinsic functions

neon performance number question on Android+Nexus-Player #7

Closed ggfan closed 7 years ago

ggfan commented 7 years ago

With this sample, https://github.com/googlesamples/android-ndk/tree/master/hello-neon, when executing on a Nexus Player, the NEON version runs at half the performance of the C version (it is actually slower), while on ARM7 it boosts performance by 2x. Is there something obviously wrong?

caand commented 7 years ago

Hi Gerry,

That is a very bad example. It's using 64-bit registers which are not translated efficiently to SSE.

Also, the loop is very simple and short, so the bottleneck may be the loop overhead itself and/or memory access time.

Regards,

Calin Andrian


Zvictoria commented 7 years ago

Hi Gerry, thanks for asking. While Calin's answer above is correct, there may be another reason. Could you please check whether compiler auto-vectorization is on when compiling the pure C version? If it is (and I assume it is), then your comparison is not fair, because for simple cases the compiler-generated code may be faster. It is also worth looking at your compiler options for the ARM_NEON_2_x86_SSE build; they are probably not optimal. And again, Calin's answer is more than valid here.

regards

Victoria


ggfan commented 7 years ago

Thank you for your quick response! What are the suggested approaches (register size, for example) for Atom + NEON? This sample has been in the NDK for a while; developers mainly want to know how to compile things with the NDK toolchain, and it fulfills that purpose. However, it might be worthwhile to show developers that NEON really buys some performance (otherwise, why bother?). Hence I think it makes sense to look at the performance numbers for both ARM and x86.

I think I realized that @caad has been sending questions about NEON for the NDK. If you could send me more detailed info on how to improve this sample so that x86 matches the performance of the C version (preferably better, matching ARM's NEON), I would like to get it improved. I also understand this would be an extra burden on you since you do not support it anymore, but some info might help.

ggfan commented 7 years ago

Oops, accidentally closed it

Zvictoria commented 7 years ago

Hi ggfan. Let me explain again: the "pure C performance" you are referring to looks like SSE performance (generated under the hood by the compiler). Is its absolute value worse than that of the ARM NEON version? One more important thing to consider: USE_SSE4 needs to be defined before the porting header is included if the compiler doesn't define it automatically (I guess it doesn't). That could speed things up significantly.

While optimizing NEON code to work better on SSE is entirely outside this tool's scope, you may look at the compile-time warnings about serial execution slowdown. If any such instructions are present, you may change the code to avoid them. I've looked into your particular case: if USE_SSE4 is defined and you still see no speedup, you may try replacing the odd sequence of vgetq_lane_s32 calls by storing sum_vec to a memory array directly and then adding its elements.
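A minimal sketch of that last suggestion, written directly with SSE2 intrinsics so it can be compiled without the porting header. The function name `sum_i32` and the `lanes` array are hypothetical, not taken from the hello-neon sample. In NEON source the same final step would be a single `vst1q_s32(lanes, sum_vec)` followed by four scalar additions, instead of four `vgetq_lane_s32` calls.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#endif

/* Sum an int32 array; hypothetical helper for illustration only. */
static int32_t sum_i32(const int32_t *x, size_t n)
{
    int32_t lanes[4] = {0, 0, 0, 0};
    size_t i = 0;
#if defined(__SSE2__)
    /* Keep four partial sums in one 128-bit register. */
    __m128i sum_vec = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4)
        sum_vec = _mm_add_epi32(sum_vec,
                                _mm_loadu_si128((const __m128i *)(x + i)));
    /* One store to memory replaces four per-lane extractions. */
    _mm_storeu_si128((__m128i *)lanes, sum_vec);
#endif
    int32_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < n; ++i)
        total += x[i];
    return total;
}
```

Whether this is actually faster on a given Atom core is something to measure rather than assume.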

caand commented 7 years ago

Thanks, Victoria, we think the same. This is not a magic tool; you have to look under the hood sometimes. But it's very useful: I use it extensively, adding very little SSE-specific code when needed.

Gerry, the example really goes against the SSE philosophy. It reads 64-bit chunks (advised against in SSE). It uses the NEON long (16-to-32) multiply-accumulate, which has no direct or even decent equivalent in SSE. Just look at the code that emulates the long multiply, without accumulate (vmull_s16):

    #ifdef USE_SSE4
        __m128i a16, b16;
        a16 = _MM_CVTEPI16_EPI32 (_pM128i(a)); // SSE 4.1
        b16 = _MM_CVTEPI16_EPI32 (_pM128i(b)); // SSE 4.1
        return _MM_MULLO_EPI32 (a16, b16); // SSE 4.1
    #else
        __m128i low, hi, a128,b128;
        a128 = _pM128i(a);
        b128 = _pM128i(b);
        low =  _mm_mullo_epi16(a128,b128);
        hi =   _mm_mulhi_epi16(a128,b128);
        return _mm_unpacklo_epi16(low,hi);
    #endif
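To make the non-SSE4 branch easier to follow, here is a standalone sketch of the same idea using plain SSE2 types. The function name `widening_mul4` and its array-based signature are hypothetical and chosen only for illustration; the header's own `_pM128i` macro and NEON types are not used.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#endif

/* Multiply four int16 pairs into four int32 products (illustrative helper). */
static void widening_mul4(const int16_t a[4], const int16_t b[4], int32_t out[4])
{
#if defined(__SSE2__)
    /* Load 64 bits (four shorts) into the low half of each register. */
    __m128i a128 = _mm_loadl_epi64((const __m128i *)a);
    __m128i b128 = _mm_loadl_epi64((const __m128i *)b);
    __m128i low  = _mm_mullo_epi16(a128, b128); /* bits 0..15 of each product  */
    __m128i hi   = _mm_mulhi_epi16(a128, b128); /* bits 16..31, signed multiply */
    /* Interleave low and high halves so each 32-bit lane holds a full product. */
    _mm_storeu_si128((__m128i *)out, _mm_unpacklo_epi16(low, hi));
#else
    /* Scalar fallback for targets without SSE2. */
    for (int i = 0; i < 4; ++i)
        out[i] = (int32_t)a[i] * (int32_t)b[i];
#endif
}
```

Two multiplies plus an unpack are needed for what NEON does in one instruction, and the accumulate step would still require a separate add.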

On the other hand, long multiply exists in scalar x86 and, with a CPU that is probably dual-issue, will run very fast. As long as the inputs are shorts and the arithmetic is done on ints, the translated NEON code is not going to work well. An example with the same type for inputs and arithmetic (float?) will give good results. Also pay attention to how you compile the example for ARM without NEON. If you use Thumb you'll miss the long multiply-accumulate opcode, and the toolchain's default optimization is -Os instead of -O2 (quite oddly).
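As an illustration of the "same type for inputs and arithmetic" case, here is a hedged sketch of a float dot product that works on full 128-bit registers. It is written with SSE intrinsics directly, and the name `dot_f32` is hypothetical. In NEON source the loop body would correspond to two `vld1q_f32` loads and a `vmlaq_f32`, which map onto 128-bit SSE operations without any widening step.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE float intrinsics */
#endif

/* Dot product of two float arrays (illustrative helper, not from the sample). */
static float dot_f32(const float *a, const float *b, size_t n)
{
    float lanes[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* four floats, full 128-bit load */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));  /* multiply, then accumulate */
    }
    _mm_storeu_ps(lanes, acc);  /* store once, then reduce in scalar code */
#endif
    float total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    /* Scalar tail (or the whole array when SSE is unavailable). */
    for (; i < n; ++i)
        total += a[i] * b[i];
    return total;
}
```

This is a sketch of the code shape only, not a benchmark; the compiler may auto-vectorize the plain C equivalent just as well.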

By the way (Victoria), my preferred way to load shorts without SSE4 is (written in mixed NEON/SSE slang):

    GCC_INLINE int32x4_t get4shorts(short *in)
    {
        int16x8_t ii = _mm_loadl_epi64((void *)(in));
        return _mm_srai_epi32(_mm_unpacklo_epi16(ii, ii), 16);
    }

Best regards,

Calin Andrian


ggfan commented 7 years ago

Thank you for your help. I will hack on it later, and will also take your input and update the README to reflect this.