AVX-SSE transition penalties

ihhub / penguinV

Computer vision library with focus on heterogeneous systems

AVX-SSE transition penalties #331

Open 0x72D0 opened 5 years ago

0x72D0 commented 5 years ago

Transitions between AVX and SSE code incur penalties. To avoid those penalties, we might want to add _mm256_zeroupper() at the end of every AVX SIMD function.

https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties (3.3. Method 3: Zeroing Registers)
https://www.agner.org/optimize/optimizing_cpp.pdf (p.109 - 12.1)
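
To make the proposal concrete, here is a minimal sketch of what "zeroupper at the end of an AVX function" could look like. The function name sumFloats is hypothetical and not part of penguinV. The AVX path is guarded by the compiler-defined __AVX__ macro (set by e.g. -mavx on GCC/Clang), so the snippet falls back to plain scalar code when AVX is not enabled at build time.

```cpp
#include <cassert>
#include <cstddef>

#if defined( __AVX__ )
#include <immintrin.h>
#endif

// Hypothetical helper for illustration only (not a penguinV function):
// returns the sum of `count` floats starting at `data`.
float sumFloats( const float * data, std::size_t count )
{
    assert( data != nullptr || count == 0 );

    float total = 0.0f;
    std::size_t i = 0;

#if defined( __AVX__ )
    // Accumulate 8 floats per iteration in a 256-bit YMM register.
    __m256 acc = _mm256_setzero_ps();
    for ( ; i + 8 <= count; i += 8 )
        acc = _mm256_add_ps( acc, _mm256_loadu_ps( data + i ) );

    // Spill the 8 partial sums and reduce them in scalar code.
    float lanes[8];
    _mm256_storeu_ps( lanes, acc );
    for ( float lane : lanes )
        total += lane;

    // Clear the upper 128 bits of all YMM registers before leaving the
    // AVX section, so legacy SSE code that runs afterwards does not
    // trigger the AVX-SSE state transition.
    _mm256_zeroupper();
#endif

    // Scalar tail (or the whole array when AVX is not enabled).
    for ( ; i < count; ++i )
        total += data[i];

    return total;
}
```

Calling sumFloats( values, 20 ) on an array holding 1 to 20 should return 210 on both paths. The only line that matters for this issue is the _mm256_zeroupper() call placed after the last 256-bit instruction.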

ihhub commented 5 years ago

We need to consider this. All our functions have both AVX and SSE versions, so ideally we shouldn't face any switching, but the idea is worth discussing.

0x72D0 commented 5 years ago

That's not the case for every function: the Hough transform, for example, doesn't. Also, the compiler sometimes optimizes with SIMD instructions on its own. So if we run the AVX version of Accumulate, for example, and then run a Hough transform in which the compiler has emitted SSE instructions, we pay 10 cycles on each SSE-AVX transition.

Because of this, the hardware saves the contents of the upper 128 bits of the YMM registers when transitioning from 256-bit Intel® AVX to legacy Intel® SSE, and then restores these values when transitioning back from Intel® SSE to Intel® AVX (256-bit or 128-bit). The save and restore operations both cause a penalty that amounts to several tens of clock cycles for each operation.

1. Introduction to AVX-SSE Transition Penalties

ihhub commented 5 years ago

No-no, I understand your concern. What I meant is that, as of now, the functions we have are all implemented in both AVX and SSE, so if we run the code on a CPU with AVX support we should run only AVX code without switching to SSE. I agree regarding SIMD optimisation. But is it worth doing this for the sake of 10 cycles per AVX-to-SSE switch, in functions where we process millions of bytes?

ihhub commented 5 years ago

@0x72D0 we could just modify such code:

#ifdef PENGUINV_AVX_SET
#define AVX_CODE( code )          \
if ( simdType == avx_function ) { \
    code;                         \
    _mm256_zeroupper();           \
    return;                       \
}
#else
#define AVX_CODE( code )
#endif

But what would the penalty be in the multithreaded case?
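
For reference, the macro idea above could be tried out in isolation roughly like this. It is only a sketch under assumptions: SKETCH_AVX_SET, SimdType, avx::Add and Add are made-up stand-ins for the library's own names, and the AVX path is enabled only when the compiler defines __AVX__ (e.g. with -mavx).

```cpp
#include <cstddef>

#if defined( __AVX__ )
#include <immintrin.h>
#define SKETCH_AVX_SET // hypothetical stand-in for PENGUINV_AVX_SET
#endif

// Hypothetical dispatch tag, used only for this sketch.
enum SimdType { cpu_function, avx_function };

#ifdef SKETCH_AVX_SET
// Run the AVX implementation, then clear the upper YMM halves before
// returning, so any SSE code executed later sees a clean state.
#define AVX_CODE( code )          \
if ( simdType == avx_function ) { \
    code;                         \
    _mm256_zeroupper();           \
    return;                       \
}

namespace avx
{
    // AVX implementation: 8 floats per iteration plus a scalar tail.
    void Add( const float * a, const float * b, float * out, std::size_t size )
    {
        std::size_t i = 0;
        for ( ; i + 8 <= size; i += 8 )
            _mm256_storeu_ps( out + i, _mm256_add_ps( _mm256_loadu_ps( a + i ), _mm256_loadu_ps( b + i ) ) );
        for ( ; i < size; ++i )
            out[i] = a[i] + b[i];
    }
}
#else
#define AVX_CODE( code )
#endif

// Public entry point: dispatches to AVX when requested and available,
// otherwise falls through to the plain C++ loop.
void Add( const float * a, const float * b, float * out, std::size_t size, SimdType simdType )
{
    (void)simdType; // unused when the AVX path is compiled out

    AVX_CODE( avx::Add( a, b, out, size ) )

    for ( std::size_t i = 0; i < size; ++i )
        out[i] = a[i] + b[i];
}
```

With this layout the zeroupper runs once per dispatched call rather than once per loop iteration, so whatever it costs is paid once per processed buffer.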

0x72D0 commented 5 years ago

Yeah, maybe with multithreading we would have a significant performance loss. Also, I found this post while searching for VZEROUPPER latencies:

When AVX was introduced with 256-bit vector registers, we were told to use the instruction VZEROUPPER to avoid a severe penalty when switching between VEX and non-VEX code. Four generations of Intel processors had such a penalty (Sandy Bridge, Ivy Bridge, Haswell, and Broadwell). AMD processors and later Intel processors (Skylake and Knights Landing) do not have such a state switch. They have no need for the VZEROUPPER.

https://www.agner.org/optimize/blog/read.php?i=789

So if Skylake is not affected, the state switch penalties might just disappear with time.

ihhub commented 5 years ago

I've added this issue to the WishList: it's not urgent, but it's definitely worth reviewing in the future.