microsoft / DirectXMath

DirectXMath is an all inline SIMD C++ linear algebra library for use in games and graphics apps
https://walbourn.github.io/introducing-directxmath/
MIT License

Permute is inefficient on AMD CPUs #95

Closed: Const-me closed this issue 4 years ago

Const-me commented 4 years ago

TL;DR: the AVX vpermilps instruction is strictly worse than the SSE shufps instruction.

This line https://github.com/microsoft/DirectXMath/blob/master/Inc/DirectXMath.h#L170 slows down XM_PERMUTE_PS by a factor of 3-4 on AMD CPUs when AVX is enabled, with no benefit on Intel.

On recent Intel chips, both shufps and vpermilps have 1 cycle latency and 1 cycle throughput.

On AMD Ryzen, shufps has 1 cycle latency and 0.5 cycles throughput, while vpermilps is 3-4 times slower, with 3 cycles latency and 2 cycles throughput.

In addition, the encoded shufps instruction is 1 byte shorter (5 bytes versus 6), so there's a slight benefit even on Intel.

The only reason to use _mm_permute_ps appears to be Intel's Xeon Phi, where vpermilps can be faster, with twice the throughput, but AFAIK DirectXMath doesn't support that platform?
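
For context, the line in question boils down to a conditional macro. A minimal sketch of the idea, assuming the guard names used in the header (the exact preprocessor condition in DirectXMath.h may differ):

#if defined(_XM_AVX_INTRINSICS_) && !defined(_XM_NO_INTRINSICS_)
// AVX path: compiles to vpermilps
#define XM_PERMUTE_PS( v, c ) _mm_permute_ps( v, c )
#else
// SSE path: compiles to shufps; same lane selection, same result
#define XM_PERMUTE_PS( v, c ) _mm_shuffle_ps( v, v, c )
#endif

Both branches pick the same lanes for a register input, so switching the AVX path back to _mm_shuffle_ps changes only the instruction the compiler emits.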

walbourn commented 4 years ago

Thanks for the detailed information. As with many Intel instruction-set extensions, these instructions are often added on the assumption that future hardware will make them more optimal, but that seems never to have materialized for this particular usage.

Ideally I'd be able to tell at compile time whether /favor:INTEL64 or /favor:AMD64 was being used, but I don't see a predefined value for it (only for /favor:ATOM).

walbourn commented 4 years ago

Resolved in this commit

walbourn commented 4 years ago

@Const-me: Which do you think is faster on AMD CPUs:

__m128 vTemp = _mm_broadcastss_ps(V);   // AVX2: vbroadcastss, splats lane 0

or

__m128 vTemp = _mm_shuffle_ps(V, V, _MM_SHUFFLE(0, 0, 0, 0));   // SSE: shufps, splats lane 0

Const-me commented 4 years ago

Hi @walbourn.

On all recent Intel and AMD CPUs, _mm_broadcastss_ps has latency 3 cycles and throughput 1 cycle. On all recent Intel CPUs, _mm_shuffle_ps has latency 1 cycle and throughput 1 cycle. On new AMD CPUs it's even better: latency 1 cycle, throughput 0.5 cycles, i.e. they can run 2 of them per cycle. _mm_shuffle_ps is strictly faster.

You can find that data in the CHM I've generated, here: https://github.com/Const-me/IntelIntrinsics/ The lines with blue text in the performance table come from intel.com; the lines with black text are from agner.org (an independent academic researcher who has benchmarked many CPUs, including AMD ones).

BTW, there are good use cases for that AVX2 instruction, vbroadcastss. One is the 256-bit version, _mm256_broadcastss_ps in C++. Another is loading from memory, as opposed to broadcasting from a register: _mm_broadcast_ss in C++; shufps can't do that.
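
A minimal sketch of those two cases, using the standard intrinsics from <immintrin.h> (the function names here are only for illustration):

#include <immintrin.h>

// AVX2: broadcast lane 0 of a 128-bit register across all eight lanes of a 256-bit register.
__m256 splat8(__m128 v) { return _mm256_broadcastss_ps(v); }

// AVX: broadcast a float directly from memory; shufps has no memory-source form.
__m128 splat4(const float* p) { return _mm_broadcast_ss(p); }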

walbourn commented 4 years ago

Thanks, @Const-me, that jibes with my understanding. _mm_broadcastss_ps looks more like a 'might as well provide a uniform instruction set' option than something super-useful, which helps me determine whether AVX2-specific paths are worthwhile for most cases.

walbourn commented 4 years ago

Minor update in this commit