Closed Const-me closed 4 years ago
Thanks for the detailed information. As with many Intel instruction sets, they are often implemented with the assumption that future hardware will make them more optimal, but it seems that never materialized for this particular usage.
Ideally I'd be able to tell at compile-time if /favor:INTEL64 or /favor:AMD64 was being used, but I don't see a predefined macro for either (only for /favor:ATOM).
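One possible workaround, as a minimal sketch: MSVC documents a predefined `__ATOM__` macro for /favor:ATOM, so only the other /favor values would need to come from the build system. `MYLIB_FAVOR_AMD64` and `MYLIB_FAVOR_INTEL64` below are hypothetical project-defined macros, not anything the compiler or DirectXMath provides.

```cpp
// Sketch only. __ATOM__ is predefined by MSVC when /favor:ATOM is set.
// MYLIB_FAVOR_AMD64 and MYLIB_FAVOR_INTEL64 are hypothetical macros that
// the build would have to pass itself, next to the matching /favor flag.
enum class FavorTarget { Blend, Atom, Amd64, Intel64 };

constexpr FavorTarget favor_target()
{
#if defined(__ATOM__)
    return FavorTarget::Atom;
#elif defined(MYLIB_FAVOR_AMD64)
    return FavorTarget::Amd64;
#elif defined(MYLIB_FAVOR_INTEL64)
    return FavorTarget::Intel64;
#else
    // Nothing known at compile time: treat it as the default tuning.
    return FavorTarget::Blend;
#endif
}
```

The obvious drawback is that the macro and the /favor flag can drift apart, since nothing ties them together.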
Resolved in this commit
@Const-me: Which do you think is faster on AMD CPUs:
__m128 vTemp = _mm_broadcastss_ps(V);
or
__m128 vTemp = _mm_shuffle_ps(V, V, _MM_SHUFFLE(0, 0, 0, 0));
Hi @walbourn.
On all recent Intel and AMD CPUs, _mm_broadcastss_ps has latency 3 cycles, throughput 1 cycle.
On all recent Intel CPUs, _mm_shuffle_ps has latency 1 cycle, throughput 1 cycle. On new AMD CPUs it’s even better: latency 1 cycle, throughput 0.5 cycles, i.e. it can run 2 of them per cycle. _mm_shuffle_ps is strictly faster.
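For reference, a small sketch showing that the two forms compute the same thing, so the choice is purely about speed. The helper names (`splat_x_shuffle`, `splat_x_broadcast`, `check_splat_x`) are made up for illustration, and the AVX2 variant is only compiled when the compiler defines `__AVX2__`.

```cpp
#include <cassert>
#include <immintrin.h>

// Broadcast lane 0 (x) of V into all four lanes with SSE shufps.
inline __m128 splat_x_shuffle(__m128 V)
{
    return _mm_shuffle_ps(V, V, _MM_SHUFFLE(0, 0, 0, 0));
}

#if defined(__AVX2__)
// Same result through AVX2 vbroadcastss with a register source.
inline __m128 splat_x_broadcast(__m128 V)
{
    return _mm_broadcastss_ps(V);
}
#endif

// Usage check: every lane of the result equals the original x lane.
inline void check_splat_x()
{
    float out[4];
    _mm_storeu_ps(out, splat_x_shuffle(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f)));
    assert(out[0] == 1.0f && out[1] == 1.0f && out[2] == 1.0f && out[3] == 1.0f);
#if defined(__AVX2__)
    _mm_storeu_ps(out, splat_x_broadcast(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f)));
    assert(out[0] == 1.0f && out[1] == 1.0f && out[2] == 1.0f && out[3] == 1.0f);
#endif
}
```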
You can find that data in the CHM I’ve generated: https://github.com/Const-me/IntelIntrinsics/ The lines with blue text in the performance table come from intel.com. The lines with black text are from agner.org (an independent academic researcher who benchmarked many CPUs, including AMD ones).
BTW, there’re good use cases for that AVX2 instruction, vbroadcastss. One is the 256-bit version, _mm256_broadcastss_ps in C++. Another one is loading from memory as opposed to broadcasting from a register, _mm_broadcast_ss in C++; shufps can’t do that.
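A rough sketch of those two use cases. The function names are hypothetical, and the guards assume that the memory-operand form only needs AVX (`__AVX__`) while the register-source 256-bit form needs AVX2 (`__AVX2__`). Without AVX the helper falls back to `_mm_load1_ps`.

```cpp
#include <cassert>
#include <immintrin.h>

// Load one float from memory and replicate it into all four lanes.
inline __m128 splat_from_memory(const float* p)
{
#if defined(__AVX__)
    // vbroadcastss with a memory operand: load and broadcast in one go.
    return _mm_broadcast_ss(p);
#else
    // SSE fallback: load the scalar and replicate it.
    return _mm_load1_ps(p);
#endif
}

#if defined(__AVX2__)
// 256-bit version: lane 0 of a 128-bit register into all eight lanes.
inline __m256 splat_x_256(__m128 V)
{
    return _mm256_broadcastss_ps(V);
}
#endif

// Usage check: all four lanes hold the value read from memory.
inline void check_splat_from_memory()
{
    const float value = 2.5f;
    float out[4];
    _mm_storeu_ps(out, splat_from_memory(&value));
    assert(out[0] == 2.5f && out[1] == 2.5f && out[2] == 2.5f && out[3] == 2.5f);
}
```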
Thanks, @Const-me, and that jibes with my understanding. _mm_broadcastss_ps looks more like a 'might as well provide a uniform instruction set' option rather than something super-useful, so that helps me determine whether AVX2-specific paths are worthwhile in most cases.
Minor update in this commit
TLDR: AVX vpermilps instruction is strictly worse than SSE shufps instruction.
This line https://github.com/microsoft/DirectXMath/blob/master/Inc/DirectXMath.h#L170 slows down XM_PERMUTE_PS by a factor of 3-4 on AMD CPUs when AVX is enabled, with no benefits on Intel.
On recent Intel chips, both shufps and vpermilps have 1 cycle latency, 1 cycle throughput.
On AMD Ryzen, shufps has 1 cycle latency, 0.5 cycles throughput, while vpermilps is 3-4 times slower, with latency 3 and throughput 2.
In addition, encoded shufps instruction is 1 byte shorter, 5 versus 6 bytes, i.e. there’s slight benefit even on Intel.
The only reason for _mm_permute_ps appears to be Intel’s Xeon Phi where vpermilps can be faster with twice the throughput, but AFAIK DirectXMath doesn’t support that platform?
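To illustrate the suggestion, here is a minimal sketch of a permute macro that always goes through shufps. `MY_PERMUTE_PS` is a hypothetical stand-in, not the actual DirectXMath definition, and the AVX comparison is only compiled when `__AVX__` is defined.

```cpp
#include <cassert>
#include <immintrin.h>

// Hypothetical stand-in for XM_PERMUTE_PS: shufps with the same register
// as both sources, regardless of whether AVX is enabled.
#define MY_PERMUTE_PS(v, c) _mm_shuffle_ps((v), (v), (c))

// Reverse the lane order: (x, y, z, w) -> (w, z, y, x).
inline __m128 reverse_lanes(__m128 v)
{
    return MY_PERMUTE_PS(v, _MM_SHUFFLE(0, 1, 2, 3));
}

#if defined(__AVX__)
// The AVX form yields the same lanes through vpermilps.
inline __m128 reverse_lanes_avx(__m128 v)
{
    return _mm_permute_ps(v, _MM_SHUFFLE(0, 1, 2, 3));
}
#endif

// Usage check: the shufps-based permute reverses the four lanes.
inline void check_reverse_lanes()
{
    float out[4];
    _mm_storeu_ps(out, reverse_lanes(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f)));
    assert(out[0] == 4.0f && out[1] == 3.0f && out[2] == 2.0f && out[3] == 1.0f);
#if defined(__AVX__)
    float avx[4];
    _mm_storeu_ps(avx, reverse_lanes_avx(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f)));
    assert(avx[0] == out[0] && avx[1] == out[1] && avx[2] == out[2] && avx[3] == out[3]);
#endif
}
```

Because both sources are the same register, the result is identical to the single-source permute, so a swap like this should only affect instruction selection, not behavior.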