Given the following C++ code snippets (`f32x4` is presumably a vector-extension type along the lines of `typedef float f32x4 __attribute__((vector_size(16)));`):
```c++
float simple_dot_product(f32x4 a, f32x4 b) {
return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
f32x4 dot_product_broadcast(f32x4 a, f32x4 b) {
float d = a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
f32x4 r = {d,d,d,d};
return r;
}
float selective_dot_product(f32x4 a, f32x4 b) {
return a[0] * b[0] + a[2] * b[2] + a[3] * b[3];
}
f32x4 selective_dot_product_selective_broadcast(f32x4 a, f32x4 b) {
float d = a[0] * b[0] + a[2] * b[2] + a[3] * b[3];
f32x4 r = {d,d,0,d};
return r;
}
```
Clang/LLVM fails to reduce these to single `dpps` (`DotProductPackedSingles`) instructions when SSE4.2 is enabled; the same might be true for the `double` case.
Godbolt link with hopefully correct targets:
https://godbolt.org/z/od5ezWM19
Note that this might be affected by flags governing floating-point accuracy, such as `-fassociative-math` or `-ffp-contract=*`, since using the dot-product instruction might yield higher accuracy (looking at https://www.felixcloutier.com/x86/dpps, it's a bit unclear whether intermediate rounding is performed or whether this acts as a sort of multiply-add).
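To make the mask semantics concrete, here is a scalar model of `dpps` following the operation described on the linked page (a sketch only; whether the hardware rounds each intermediate product and partial sum to single precision, as this model implies, is exactly the accuracy question raised above):

```cpp
#include <cstdint>

// Scalar model of DPPS xmm1, xmm2, imm8 (one 128-bit lane). Bits 7:4 of
// imm8 select which element-wise products enter the sum; bits 3:0 select
// which destination lanes receive the sum (the rest are zeroed).
void dpps_model(const float a[4], const float b[4], uint8_t imm8, float out[4]) {
    float prod[4];
    for (int i = 0; i < 4; ++i)
        prod[i] = (imm8 & (1u << (4 + i))) ? a[i] * b[i] : 0.0f;
    // Whether these intermediate values are individually rounded to single
    // precision on real hardware is the open question from the text above.
    float sum = (prod[0] + prod[1]) + (prod[2] + prod[3]);
    for (int i = 0; i < 4; ++i)
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

In this model, `dot_product_broadcast` corresponds to `imm8 = 0xFF`, and `selective_dot_product_selective_broadcast` to `imm8 = 0xDB` (products from lanes 0, 2, 3; result stored to lanes 0, 1, 3).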
Also note that pre-multiplying `a` and `b` yields better codegen even without `-ffast-math` or the like, as seen in the linked collection.
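For reference, the pre-multiplied form of the broadcast variant looks roughly like this (a sketch, assuming `f32x4` is a GCC/Clang vector-extension type; the exact definition used in the Godbolt link may differ):

```cpp
// assumed definition of f32x4 (GCC/Clang vector extension)
typedef float f32x4 __attribute__((vector_size(16)));

// Doing the element-wise multiply first yields one packed multiply followed
// by a horizontal sum, which reportedly produces better codegen than the
// fully scalarized form even without -ffast-math.
f32x4 dot_product_broadcast_premul(f32x4 a, f32x4 b) {
    f32x4 m = a * b;                      // packed element-wise multiply
    float d = m[0] + m[1] + m[2] + m[3];  // horizontal sum
    f32x4 r = {d, d, d, d};
    return r;
}
```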