Given the following C++ code snippets (`f32x4` is presumably a vector-extension type along the lines of `typedef float f32x4 __attribute__((vector_size(16)));`):
```c++
float simple_dot_product(f32x4 a, f32x4 b) {
return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
f32x4 dot_product_broadcast(f32x4 a, f32x4 b) {
float d = a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
f32x4 r = {d,d,d,d};
return r;
}
float selective_dot_product(f32x4 a, f32x4 b) {
return a[0] * b[0] + a[2] * b[2] + a[3] * b[3];
}
f32x4 selective_dot_product_selective_broadcast(f32x4 a, f32x4 b) {
float d = a[0] * b[0] + a[2] * b[2] + a[3] * b[3];
f32x4 r = {d,d,0,d};
return r;
}
```
Clang/LLVM fails to reduce these to single `dpps` (`DotProductPackedSingles`) instructions when SSE4.2 is enabled; the same might be true for the `double` case.
Godbolt link with hopefully correct targets:
https://godbolt.org/z/od5ezWM19
Note that this might be affected by flags governing floating-point accuracy, such as `-fassociative-math` or `-ffp-contract=*`, since using the dot-product instruction might yield higher accuracy (looking at https://www.felixcloutier.com/x86/dpps, it's a bit unclear whether intermediate rounding is performed or whether this acts as a sort of multiply-add).
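To make the mask semantics concrete, here is a scalar model of `dpps` following the operation described on the linked page (a sketch only; whether the hardware rounds each intermediate product and partial sum to single precision, as this model implies, is exactly the accuracy question raised above):

```cpp
#include <cstdint>

// Scalar model of DPPS xmm1, xmm2, imm8 (one 128-bit lane). Bits 7:4 of
// imm8 select which element-wise products enter the sum; bits 3:0 select
// which destination lanes receive the sum (the rest are zeroed).
void dpps_model(const float a[4], const float b[4], uint8_t imm8, float out[4]) {
    float prod[4];
    for (int i = 0; i < 4; ++i)
        prod[i] = (imm8 & (1u << (4 + i))) ? a[i] * b[i] : 0.0f;
    // Whether these intermediate values are individually rounded to single
    // precision on real hardware is the open question from the text above.
    float sum = (prod[0] + prod[1]) + (prod[2] + prod[3]);
    for (int i = 0; i < 4; ++i)
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

In this model, `dot_product_broadcast` corresponds to `imm8 = 0xFF`, and `selective_dot_product_selective_broadcast` to `imm8 = 0xDB` (products from lanes 0, 2, 3; result stored to lanes 0, 1, 3).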
Also note that pre-multiplying `a` and `b` yields better codegen even without `-ffast-math` or the like, as seen in the linked collection.
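For reference, the pre-multiplied form of the broadcast variant looks roughly like this (a sketch, assuming `f32x4` is a GCC/Clang vector-extension type; the exact definition used in the Godbolt link may differ):

```cpp
// assumed definition of f32x4 (GCC/Clang vector extension)
typedef float f32x4 __attribute__((vector_size(16)));

// Doing the element-wise multiply first yields one packed multiply followed
// by a horizontal sum, which reportedly produces better codegen than the
// fully scalarized form even without -ffast-math.
f32x4 dot_product_broadcast_premul(f32x4 a, f32x4 b) {
    f32x4 m = a * b;                      // packed element-wise multiply
    float d = m[0] + m[1] + m[2] + m[3];  // horizontal sum
    f32x4 r = {d, d, d, d};
    return r;
}
```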