llvm / llvm-project

Math function vectorization failure with AVX-512 #94419

Open m13253 opened 5 months ago

m13253 commented 5 months ago

I am writing machine learning software that needs to compute “Y = exp(a⋅X)”.

Sample code:

#include <cmath>
#include <cstddef>

void func(float a[]) {
    for(std::size_t i = 0; i != 16; i++) {
        a[i] = std::exp(a[i] * 2.0f);
    }
}

Expected output:

push    rbx
mov     rbx, rdi
vmovups zmm0, ZMMWORD PTR [rdi]
vaddps  zmm0, zmm0, zmm0
call    _ZGVeN16v_expf@PLT
vmovups ZMMWORD PTR [rbx], zmm0
pop     rbx
vzeroupper
ret

Actual output: The generated code shuffles values between SIMD registers and general-purpose registers multiple times, but never calls a vectorized math function. (See https://godbolt.org/z/975T6xbss)

Clang version: 18.1.0

Compilation flags: clang++ -Ofast -fopenmp -fveclib=libmvec -mprefer-vector-width=512 -march=skylake-avx512


Alternate 1: without * 2.0f.

void func(float a[]) {
    for(std::size_t i = 0; i != 16; i++) {
        a[i] = std::exp(a[i]);
    }
}

Output: Calls the AVX2 math function, instead of the AVX-512 one.

Alternate 2: separate * 2.0f and std::exp.

void func(float a[]) {
    for(std::size_t i = 0; i != 16; i++) {
        a[i] *= 2.0f;
    }
    for(std::size_t i = 0; i != 16; i++) {
        a[i] = std::exp(a[i]);
    }
}

Output: Fails to use any vectorized math functions.

llvmbot commented 5 months ago

@llvm/issue-subscribers-backend-x86

Author: Star Brilliant (m13253)

omern1 commented 1 month ago

There isn't a 16-wide exp(float) in libmvec (or at least LLVM isn't aware of it), which is why your vectorized loop gets expanded into straight-line scalar code. If you remove "-mprefer-vector-width=512", you'll get two calls to the 8-wide exp(float).

m13253 commented 1 month ago

> There isn't a 16-wide exp(float) in libmvec (or at least LLVM isn't aware of it), which is why your vectorized loop gets expanded into straight-line scalar code. If you remove "-mprefer-vector-width=512", you'll get two calls to the 8-wide exp(float).

In my glibc (2.40+r16+gaa533d58ff-2), there is a 16-wide exp(float):

$ objdump -T /usr/lib/libmvec.so.1 | grep '_ZGV.*expf\?$'
0000000000008560 g    DF .text  000000000000003d  GLIBC_2.22  _ZGVcN8v_expf
00000000000062a0 g   iD  .text  0000000000000025  GLIBC_2.22  _ZGVbN2v_exp
0000000000007a70 g   iD  .text  0000000000000025  GLIBC_2.22  _ZGVbN4v_expf
0000000000006d90 g    DF .text  000000000000003d  GLIBC_2.22  _ZGVcN4v_exp
0000000000007f80 g   iD  .text  000000000000002e  GLIBC_2.22  _ZGVdN8v_expf
0000000000008d00 g   iD  .text  0000000000000049  GLIBC_2.22  _ZGVeN16v_expf (← This one)
00000000000074c0 g   iD  .text  0000000000000049  GLIBC_2.22  _ZGVeN8v_exp
00000000000067b0 g   iD  .text  000000000000002e  GLIBC_2.22  _ZGVdN4v_exp

Alternatively, I tried:

#include <cstddef>

extern "C" {
#pragma omp declare simd simdlen(16)
float expf(float x);
}

void func(float a[]) {
    #pragma omp simd
    for(std::size_t i = 0; i != 16; i++) {
        a[i] = expf(a[i] * 2.0f);
    }
}

The compiler doesn’t generate any SIMD calls for this version of the code.

Do you have any other ideas for solving this issue?