llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[X86] `@llvm.ceil.f16` is ~6x slower than GCC on Intel Raptor Lake #98630

Open overmighty opened 1 month ago

overmighty commented 1 month ago

https://godbolt.org/z/vc4Y1r6Mq

C++ code:

_Float16 foo(_Float16 x) {
    return static_cast<_Float16>(__builtin_ceilf(x));
}

GCC output with -O3 -march=raptorlake -fno-omit-frame-pointer (takes 1.33-1.64 ns on i7-13700H):

foo(_Float16):
        vpxor   xmm1, xmm1, xmm1
        vpblendw        xmm0, xmm1, xmm0, 1
        vcvtph2ps       xmm0, xmm0
        vroundss        xmm0, xmm0, xmm0, 10
        vinsertps       xmm0, xmm0, xmm0, 0xe
        vcvtps2ph       xmm0, xmm0, 4
        ret

Clang output with -O3 -march=raptorlake -fno-omit-frame-pointer (takes ~9.12 ns on i7-13700H):

foo(_Float16):                            # @foo(_Float16)
        push    rbp
        mov     rbp, rsp
        vpextrw eax, xmm0, 0
        vmovd   xmm0, eax
        vcvtph2ps       xmm0, xmm0
        vroundss        xmm0, xmm0, xmm0, 10
        vcvtps2ph       xmm0, xmm0, 4
        vmovd   eax, xmm0
        vpinsrw xmm0, xmm0, eax, 0
        pop     rbp
        ret

llvmbot commented 1 month ago

@llvm/issue-subscribers-backend-x86

Author: OverMighty (overmighty)

RKSimon commented 1 month ago

Something simpler like fabs bit-twiddle is just as bad:

_Float16 abs_f16(_Float16 x) {
    return static_cast<_Float16>(__builtin_elementwise_abs(x));
}

abs_f16(_Float16):                        # @abs_f16(_Float16)
        vpextrw $0, %xmm0, %eax
        vmovd   %eax, %xmm0
        vcvtph2ps       %xmm0, %xmm0
        vandps  .LCPI0_0(%rip), %xmm0, %xmm0
        vcvtps2ph       $4, %xmm0, %xmm0
        vmovd   %xmm0, %eax
        vpinsrw $0, %eax, %xmm0, %xmm0
        retq

jhuber6 commented 1 month ago

This should probably use __builtin_ceilf16, but InstCombine takes care of it anyway. I'm wondering why this case needs to set up a stack frame, since the frame doesn't appear to be used: https://godbolt.org/z/v4Y1eh5hv

overmighty commented 1 month ago

I used __builtin_ceilf because GCC without -mavx512fp16 handles __builtin_ceilf16 by simply emitting a call to libc's ceilf16.

andykaylor commented 1 month ago

@FreddyLeaf @phoebewang Can you look at this?

phoebewang commented 1 month ago

The GCC code generation is suboptimal too: the vcvtph2ps/vinsertps can also be eliminated. Clang handles the calling convention more rigidly here. The output gets better if we exclude the ABI handling: https://godbolt.org/z/Wcahzb79E

That said, FP16 support on targets without AVX512FP16 is mainly for verification. Performance is not an urgent goal for us, though it's always better to improve it.