Missed scalar min max optimization using vector min/max on Haswelll

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

http://llvm.org

Other

27.21k stars 11.14k forks source link

Missed scalar min max optimization using vector min/max on Haswelll #39245

Open davidbolvansky opened 5 years ago

davidbolvansky commented 5 years ago


Bugzilla Link	39898
Version	trunk
OS	Linux
CC	@topperc,@RKSimon,@rotateright

Extended Description

int minmax(int num) {
    int t = std::max(15, num);
    return std::min(1, t);
}

Current codegen -O3, x86-64:

minmax(int):                             # @minmax(int)
        cmp     edi, 14
        mov     ecx, 15
        cmovg   ecx, edi
        cmp     ecx, 2
        mov     eax, 1
        cmovl   eax, ecx
        ret

We can fold it to 1 (GCC does it).

davidbolvansky commented 4 years ago

int minmax_rev(int num, int n) {
    int t = std::min(n, num);
    return std::max(num - t, t);
}

GCC with 2.3 (vs. 2.8)

int minmax_rev(int num, int n, int y) {
    int t = std::min(n, num);
    return std::max(num ^ n, t);
}

2.3 vs 3.0

int minmax_rev(int num, int n, int y) {
    int t = std::min(n, num);
    return std::max(y, t);
}

But... GCC has 3.0, cmov haswell = 2.5

So it is probably not trivial to get this right.

RKSimon commented 4 years ago

Its not just on the SSE41 target feature - btver2 doesn't use VPMAXSD for instance: https://godbolt.org/z/4EH68l

rotateright commented 4 years ago

The problem in the description could be handled by InstSimplify.

I'm not sure which commit did it, but both of the potential instsimplify examples (description and comment 1) are handled now in trunk: https://godbolt.org/z/CtolTx

davidbolvansky commented 4 years ago

llvm-mca -mcpu=haswell

_Z10minmax_revii: # @_Z10minmax_revii
  mov eax, esi
  cmp edi, esi
  cmovg edi, esi
  add eax, -6
  cmp eax, edi
  cmovl eax, edi
  ret

Dispatch Width:    4
uOps Per Cycle:    3.56
IPC:               2.27
Block RThroughput: 2.8

_Z10minmax_revii:
  vmovdqa xmm0, XMMWORD PTR .LC0[rip]
  vmovd xmm1, esi
  vmovd xmm2, edi
  vpaddd xmm0, xmm1, xmm0
  vpminsd xmm1, xmm1, xmm2
  vpmaxsd xmm0, xmm0, xmm1
  vmovd eax, xmm0
  ret
.LC0:
  .long -6
  .long 0
  .long 0
  .long 0

Dispatch Width:    4
uOps Per Cycle:    3.24
IPC:               2.59
Block RThroughput: 2.5

So it is a win for Haswell. As Craig said, for Broadwell it is a pessimization.

topperc commented 4 years ago

Looks like gcc does this on earlier CPUs that Haswell. It's like tied to sse4.1 where pmaxsd/pminsd were introduced.

Broadwell improved CMOV latency from 2 cycles to 1 cycle and reduced it from 2 uops to 1. Except for CMOVBE/CMOVA which went from 3 cycles to 2 cycle and 3 uops to 2 uops.

davidbolvansky commented 4 years ago

Yes, but reality ...

hmmer's loop: https://godbolt.org/z/cg0aZs

You can read discussion here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154

RKSimon commented 4 years ago

GCC started to emit vpmaxsd even for scalars for Haswell and newer. They measured it is faster than cmov.

CMOV/PMAXSD codegen: https://godbolt.org/z/4VM_sN

davidbolvansky commented 4 years ago

GCC started to emit vpmaxsd even for scalars for Haswell and newer. They measured it is faster than cmov.

hmmer with GCC got a big improvement, overall SPEC score +1.

https://gcc.opensuse.org/gcc-old/SPEC/CINT/sb-czerny-head-64-2006/recent.html

rotateright commented 5 years ago

The problem in the description could be handled by InstSimplify. I'm pretty sure that we have incomplete code that tries to do that transform in InstCombine though.

It's a bit awkward in InstSimplify because we pass in the arguments of a select rather than the select, so then we can't use ValueTracking's matchSelectPattern to know that we have a min/max.

Given that, we should probably solve this in InstCombine using ConstantRange.

davidbolvansky commented 5 years ago

Similar not optimized case:

int minmax2(int num) {
    int t = std::min(10, num);
    return std::max(15, t);
}

As side note, GCC has a smarter code for case

int minmax_rev(int num) {
    int t = std::min(15 /* n */, num);
    return std::max(14 /* n - 1 */, t);
}

GCC:

minmax_rev(int):
        xor     eax, eax
        cmp     edi, 14
        setg    al
        add     eax, 14
        ret

Clang:

minmax_rev(int):                        # @minmax_rev(int)
        cmp     edi, 16
        mov     ecx, 15
        cmovl   ecx, edi
        cmp     ecx, 13
        mov     eax, 14
        cmovg   eax, ecx
        ret

Another case:

int minmax_rev(int num, int n) {
    int t = std::min(n, num);
    return std::max(n - 6, t);
}

GCC uses lea here and saves one instruction.

llvmbot commented 3 weeks ago

@llvm/issue-subscribers-backend-x86

Author: Dávid Bolvanský (davidbolvansky)

| | | | --- | --- | | Bugzilla Link | [39898](https://llvm.org/bz39898) | | Version | trunk | | OS | Linux | | CC | @topperc,@RKSimon,@rotateright | ## Extended Description int minmax(int num) { int t = std::max(15, num); return std::min(1, t); } Current codegen -O3, x86-64: minmax(int): # @minmax(int) cmp edi, 14 mov ecx, 15 cmovg ecx, edi cmp ecx, 2 mov eax, 1 cmovl eax, ecx ret We can fold it to 1 (GCC does it).