davidbolvansky opened 5 years ago
int minmax_rev(int num, int n) {
int t = std::min(n, num);
return std::max(num - t, t);
}
GCC gets RThroughput 2.3 (vs. 2.8):
int minmax_rev(int num, int n, int y) {
int t = std::min(n, num);
return std::max(num ^ n, t);
}
RThroughput 2.3 vs. 3.0:
int minmax_rev(int num, int n, int y) {
int t = std::min(n, num);
return std::max(y, t);
}
But... GCC has RThroughput 3.0, while cmov on Haswell = 2.5.
So it is probably not trivial to get this right.
It's not just tied to the SSE41 target feature - btver2 doesn't use VPMAXSD, for instance: https://godbolt.org/z/4EH68l
The problem in the description could be handled by InstSimplify.
I'm not sure which commit did it, but both of the potential instsimplify examples (description and comment 1) are handled now in trunk: https://godbolt.org/z/CtolTx
llvm-mca -mcpu=haswell
_Z10minmax_revii: # @_Z10minmax_revii
mov eax, esi
cmp edi, esi
cmovg edi, esi
add eax, -6
cmp eax, edi
cmovl eax, edi
ret
Dispatch Width: 4
uOps Per Cycle: 3.56
IPC: 2.27
Block RThroughput: 2.8
_Z10minmax_revii:
vmovdqa xmm0, XMMWORD PTR .LC0[rip]
vmovd xmm1, esi
vmovd xmm2, edi
vpaddd xmm0, xmm1, xmm0
vpminsd xmm1, xmm1, xmm2
vpmaxsd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
.LC0:
.long -6
.long 0
.long 0
.long 0
Dispatch Width: 4
uOps Per Cycle: 3.24
IPC: 2.59
Block RThroughput: 2.5
So it is a win for Haswell. As Craig said, for Broadwell it is a pessimization.
Looks like GCC does this on CPUs earlier than Haswell. It's likely tied to SSE4.1, where pmaxsd/pminsd were introduced.
Broadwell improved CMOV latency from 2 cycles to 1 cycle and reduced it from 2 uops to 1, except for CMOVBE/CMOVA, which went from 3 cycles to 2 cycles and from 3 uops to 2 uops.
Yes, but reality ...
hmmer's loop: https://godbolt.org/z/cg0aZs
You can read discussion here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154
GCC started to emit vpmaxsd even for scalars for Haswell and newer. They measured it is faster than cmov.
CMOV/PMAXSD codegen: https://godbolt.org/z/4VM_sN
hmmer with GCC got a big improvement, overall SPEC score +1.
https://gcc.opensuse.org/gcc-old/SPEC/CINT/sb-czerny-head-64-2006/recent.html
The problem in the description could be handled by InstSimplify. I'm pretty sure that we have incomplete code that tries to do that transform in InstCombine though.
It's a bit awkward in InstSimplify because we pass in the arguments of a select rather than the select, so then we can't use ValueTracking's matchSelectPattern to know that we have a min/max.
Given that, we should probably solve this in InstCombine using ConstantRange.
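To illustrate the ConstantRange idea, here is a simplified interval model (not the real llvm::ConstantRange API — the struct and helper names below are made up for illustration): track a signed range through smin/smax with a constant, and fold when the range collapses to a single value.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Hypothetical, simplified stand-in for LLVM's ConstantRange:
// an inclusive signed interval [Lo, Hi].
struct Range {
  int Lo, Hi;
};

// smin(x, C): the upper bound is clipped to C (lower bound too, if above C).
Range sminWithConst(Range R, int C) {
  return {std::min(R.Lo, C), std::min(R.Hi, C)};
}

// smax(x, C): the lower bound is raised to C (upper bound too, if below C).
Range smaxWithConst(Range R, int C) {
  return {std::max(R.Lo, C), std::max(R.Hi, C)};
}

// For minmax2 below: t = min(10, num) has range [INT_MIN, 10];
// max(15, t) then has range [15, 15] -- a single value, so the
// whole expression folds to the constant 15.
```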
Similar not optimized case:
int minmax2(int num) {
int t = std::min(10, num);
return std::max(15, t);
}
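Why this is foldable: min(10, num) can never exceed 10, so the outer max(15, t) is unconditionally 15. A minimal check:

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// min(10, num) is always <= 10 < 15, so the result is always 15.
int minmax2(int num) {
  int t = std::min(10, num);
  return std::max(15, t);
}
```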
As a side note, GCC generates smarter code for this case:
int minmax_rev(int num) {
int t = std::min(15 /* n */, num);
return std::max(14 /* n - 1 */, t);
}
GCC:
minmax_rev(int):
xor eax, eax
cmp edi, 14
setg al
add eax, 14
ret
Clang:
minmax_rev(int): # @minmax_rev(int)
cmp edi, 16
mov ecx, 15
cmovl ecx, edi
cmp ecx, 13
mov eax, 14
cmovg eax, ecx
ret
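The two sequences compute the same thing: with clamp bounds 14 and 15, the result can only be 14 or 15, so GCC reduces it to 14 plus the comparison bit (setg). A sketch of the equivalence (minmax_rev_gcc is a name invented here for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// The clamp as written in the source.
int minmax_rev(int num) {
  int t = std::min(15, num);
  return std::max(14, t);
}

// GCC's branchless form: xor eax,eax; cmp edi,14; setg al; add eax,14.
int minmax_rev_gcc(int num) {
  return 14 + (num > 14);
}
```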
Another case:
int minmax_rev(int num, int n) {
int t = std::min(n, num);
return std::max(n - 6, t);
}
GCC uses lea here and saves one instruction.
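For reference, barring signed overflow of n - 6 this expression is just num clamped into [n - 6, n], which is why GCC can materialize n - 6 with a single lea off n. A hedged equivalence check (`clamped` is a name introduced here):

```cpp
#include <algorithm>
#include <cassert>

// As written in the source.
int minmax_rev(int num, int n) {
  int t = std::min(n, num);
  return std::max(n - 6, t);
}

// Equivalent clamp of num into [n - 6, n]
// (assumes n - 6 does not overflow).
int clamped(int num, int n) {
  return std::min(n, std::max(n - 6, num));
}
```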
Author: Dávid Bolvanský (davidbolvansky)
Extended Description
Current codegen -O3, x86-64:
We can fold it to 1 (GCC does it).