RKSimon opened 7 years ago
mentioned in issue llvm/llvm-bugzilla-archive#33898
I think prior to Sandy Bridge, a flag read after a variable shift caused a flag stall that blocked the read from executing until the shift retired.
Nehalem didn't need a third source to process the flags because the flags never passed through the instruction. It guaranteed a flag stall instead.
I found confirmation of this when checking Intel's optimization manual: if Core2/Nehalem reads any flags produced by a shift with an immediate or cl count, the front end stalls until the shift instruction retires. That would make shr %cl, %edi / cmove REALLY bad, if the cmov can't even issue until the shift retires. So don't even think about it for -mtune=generic, but that doesn't rule it out for -mtune=sandybridge, which handles it totally differently.
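For concreteness, a minimal sketch of the problematic pattern (register choices are illustrative, not from the original report):

# flag-stall pattern on Core2/Nehalem
shrl   %cl, %edi        # variable-count shift: flag result only available at retirement
cmovel %esi, %eax       # reads ZF, so on pre-SnB the front end stalls until the SHR retires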
The shift-by-1 opcode has special support, and always produces flag outputs that other uops can have a data dependency on, so normal shift-bits-into-CF loops aren't horrible.
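For example, a loop like this (a hand-written sketch, not from the report) only uses the shift-by-1 opcode, so consuming its flags is a normal data dependency:

# scan for the lowest set bit (index tracking omitted for brevity)
.Lloop:
shrl $1, %edi        # shift-by-1 form: CF = bit shifted out, flags always written
jc   .Lfound         # consume CF directly: a data dependency, not a stall
jnz  .Lloop          # keep going while the remaining bits are non-zero
.Lfound: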
See my update to https://stackoverflow.com/questions/36510095/inc-instruction-vs-add-1-does-it-matter/36510865#36510865 for more details and links.
Related fun fact: variable-count shifts are 3 uops on SnB-family, but according to Intel's optimization manual: "When the flag result is not needed, one of these micro-ops may be discarded, providing better performance in many common usages."
But it can't be discarded until after it's issued, so the front-end bandwidth is already spent. It may still be better to use the shift's flag results if you want them, instead of needing yet another instruction, if the false dependency on flags before the shift isn't a problem.
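In other words (a sketch, assuming nothing reads the shift's flags):

shrl %cl, %edi       # issues as 3 uops on SnB-family; one flag uop may be discarded post-issue
movl %edi, %eax      # only the integer result is consumed, so the flag uops do no useful work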
Also note that, more recently, we use the BMI2 shifts instead, which don't update the flags at all.
shrx / test / cmov may still be a win over mov ecx,edi / shr r,cl / cmov.
First, by potentially saving mov instructions (shrx is non-destructive for i, and doesn't need the count in cl).
Second, because variable-count shl/shr/sar on Intel SnB-family are 3 uops: 1 for the actual shift, and 2 flag-handling uops. The flag-latency of SHR is 2c for input_flags -> output_flags (http://www.agner.org/optimize/blog/read.php?i=415#860), but here we're more interested in the latency from r and cl to output_flags.
Anyway, shr r32,cl is 3 uops for p06, but shrx + test is only 2 uops (and the TEST can run on any port). It loses on code-size, but could win on throughput if issue bandwidth is the bottleneck and larger code-size doesn't hurt uop-cache fetch or cause L1I misses.
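As a concrete sketch (register assignments are illustrative, assuming the count arrives in %edi and the value in %esi):

# legacy form: count must be in cl, and the shift is destructive
movl   %edi, %ecx        # move the count into cl
shrl   %cl, %esi         # 3 uops on SnB-family, p0/p6 only
cmovel %edx, %eax        # consume ZF

# BMI2 form: any count register, non-destructive, writes no flags
shrx   %edi, %esi, %ecx  # 1 uop: %ecx = %esi >> %edi
testl  %ecx, %ecx        # 1 uop, any port; needed because shrx sets no flags
cmovel %edx, %eax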
Variable-count shifts are worse in Intel SnB-family than they were in Core2/Nehalem; IDK how they managed to make SHR r,cl a single-uop instruction there, since it still needs to preserve flags if cl is zero. But a uop can't have more than 2 inputs (until Haswell for FMA, and Broadwell for adc/cmov and a couple other things beyond FMA). This means it can be a good idea to use SHRX when available, even if you could use SHR with no extra MOV instructions.
Anyway, if I get around to it, I'll test Skylake's SHR latency from r,cl to output flags. If it's 1c, then that's lower latency than shrx/test and SHR might be a win when we can use its ZF output. (But still keep in mind that SHR still has a dependency on the input flags in cases where you know cl is non-zero, because the hardware still doesn't.)
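To make that last point concrete, a sketch of the false dependency (instructions chosen purely for illustration):

addl %ebx, %eax      # writes all the flags
shrl %cl, %esi       # the flag-merge uops must wait for the ADD's flags,
                     # because if cl == 0 the old flags must be preserved,
                     # even when the compiler can prove cl != 0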
If branching on the flags, test/jcc macro-fuses (on AMD and Intel), so SHR/JCC is only a small code-size saving vs. SHR/TEST/JCC. There's no saving in uops anywhere, except on CPUs like Intel pre-SnB which don't try hard in the decoders to fuse TEST/JCC across a decode-group.
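i.e. something like (a sketch; labels are illustrative):

# with TEST: one more instruction, but TEST/JNZ macro-fuse into a single uop
shrl  %cl, %edi
testl %edi, %edi     # fuses with the jnz on SnB-family and modern AMD
jnz   .Lnonzero

# using the shift's own flags: smaller code, but no uop saving
shrl  %cl, %edi      # still 3 uops on SnB-family
jnz   .Lnonzero      # SHR/JCC can't macro-fuse, so this is 1 uop either way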
re: cmpl $128, %edi vs. shr $0x7,%edi:
Immediate shifts are not problematic, only variable-count. So I think using flags from that destructive SHR is a good thing for setcc/cmovcc (but not jcc, because it can't macro-fuse). If there's a lot of pressure on the shift ports (p0/p6 on Skylake), then CMP is good because it can run on any port. If the shift count were any lower, the CMP immediate would fit in an imm8, making it the clearly better choice (assuming you don't want the integer shift or sub result, only flags).
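For example (a sketch, assuming only the boolean is wanted):

shrl $7, %edi        # immediate count: no flag-stall or false-dependency concerns
sete %al             # setcc reads the shift's ZF directly, no CMP/TEST needed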
We do fold immediate non-zero shifts using isDefConvertible from X86InstrInfo.cpp.
Unfortunately, for the shift by 7, I'm guessing instcombine created that compare, but I didn't check.
You can't merge the shift and the test unless you can prove the shift is by a non-zero amount. Shifts of zero don't update flags.
Thanks, I'd missed that. The updated test case still fails (I don't have a real-world example of this, although I probably have variant cases where the count is known non-zero):
int test(unsigned i, unsigned j, int a, int b) {
  if (0 == (i >> (j | 1)))
    return a;
  return b;
}
define i32 @test(i32, i32, i32, i32) {
  %5 = or i32 %1, 1
  %6 = lshr i32 %0, %5
  %7 = icmp eq i32 %6, 0
  %8 = select i1 %7, i32 %2, i32 %3
  ret i32 %8
}
test(unsigned int, unsigned int, int, int):
  orl     $1, %esi
  movl    %ecx, %eax
  movl    %esi, %ecx
  shrl    %cl, %edi
  testl   %edi, %edi
  cmovnel %eax, %edx
  movl    %edx, %eax
  retq
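For reference, a hand-written sketch (not compiler output) of what reusing ZF could look like here, given that (j | 1) is provably non-zero so the shift's flags are always written:

test(unsigned int, unsigned int, int, int):
  orl    $1, %esi
  movl   %ecx, %eax      # b -> eax
  movl   %esi, %ecx      # count -> cl
  shrl   %cl, %edi       # writes ZF: the count is known non-zero
  cmovel %edx, %eax      # if (i >> (j | 1)) == 0, return a instead
  retq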
And by constant:
int test7(unsigned i, unsigned j, int a, int b) {
  if (0 == (i >> 7))
    return a;
  return b;
}
define i32 @test7(i32, i32, i32, i32) {
  %5 = icmp ult i32 %0, 128
  %6 = select i1 %5, i32 %2, i32 %3
  ret i32 %6
}
test7(unsigned int, unsigned int, int, int):
  cmpl    $128, %edi
  cmovael %ecx, %edx
  movl    %edx, %eax
  retq
Although the CMP takes 6 bytes, using SHR would take only 3 (but would be destructive):
test7(unsigned int, unsigned int, int, int):
  shr   $0x7, %edi
  mov   %ecx, %eax
  cmove %edx, %eax
  retq
Looks like the two-address instruction pass analyzes the cmov and sees that the output ultimately needs to go to eax, src1 comes from ecx, and src2 comes from edx. But it doesn't see that the shift needs cl, and thus the src map is wrong.
Looks like we had CMOVE before the peephole optimizer went crazy and decided to swap it.
The first point is fixed; the final mov is no longer there: https://gcc.godbolt.org/z/4Yc5G7fMq
Extended Description
llc -mcpu=btver2
1 - we could remove the final move if the CMOVNE was changed/commuted to CMOVE
2 - why aren't we using the ZF flag result from SHR instead of the extra TEST?
gcc gets (1) right, but still doesn't do (2)