RKSimon opened 7 years ago
mentioned in issue llvm/llvm-bugzilla-archive#33898
I think prior to Sandy Bridge, a flag read after a variable shift caused a flag stall that blocked the read from executing until the shift retired.
Nehalem didn't need a third source to process the flags because the flags never passed through the instruction. It guaranteed a flag stall instead.
I found confirmation of this when checking Intel's optimization manual: if Core2/Nehalem reads any flags produced by a shift with an immediate or cl count, the front end stalls until the shift instruction retires. That would make shr %cl, %edi / cmove REALLY bad, if the cmov can't even issue until the shift retires. So don't even think about it for -mtune=generic, but that doesn't rule it out for -mtune=sandybridge, which handles it totally differently.
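For concreteness, a minimal sketch of the problematic pattern (register choices are illustrative, not from the original report):

# flag-stall pattern on Core2/Nehalem
shrl   %cl, %edi        # variable-count shift: flag result only available at retirement
cmovel %esi, %eax       # reads ZF, so on pre-SnB the front end stalls until the SHR retires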
The shift-by-1 opcode has special support, and always produces flag outputs that other uops can have a data dependency on, so normal shift-bits-into-CF loops aren't horrible.
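For example, a loop like this (a hand-written sketch, not from the report) only uses the shift-by-1 opcode, so consuming its flags is a normal data dependency:

# scan for the lowest set bit (index tracking omitted for brevity)
.Lloop:
shrl $1, %edi        # shift-by-1 form: CF = bit shifted out, flags always written
jc   .Lfound         # consume CF directly: a data dependency, not a stall
jnz  .Lloop          # keep going while the remaining bits are non-zero
.Lfound: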
See my update to https://stackoverflow.com/questions/36510095/inc-instruction-vs-add-1-does-it-matter/36510865#36510865 for more details and links.
Related fun fact: variable-count shifts are 3 uops on SnB-family, but according to Intel's optimization manual: "When the flag result is not needed, one of these micro-ops may be discarded, providing better performance in many common usages."
But it can't be discarded until after it's issued, so the front-end bandwidth is already spent. It may still be better to use the shift's flag results if you want them, instead of needing yet another instruction, if the false dependency on flags before the shift isn't a problem.
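In other words (a sketch, assuming nothing reads the shift's flags):

shrl %cl, %edi       # issues as 3 uops on SnB-family; one flag uop may be discarded post-issue
movl %edi, %eax      # only the integer result is consumed, so the flag uops do no useful work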
Also note that, more recently, we use the BMI2 shifts instead, which don't update the flags at all.
shrx / test / cmov may still be a win over mov ecx,edi / shr r,cl / cmov.
First, by potentially saving mov instructions (shrx is non-destructive for i, and doesn't need the count in cl).
Second, because variable-count shl/shr/sar on Intel SnB-family are 3 uops: 1 for the actual shift, and 2 flag-handling uops. The flag-latency of SHR is 2c for input_flags -> output_flags (http://www.agner.org/optimize/blog/read.php?i=415#860), but here we're more interested in the latency from r and cl to output_flags.
Anyway, shr r32,cl is 3 uops for p06, but shrx + test is only 2 uops (and the TEST can run on any port). It loses on code-size, but could win on throughput if issue bandwidth is the bottleneck and larger code-size doesn't hurt uop-cache fetch or cause L1I misses.
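As a concrete sketch (register assignments are illustrative, assuming the count arrives in %edi and the value in %esi):

# legacy form: count must be in cl, and the shift is destructive
movl   %edi, %ecx        # move the count into cl
shrl   %cl, %esi         # 3 uops on SnB-family, p0/p6 only
cmovel %edx, %eax        # consume ZF

# BMI2 form: any count register, non-destructive, writes no flags
shrx   %edi, %esi, %ecx  # 1 uop: %ecx = %esi >> %edi
testl  %ecx, %ecx        # 1 uop, any port; needed because shrx sets no flags
cmovel %edx, %eax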
Variable-count shifts are worse in Intel SnB-family than they were in Core2/Nehalem; IDK how they managed to make SHR r,cl a single-uop instruction there, since it still needs to preserve flags if cl is zero. But a uop can't have more than 2 inputs (until Haswell for FMA, and Broadwell for adc/cmov and a couple other things beyond FMA). This means it can be a good idea to use SHRX when available, even if you could use SHR with no extra MOV instructions.
Anyway, if I get around to it, I'll test Skylake's SHR latency from r,cl to output flags. If it's 1c, then that's lower latency than shrx/test and SHR might be a win when we can use its ZF output. (But still keep in mind that SHR still has a dependency on the input flags in cases where you know cl is non-zero, because the hardware still doesn't.)
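To make that last point concrete, a sketch of the false dependency (instructions chosen purely for illustration):

addl %ebx, %eax      # writes all the flags
shrl %cl, %esi       # the flag-merge uops must wait for the ADD's flags,
                     # because if cl == 0 the old flags must be preserved,
                     # even when the compiler can prove cl != 0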
If branching on the flags, test/jcc macro-fuses (on AMD and Intel), so SHR/JCC is only a small code-size saving vs. SHR/TEST/JCC. There's no saving in uops anywhere, except on CPUs like Intel pre-SnB which don't try hard in the decoders to fuse TEST/JCC across a decode-group.
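i.e. something like (a sketch; labels are illustrative):

# with TEST: one more instruction, but TEST/JNZ macro-fuse into a single uop
shrl  %cl, %edi
testl %edi, %edi     # fuses with the jnz on SnB-family and modern AMD
jnz   .Lnonzero

# using the shift's own flags: smaller code, but no uop saving
shrl  %cl, %edi      # still 3 uops on SnB-family
jnz   .Lnonzero      # SHR/JCC can't macro-fuse, so this is 1 uop either way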
re: cmpl $128, %edi vs. shr $0x7,%edi:
Immediate shifts are not problematic, only variable-count. So I think using flags from that destructive SHR is a good thing for setcc/cmovcc (but not jcc, because it can't macro-fuse). If there's a lot of pressure on the shift ports (p0/p6 on Skylake), then CMP is good because it can run on any port. If the shift count were any lower, the CMP immediate would fit in an imm8, making it the clearly better choice (assuming you don't want the integer shift or sub result, only flags).
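For example (a sketch, assuming only the boolean is wanted):

shrl $7, %edi        # immediate count: no flag-stall or false-dependency concerns
sete %al             # setcc reads the shift's ZF directly, no CMP/TEST needed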
We do fold immediate non-zero shifts using isDefConvertible from X86InstrInfo.cpp.
Unfortunately, for the shift by 7, I'm guessing instcombine created that compare, but I didn't check.
You can't merge the shift and the test unless you can prove the shift is by a non-zero amount. Shifts of zero don't update flags.
Thanks, I'd missed that. The updated test case still fails (I don't have a real-world example of this, although I probably have variant cases where the count is known non-zero):
int test(unsigned i, unsigned j, int a, int b) {
  if (0 == (i >> (j | 1)))
    return a;
  return b;
}
define i32 @test(i32, i32, i32, i32) {
  %5 = or i32 %1, 1
  %6 = lshr i32 %0, %5
  %7 = icmp eq i32 %6, 0
  %8 = select i1 %7, i32 %2, i32 %3
  ret i32 %8
}
test(unsigned int, unsigned int, int, int):
  orl     $1, %esi
  movl    %ecx, %eax
  movl    %esi, %ecx
  shrl    %cl, %edi
  testl   %edi, %edi
  cmovnel %eax, %edx
  movl    %edx, %eax
  retq
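For reference, a hand-written sketch (not compiler output) of what reusing ZF could look like here, given that (j | 1) is provably non-zero so the shift's flags are always written:

test(unsigned int, unsigned int, int, int):
  orl    $1, %esi
  movl   %ecx, %eax      # b -> eax
  movl   %esi, %ecx      # count -> cl
  shrl   %cl, %edi       # writes ZF: the count is known non-zero
  cmovel %edx, %eax      # if (i >> (j | 1)) == 0, return a instead
  retq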
And by constant:
int test7(unsigned i, unsigned j, int a, int b) {
  if (0 == (i >> 7))
    return a;
  return b;
}
define i32 @test7(i32, i32, i32, i32) {
  %5 = icmp ult i32 %0, 128
  %6 = select i1 %5, i32 %2, i32 %3
  ret i32 %6
}
test7(unsigned int, unsigned int, int, int):
  cmpl    $128, %edi
  cmovael %ecx, %edx
  movl    %edx, %eax
  retq
Although the CMP takes 6 bytes, using SHR would take only 3 (but would be destructive):
test7(unsigned int, unsigned int, int, int):
  shr   $0x7, %edi
  mov   %ecx, %eax
  cmove %edx, %eax
  retq
Looks like the two-address instruction pass analyzes the cmov and sees that the output ultimately needs to go to eax, src1 comes from ecx, and src2 comes from edx. But it doesn't see that the shift needs cl, and thus the src map is wrong.
Looks like we had CMOVE before the peephole optimizer went crazy and decided to swap it.
The first point is fixed; the final mov is no longer there: https://gcc.godbolt.org/z/4Yc5G7fMq
Extended Description
llc -mcpu=btver2
1 - we could remove the final move if the CMOVNE was changed/commuted to CMOVE
2 - why aren't we using the ZF flag result from SHR instead of the extra TEST?
gcc gets (1) right, but still doesn't do (2)