llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.1k stars 12.01k forks source link

LLVM is unable to combine multiple equivalent UCOMISDrr instructions #42433

Open nagisa opened 5 years ago

nagisa commented 5 years ago
Bugzilla Link 43088
Version trunk
OS Windows NT
CC @topperc,@efriedma-quic,@RKSimon,@rotateright

Extended Description

Given code like this (keep in mind that this applies to most code examples where there are multiple fcmp instructions with the same arguments but differing comparison code):

define i8 @test(double, double) unnamed_addr #0 {
start:
    %2 = fcmp ole double %0, %1
    %3 = fcmp oge double %0, %1
    %spec.select1.i = select i1 %3, i8 1, i8 2
    %not..i = xor i1 %3, true
    %spec.select.i = sext i1 %not..i to i8
    %_0.0.i = select i1 %2, i8 %spec.select.i, i8 %spec.select1.i
    ret i8 %_0.0.i
}

which returns -1, 0, 1, or 2 depending on whether the comparison between the two arguments is less-than, equal, greater-than or either-is-nan. On x86 this information is made available by a single execution of the UCOMISD instruction.

However this IR ends up being built into something like:

bb.0.start:
  successors: %bb.1(0x40000000), %bb.2(0x40000000); %bb.1(50.00%), %bb.2(50.00%)
  liveins: $xmm0, $xmm1
  %1:fr64 = COPY $xmm1
  %0:fr64 = COPY $xmm0
  UCOMISDrr %0:fr64, %1:fr64, implicit-def $eflags
  %2:gr8 = SETAEr implicit $eflags
  %3:gr8 = MOV8ri 2
  %4:gr8 = SUB8rr %3:gr8(tied-def 0), %2:gr8, implicit-def dead $eflags
  %5:gr8 = DEC8r %2:gr8(tied-def 0), implicit-def dead $eflags
  UCOMISDrr %1:fr64, %0:fr64, implicit-def $eflags
  JAE_1 %bb.2, implicit $eflags

bb.1.start:
; predecessors: %bb.0
  successors: %bb.2(0x80000000); %bb.2(100.00%)

bb.2.start:
; predecessors: %bb.0, %bb.1

  %6:gr8 = PHI %4:gr8, %bb.1, %5:gr8, %bb.0
  $al = COPY %6:gr8
  RET 0, $al

and the duplicate UCOMISDrr end up never being removed.

RKSimon commented 5 years ago

Current Codegen: https://godbolt.org/z/URP0xP

As Eli said, on trunk the codegen is branchless:

test: # @test
  vucomisd %xmm1, %xmm0
  movb $2, %cl
  setae %al
  subb %al, %cl
  decb %al
  vucomisd %xmm0, %xmm1
  movzbl %al, %edx
  movzbl %cl, %eax
  cmovael %edx, %eax
  retq
efriedma-quic commented 5 years ago

I was trying trunk, not 8.0. We generate a branchless sequence on trunk. If we convert the second select into a branch, maybe we can do something.

Replacing a ucomisd with a setcc+test doesn't seem profitable, unless ucomisd is really expensive on some CPUs.

nagisa commented 5 years ago

I believe that the generated code could simply SETcc most of the flags it needs into general purpose registers before actually doing flag-clobbering computations on them.

In this particular example, however, the fact that flag clobbering does not matter at all is easily visible from the generated assembly which looks like this:

test:
# %bb.0:                                # %start
    ucomisd %xmm1, %xmm0
    setae   %al
    ucomisd %xmm0, %xmm1
    jae .LBB0_1
# %bb.2:                                # %start
    movb    $2, %cl
    subb    %al, %cl
    movl    %ecx, %eax
    retq
.LBB0_1:
    decb    %al
    retq

Here the 2nd ucomisd is trivially removable with the only other change that is necessary is inversion of the cc in Jcc instruction that follows.

efriedma-quic commented 5 years ago

I'm not sure how you expect the second UCOMISDrr to be removed; the SUB8rr and DEC8r clobber EFLAGS. I guess we could try to use LEA instead.

RKSimon commented 2 years ago

With latest we have better eflags handling, but now have 3 vucomisd ops:

test: # @test
  xorl %eax, %eax
  vucomisd %xmm1, %xmm0
  adcb $1, %al
  vucomisd %xmm1, %xmm0
  movzbl %al, %eax
  sbbl %ecx, %ecx
  vucomisd %xmm0, %xmm1
  cmovael %ecx, %eax
  retq