llvm / llvm-project


[X86] any_extend is pessimized by selection #41489

Open LebedevRI opened 5 years ago

LebedevRI commented 5 years ago
Bugzilla Link 42144
Version trunk
OS Linux
CC @topperc,@RKSimon,@rotateright

Extended Description

define dso_local zeroext i1 @_Z2t0jj(i32, i32) local_unnamed_addr #0 {
  %3 = add i32 %1, -1
  %4 = shl i32 1, %3
  %5 = and i32 %4, %0
  %6 = icmp ne i32 %5, 0
  ret i1 %6
}

results in:

# %bb.0:
        decb    %sil
        movzbl  %sil, %eax
        btl     %eax, %edi
        setb    %al
        retq

That movzbl shouldn't be there.
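For reference, the IR above corresponds to C source along these lines (a minimal reproducer; the function name `t0` is taken from the mangled symbol, the parameter names are illustrative):

```c
#include <stdbool.h>

// Test whether bit (b - 1) of a is set -- the add/shl/and/icmp in the IR.
bool t0(unsigned a, unsigned b) {
    return a & (1u << (b - 1));
}
```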

Combining: t0: ch = EntryToken
Optimized legalized selection DAG: %bb.0 '_Z2t0jj:'
SelectionDAG has 16 nodes:
  t0: ch = EntryToken
        t2: i32,ch = CopyFromReg t0, Register:i32 %0
              t4: i32,ch = CopyFromReg t0, Register:i32 %1
            t21: i8 = truncate t4
          t23: i8 = add t21, Constant:i8<-1>
        t27: i32 = any_extend t23
      t29: i32 = X86ISD::BT t2, t27
    t30: i8 = X86ISD::SETCC Constant:i8<2>, t29
  t17: ch,glue = CopyToReg t0, Register:i8 $al, t30
  t18: ch = X86ISD::RET_FLAG t17, TargetConstant:i32<0>, Register:i8 $al, t17:1

ISEL: Starting selection on root node: t27: i32 = any_extend t23
ISEL: Starting pattern match
  Initial Opcode index to 126894
  Match failed at index 126898
  Continuing at 127264
  TypeSwitch[i32] from 127267 to 127270
  Skipped scope entry (due to false predicate) at index 127276, continuing at 127285
  Morphed node: t27: i32 = MOVZX32rr8 t23
ISEL: Match complete!

===== Instruction selection ends:
Selected selection DAG: %bb.0 '_Z2t0jj:'
SelectionDAG has 18 nodes:
  t0: ch = EntryToken
          t2: i32,ch = CopyFromReg t0, Register:i32 %0
                t4: i32,ch = CopyFromReg t0, Register:i32 %1
              t21: i8 = EXTRACT_SUBREG t4, TargetConstant:i32<1>
            t23: i8,i32 = DEC8r t21
          t27: i32 = MOVZX32rr8 t23
        t29: i32 = BT32rr t2, t27
      t33: ch,glue = CopyToReg t0, Register:i32 $eflags, t29
    t30: i8 = SETCCr TargetConstant:i8<2>, t33:1
  t17: ch,glue = CopyToReg t0, Register:i8 $al, t30
  t18: ch = RET TargetConstant:i32<0>, Register:i8 $al, t17, t17:1

I'm not quite sure where to look, though.

topperc commented 5 years ago

SelectionDAGBuilder inserted a trunc before the shl during DAG construction. This caused DAGCombine to shrink the add -1 to 8 bits, because isTypeDesirableForOp didn't tell it not to. Lowering for the icmp created the BT from the and/shl, but needed to extend back to 32 bits to match the register size required for the BT count input.

I'm pretty sure that even though BT only needs 5 or 6 bits for the count, the partial-register handling in the frontend and register renaming tracks it as an access to the full 16/32/64-bit register. This means that on Sandy Bridge and later it will trigger an H-register merge if needed. I think the shift amounts for SHLX/SARX/SHRX are the same. The old shift instructions that always use CL will not trigger an H-register merge.
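As a model of why BT only needs the low bits of the count: for a register operand, BT takes the bit index modulo the operand width. A sketch in C (`bt32_model` is a hypothetical helper, not a real intrinsic):

```c
#include <stdint.h>

// Model of BT r32, r32: the bit index is taken modulo 32, so only the
// low 5 bits of the 32-bit count register affect the result.
static int bt32_model(uint32_t value, uint32_t count) {
    return (value >> (count & 31)) & 1;
}
```

Counts 2 and 34 select the same bit, even though the hardware still treats the count as a read of the full register.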

Promote all the things is probably the best way to fix this.

rotateright commented 5 years ago

Not sure if it's the same, but I was looking at a similar problem that led me to this bit of X86InstrCompiler.td:

// anyext. Define these to do an explicit zero-extend to
// avoid partial-register updates.

To avoid this, the big change that I think we want -- but probably requires many intermediate fixes -- is to promote all 8-bit ops to 32-bit.

pcordes commented 2 years ago

> Promote all the things is probably the best way to fix this.

Yes: since narrow function args are (unofficially) required to be extended to 32 bits, reading ESI instead of SIL won't cause a partial-register stall. (That's only a thing on obsolete P6-family anyway; later uarches instead have false dependencies when writing partial registers, and SnB itself has efficient merging.)

Of course, if we have efficient SHRX, I think it's better not to go through FLAGS at all, like how we compile for ARM64 (https://godbolt.org/z/a9xxTPjWb) with SUB/LSR/AND:

  dec  %esi
  shrx %esi, %edi, %eax
  and  $1, %eax

Variable-count shifts without BMI2 are unfortunately more expensive than bt on Intel SnB-family CPUs. But without -mbmi2 and with -mtune=znver1 (or any earlier AMD CPU that lacks BMI2, like bdver1 or k10), we could use

  lea -1(%rsi), %ecx
  mov  %edi, %eax
  shr  %cl, %eax
  and  $1, %eax

(Or to take the mov %edi, %eax off the critical path, mov $1, %eax / and %edi, %eax, but that costs a few extra bytes of code size.)


There's also the possibility of shifting by one more than the count, to shift the bit into CF. But x86 masks the shift count, so a & (1<<(32 - 1)) is well-defined yet not equivalent to CF after a >> 32: that shift is the same as a >> 0, which leaves FLAGS unmodified. But if a wider operand size is available, we can use that:
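The masking pitfall can be sketched in C; `shr32_model` and `shr64_cf` are illustrative helpers modeling the hardware behavior, not real intrinsics:

```c
#include <stdint.h>

// 32-bit x86 shifts mask the count to 5 bits, so a count of 32 acts like
// a count of 0 (and a count of 0 leaves FLAGS unmodified).
static uint32_t shr32_model(uint32_t a, unsigned count) {
    return a >> (count & 31);
}

// CF after a 64-bit SHR by a nonzero count is the last bit shifted out;
// 64-bit operand size masks the count to 6 bits instead of 5.
static int shr64_cf(uint64_t a, unsigned count) {
    count &= 63;
    return (a >> (count - 1)) & 1;
}
```

With only bit 31 set (a = 0x80000000), testing that bit via "shift by count + 1" needs a shift by 32: the 5-bit mask turns that into a shift by 0 (shr32_model(a, 32) == a), while with 64-bit operand size the shift happens and CF picks up bit 31 (shr64_cf(a, 32) == 1), as in the shr %cl, %rdi / setc %al sequence above.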

 mov %esi, %ecx
 shr  %cl, %rdi       # note 64-bit operand-size so a shift count of 31+1 = 32 still works.
 setc %al               # partial register write false dependency: we haven't written EAX as part of this dep chain

IDK if this could be useful; maybe on AMD. On P6-family, reading a FLAG result after a variable-count shift stalls the front-end until the shift retires, vs. being 3 uops on Sandybridge-family for FLAGS-merging. (That's how they managed to make it 1 uop on P6 while still respecting the fact that shifts with count=0 must leave FLAGS unmodified, annoying CISC semantics that are basically part of the x86 tax for modern implementations).