llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

vector ctlz should intelligently switch to scalar algorithm #39020

Open llvmbot opened 5 years ago

llvmbot commented 5 years ago
Bugzilla Link 39672
Version trunk
OS MacOS X
Reporter LLVM Bugzilla Contributor
CC @alexey-bataev, @RKSimon, @rotateright

Extended Description

With few vector lanes and a large lane width (e.g. v2i64), LLVM's ctlz intrinsic can actually be slower than a scalar implementation.

The following IR compiles to a complicated divide-and-conquer sequence (with -O3 -mcpu=sandybridge):

define <2 x i64> @foo(<2 x i64>) local_unnamed_addr #0 {
    %2 = call <2 x i64> @llvm.ctlz.v2i64(<2 x i64> %0, i1 true)
    ret <2 x i64> %2
}

; declaration and attribute group added so the snippet assembles standalone
declare <2 x i64> @llvm.ctlz.v2i64(<2 x i64>, i1)
attributes #0 = { nounwind }

This produces the following assembly:

.LCPI0_0:
  .zero 16,15
.LCPI0_1:
  .byte 4 # 0x4
  .byte 3 # 0x3
  .byte 2 # 0x2
  .byte 2 # 0x2
  .byte 1 # 0x1
  .byte 1 # 0x1
  .byte 1 # 0x1
  .byte 1 # 0x1
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
foo: # @foo
  vmovdqa xmm1, xmmword ptr [rip + .LCPI0_0] # xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
  vpand xmm2, xmm0, xmm1
  vmovdqa xmm3, xmmword ptr [rip + .LCPI0_1] # xmm3 = [4,3,2,2,1,1,1,1,0,0,0,0,0,0,0,0]
  vpshufb xmm2, xmm3, xmm2
  vpsrlw xmm4, xmm0, 4
  vpand xmm1, xmm4, xmm1
  vpxor xmm4, xmm4, xmm4
  vpcmpeqb xmm5, xmm1, xmm4
  vpand xmm2, xmm2, xmm5
  vpshufb xmm1, xmm3, xmm1
  vpaddb xmm1, xmm2, xmm1
  vpcmpeqb xmm2, xmm0, xmm4
  vpsrlw xmm2, xmm2, 8
  vpand xmm2, xmm1, xmm2
  vpsrlw xmm1, xmm1, 8
  vpaddw xmm1, xmm1, xmm2
  vpcmpeqw xmm2, xmm0, xmm4
  vpsrld xmm2, xmm2, 16
  vpand xmm2, xmm1, xmm2
  vpsrld xmm1, xmm1, 16
  vpaddd xmm1, xmm1, xmm2
  vpcmpeqd xmm0, xmm0, xmm4
  vpsrlq xmm0, xmm0, 32
  vpand xmm0, xmm1, xmm0
  vpsrlq xmm1, xmm1, 32
  vpaddq xmm0, xmm1, xmm0
  ret

The scalar algorithm is merely something like:

foo: # @foo
  vmovdqa xmm0, xmmword ptr [rdi]
  vmovq rax, xmm0
  vpextrq rcx, xmm0, 1
  bsr rax, rax
  xor rax, 63
  bsr rcx, rcx
  xor rcx, 63
  vmovq xmm0, rcx
  vmovq xmm1, rax
  vpunpcklqdq xmm0, xmm1, xmm0 # xmm0 = xmm1[0],xmm0[0]
  vmovdqa xmmword ptr [rdi], xmm0
  ret

compare: https://godbolt.org/z/ZdW3Yj
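
For reference, here is a sketch (my own illustration, not code from the report) of how the per-lane scalar approach can be written in C++ with SSE4.1 intrinsics; built with -O3 -msse4.1 (or -mcpu=sandybridge, where lzcnt is unavailable), clang should compile the __builtin_clzll calls to essentially the bsr/xor-63 sequence shown above:

#include <immintrin.h>

// Per-lane scalar ctlz for a v2i64: extract each 64-bit lane, count its
// leading zeros with a scalar instruction, and rebuild the vector.
// Like the is_zero_undef=true form of @llvm.ctlz, __builtin_clzll is
// undefined for an input of 0.
__m128i ctlz_v2i64_scalar(__m128i v) {
  unsigned long long lo = (unsigned long long)_mm_cvtsi128_si64(v);
  unsigned long long hi = (unsigned long long)_mm_extract_epi64(v, 1);
  long long clo = __builtin_clzll(lo);
  long long chi = __builtin_clzll(hi);
  return _mm_set_epi64x(chi, clo); // high lane goes in the first argument
}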

I have verified on my machine (Sandy Bridge) that the scalar algorithm is significantly faster (up to 2x) for v2i64 and v2i32, and similarly for v4i32 and v4i64.

RKSimon commented 5 years ago

DAGCombiner is too late to make a decision about this; we need a TTI cost-model-driven scalarization stage. Not sure if this is the kind of thing we'd want in SLP?
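
To illustrate the direction, a rough sketch of such a cost comparison against the current TargetTransformInfo interface (purely my own illustration, not an existing pass; the helper name and cost kind are made up, and a real pass would also have to account for the extract/insert overhead of scalarizing):

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/IntrinsicInst.h"

using namespace llvm;

// Sketch only: decide whether a vector ctlz call should stay vectorized,
// by comparing the TTI cost of the vector intrinsic against the summed
// cost of one scalar ctlz per lane.
static bool keepVectorCtlz(const IntrinsicInst &II,
                           const TargetTransformInfo &TTI) {
  auto *VecTy = cast<FixedVectorType>(II.getType());
  Type *EltTy = VecTy->getElementType();
  Type *I1Ty = Type::getInt1Ty(VecTy->getContext());
  auto CostKind = TargetTransformInfo::TCK_RecipThroughput;

  // Cost of the intrinsic on the full vector type.
  IntrinsicCostAttributes VecAttrs(Intrinsic::ctlz, VecTy, {VecTy, I1Ty});
  InstructionCost VecCost = TTI.getIntrinsicInstrCost(VecAttrs, CostKind);

  // Cost of one scalar ctlz per lane (extract/insert costs ignored here).
  IntrinsicCostAttributes SclAttrs(Intrinsic::ctlz, EltTy, {EltTy, I1Ty});
  InstructionCost SclCost =
      TTI.getIntrinsicInstrCost(SclAttrs, CostKind) * VecTy->getNumElements();

  return VecCost <= SclCost;
}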