llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

vector ctlz should intelligently switch to scalar algorithm #39020

Open llvmbot opened 5 years ago

llvmbot commented 5 years ago
Bugzilla Link 39672
Version trunk
OS MacOS X
Reporter LLVM Bugzilla Contributor
CC @alexey-bataev, @RKSimon, @rotateright

Extended Description

With few vector lanes and a large lane width (e.g. v2i64), LLVM's ctlz intrinsic can actually be slower than a scalar implementation.

The following IR compiles to a complicated divide-and-conquer sequence (with -O3 -mcpu=sandybridge):

define <2 x i64> @foo(<2 x i64>) local_unnamed_addr #0 {
    %2 = call <2 x i64> @llvm.ctlz.v2i64(<2 x i64> %0, i1 true)
    ret <2 x i64> %2
}

; declaration and attribute group added so the snippet assembles standalone
declare <2 x i64> @llvm.ctlz.v2i64(<2 x i64>, i1)
attributes #0 = { nounwind }

This produces the following assembly:

.LCPI0_0:
  .zero 16,15
.LCPI0_1:
  .byte 4 # 0x4
  .byte 3 # 0x3
  .byte 2 # 0x2
  .byte 2 # 0x2
  .byte 1 # 0x1
  .byte 1 # 0x1
  .byte 1 # 0x1
  .byte 1 # 0x1
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
foo: # @foo
  vmovdqa xmm1, xmmword ptr [rip + .LCPI0_0] # xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
  vpand xmm2, xmm0, xmm1
  vmovdqa xmm3, xmmword ptr [rip + .LCPI0_1] # xmm3 = [4,3,2,2,1,1,1,1,0,0,0,0,0,0,0,0]
  vpshufb xmm2, xmm3, xmm2
  vpsrlw xmm4, xmm0, 4
  vpand xmm1, xmm4, xmm1
  vpxor xmm4, xmm4, xmm4
  vpcmpeqb xmm5, xmm1, xmm4
  vpand xmm2, xmm2, xmm5
  vpshufb xmm1, xmm3, xmm1
  vpaddb xmm1, xmm2, xmm1
  vpcmpeqb xmm2, xmm0, xmm4
  vpsrlw xmm2, xmm2, 8
  vpand xmm2, xmm1, xmm2
  vpsrlw xmm1, xmm1, 8
  vpaddw xmm1, xmm1, xmm2
  vpcmpeqw xmm2, xmm0, xmm4
  vpsrld xmm2, xmm2, 16
  vpand xmm2, xmm1, xmm2
  vpsrld xmm1, xmm1, 16
  vpaddd xmm1, xmm1, xmm2
  vpcmpeqd xmm0, xmm0, xmm4
  vpsrlq xmm0, xmm0, 32
  vpand xmm0, xmm1, xmm0
  vpsrlq xmm1, xmm1, 32
  vpaddq xmm0, xmm1, xmm0
  ret

The scalar algorithm is merely something like:

foo: # @foo
  vmovdqa xmm0, xmmword ptr [rdi]
  vmovq rax, xmm0
  vpextrq rcx, xmm0, 1
  bsr rax, rax
  xor rax, 63
  bsr rcx, rcx
  xor rcx, 63
  vmovq xmm0, rcx
  vmovq xmm1, rax
  vpunpcklqdq xmm0, xmm1, xmm0 # xmm0 = xmm1[0],xmm0[0]
  vmovdqa xmmword ptr [rdi], xmm0
  ret

compare: https://godbolt.org/z/ZdW3Yj
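
For reference, here is a sketch (my own illustration, not code from the report) of how the per-lane scalar approach can be written in C++ with SSE4.1 intrinsics; built with -O3 -msse4.1 (or -mcpu=sandybridge, where lzcnt is unavailable), clang should compile the __builtin_clzll calls to essentially the bsr/xor-63 sequence shown above:

#include <immintrin.h>

// Per-lane scalar ctlz for a v2i64: extract each 64-bit lane, count its
// leading zeros with a scalar instruction, and rebuild the vector.
// Like the is_zero_undef=true form of @llvm.ctlz, __builtin_clzll is
// undefined for an input of 0.
__m128i ctlz_v2i64_scalar(__m128i v) {
  unsigned long long lo = (unsigned long long)_mm_cvtsi128_si64(v);
  unsigned long long hi = (unsigned long long)_mm_extract_epi64(v, 1);
  long long clo = __builtin_clzll(lo);
  long long chi = __builtin_clzll(hi);
  return _mm_set_epi64x(chi, clo); // high lane goes in the first argument
}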

I have verified on my machine (Sandy Bridge) that the scalar algorithm is significantly faster (up to 2x) for v2i64 and v2i32, and similarly for v4i32 and v4i64.

RKSimon commented 5 years ago

DAGCombiner is too late to make a decision about this; we need a TTI cost-model-driven scalarization stage. Not sure if this is the kind of thing we'd want in SLP?
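
To illustrate the direction, a rough sketch of such a cost comparison against the current TargetTransformInfo interface (purely my own illustration, not an existing pass; the helper name and cost kind are made up, and a real pass would also have to account for the extract/insert overhead of scalarizing):

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/IntrinsicInst.h"

using namespace llvm;

// Sketch only: decide whether a vector ctlz call should stay vectorized,
// by comparing the TTI cost of the vector intrinsic against the summed
// cost of one scalar ctlz per lane.
static bool keepVectorCtlz(const IntrinsicInst &II,
                           const TargetTransformInfo &TTI) {
  auto *VecTy = cast<FixedVectorType>(II.getType());
  Type *EltTy = VecTy->getElementType();
  Type *I1Ty = Type::getInt1Ty(VecTy->getContext());
  auto CostKind = TargetTransformInfo::TCK_RecipThroughput;

  // Cost of the intrinsic on the full vector type.
  IntrinsicCostAttributes VecAttrs(Intrinsic::ctlz, VecTy, {VecTy, I1Ty});
  InstructionCost VecCost = TTI.getIntrinsicInstrCost(VecAttrs, CostKind);

  // Cost of one scalar ctlz per lane (extract/insert costs ignored here).
  IntrinsicCostAttributes SclAttrs(Intrinsic::ctlz, EltTy, {EltTy, I1Ty});
  InstructionCost SclCost =
      TTI.getIntrinsicInstrCost(SclAttrs, CostKind) * VecTy->getNumElements();

  return VecCost <= SclCost;
}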