llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Bad code generation for __builtin_clrsbl #41703

Open · GabrielRavier opened 5 years ago

GabrielRavier commented 5 years ago
Bugzilla Link 42358
Version trunk
OS Linux
CC @topperc,@RKSimon,@rotateright

Extended Description

When compiling this code:

int clrsbl(unsigned long x)
{
    return __builtin_clrsbl(x);
}

Clang generates really bad code on several backends (x64, x86, RISCV-32, RISCV-64), such as this (on x64):

  mov eax, 64
  add eax, -1

or (on RISCV-32):

  addi a0, zero, 32
  addi a0, a0, -1

This bug, however, is not present on certain backends such as WebAssembly or PowerPC 64 (both little- and big-endian).

I looked at the emitted LLVM IR, and I think this is probably an optimization problem related specifically to @llvm.ctlz: it looks like ctlz is being expanded after the passes that would have eliminated the bad code.

Examples of this across several backends: https://godbolt.org/z/pbBKgD

I filed this as an x86 backend bug because I do not know how to file a bug against several backends at the same time.

rotateright commented 5 years ago

For reference, the underlying CGP transform was: https://reviews.llvm.org/D13297

And cttz/ctlz were added/discussed here: https://reviews.llvm.org/D14630 https://reviews.llvm.org/D7554

We may want to adjust the TLI hook based on cmov existence or other target features.

topperc commented 5 years ago

Ignore what I said previously. The problem for X86 is that we despeculate cttz/ctlz into control flow during CodeGenPrepare and are unable to recover it. I wonder if we should keep cttz/ctlz around if we at least have cmov to handle the zero case somewhat cheaply.

llvmbot commented 1 year ago

@llvm/issue-subscribers-clang-codegen

RKSimon commented 1 year ago

If we can prove that the input to the lzcnt is never zero, then the despeculation is avoided. Should we expand __builtin_clrsbl the way GCC does?

int clrsbl(unsigned long x) {
    return __builtin_clrsbl(x);
}

int clrsbl_gcc(unsigned long x) {
    return __builtin_clzl(((x << 1) ^ ((long)x >> 63)) | 1);
}

define i32 @clrsbl(i64 %x)  {
entry:
  %x.lobit = ashr i64 %x, 63
  %0 = xor i64 %x.lobit, %x
  %1 = tail call i64 @llvm.ctlz.i64(i64 %0, i1 false)
  %2 = trunc i64 %1 to i32
  %cast = add nsw i32 %2, -1
  ret i32 %cast
}

define i32 @clrsbl_gcc(i64 noundef %x) {
entry:
  %shl = shl i64 %x, 1
  %shr = ashr i64 %x, 63
  %xor = xor i64 %shl, %shr
  %or = or i64 %xor, 1
  %0 = tail call i64 @llvm.ctlz.i64(i64 %or, i1 true)
  %cast = trunc i64 %0 to i32
  ret i32 %cast
}

declare i64 @llvm.ctlz.i64(i64, i1 immarg)

clrsbl(unsigned long):                             # @clrsbl(unsigned long)
        movq    %rdi, %rax
        sarq    $63, %rax
        xorq    %rdi, %rax
        je      .LBB0_1
        bsrq    %rax, %rax
        xorq    $63, %rax
        decl    %eax
        retq
.LBB0_1:
        movl    $64, %eax
        decl    %eax
        retq
clrsbl_gcc(unsigned long):                        # @clrsbl_gcc(unsigned long)
        leaq    (%rdi,%rdi), %rax
        sarq    $63, %rdi
        xorq    %rax, %rdi
        orq     $1, %rdi
        bsrq    %rdi, %rax
        xorl    $63, %eax
        retq

Although the lzcnt codegen is slightly worse:

clrsbl(unsigned long):                             # @clrsbl(unsigned long)
        movq    %rdi, %rax
        sarq    $63, %rax
        xorq    %rdi, %rax
        lzcntq  %rax, %rax
        decl    %eax
        retq
clrsbl_gcc(unsigned long):                        # @clrsbl_gcc(unsigned long)
        leaq    (%rdi,%rdi), %rax
        sarq    $63, %rdi
        xorq    %rax, %rdi
        orq     $1, %rdi
        lzcntq  %rdi, %rax
        retq

Godbolt: https://godbolt.org/z/f5bnjGdqY Alive2: https://alive2.llvm.org/ce/z/RyCfKa

GabrielRavier commented 1 year ago

Looks like this has been fixed at some point since this bug report was made.

MaskRay commented 1 year ago
int clrsbl(unsigned long x)
{
    return __builtin_clrsbl(x);
}

GCC 12 avoids the __clrsbdi2 libcall:

clrsbl(unsigned long):
  lea rax, [rdi+rdi]
  sar rdi, 63
  xor rax, rdi
  or rax, 1
  bsr rax, rax
  xor eax, 63
  ret

Clang trunk's code is not as good as GCC's:

clrsbl(unsigned long): # @clrsbl(unsigned long)
  mov rax, rdi
  sar rax, 63
  xor rax, rdi
  je .LBB0_1
  bsr rax, rax
  xor rax, 63
  dec eax
  ret
.LBB0_1:
  mov eax, 64
  dec eax
  ret
GabrielRavier commented 1 year ago

Ah, true, I was looking at code generation with -mlzcnt enabled, so I didn't notice that it was still emitting a branch when using bsr.