GabrielRavier opened this issue 5 years ago
For reference, the underlying CGP transform was: https://reviews.llvm.org/D13297
And cttz/ctlz were added/discussed here: https://reviews.llvm.org/D14630 https://reviews.llvm.org/D7554
We may want to adjust the TLI hook based on cmov existence or other target features.
Ignore what I said previously. The problem for X86 is that we despeculate cttz/ctlz into control flow during CodeGenPrepare and are unable to recover it. I wonder if we should keep cttz/ctlz around if we at least have cmov to somewhat cheaply handle the zero case.
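In C terms, the despeculated form looks roughly like this (a sketch of the transform's effect only; the helper name is invented here and a 64-bit unsigned long is assumed):

/* Sketch: CodeGenPrepare rewrites a speculatable ctlz into an explicit
 * zero check plus the "zero is poison" form, because bsr leaves its
 * destination undefined for a zero input. */
unsigned long ctlz64_despeculated(unsigned long x) {
    if (x == 0)
        return 64;               /* zero case handled via a branch... */
    return __builtin_clzl(x);    /* ...x != 0 here, so bsr is safe */
}

With cmov available, the zero case could instead be selected branchlessly, which is what keeping the intrinsic around would allow.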
If we can prove that the input to the lzcnt is never zero, then the despeculation is avoided. Should we expand __builtin_clrsbl like gcc does?
int clrsbl(unsigned long x) {
    return __builtin_clrsbl(x);
}

int clrsbl_gcc(unsigned long x) {
    return __builtin_clzl(((x << 1) ^ ((long)x >> 63)) | 1);
}
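The `| 1` in the gcc-style expansion is what makes the difference: it guarantees the ctlz operand is nonzero, so the intrinsic can be emitted with its zero-is-poison flag set (the `i1 true` in the IR below) and never needs despeculating. As a quick sanity check that the two forms agree (a sketch assuming a 64-bit long and gcc/clang's arithmetic right shift of negative values, as the snippets above already do):

#include <assert.h>
#include <limits.h>

int main(void) {
    /* A few representative values; clrsbl counts the redundant leading
     * sign bits, i.e. how many bits below the sign bit are equal to it. */
    unsigned long tests[] = { 0, 1, 2, 42, 0x8000000000000000UL, ULONG_MAX };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; ++i) {
        unsigned long x = tests[i];
        assert(__builtin_clrsbl(x) ==
               __builtin_clzl(((x << 1) ^ ((long)x >> 63)) | 1));
    }
    return 0;
}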
define i32 @clrsbl(i64 %x) {
entry:
  %x.lobit = ashr i64 %x, 63
  %0 = xor i64 %x.lobit, %x
  %1 = tail call i64 @llvm.ctlz.i64(i64 %0, i1 false)
  %2 = trunc i64 %1 to i32
  %cast = add nsw i32 %2, -1
  ret i32 %cast
}

define i32 @clrsbl_gcc(i64 noundef %x) {
entry:
  %shl = shl i64 %x, 1
  %shr = ashr i64 %x, 63
  %xor = xor i64 %shl, %shr
  %or = or i64 %xor, 1
  %0 = tail call i64 @llvm.ctlz.i64(i64 %or, i1 true)
  %cast = trunc i64 %0 to i32
  ret i32 %cast
}

declare i64 @llvm.ctlz.i64(i64, i1 immarg)
clrsbl(unsigned long):                  # @clrsbl(unsigned long)
        movq    %rdi, %rax
        sarq    $63, %rax
        xorq    %rdi, %rax
        je      .LBB0_1
        bsrq    %rax, %rax
        xorq    $63, %rax
        decl    %eax
        retq
.LBB0_1:
        movl    $64, %eax
        decl    %eax
        retq

clrsbl_gcc(unsigned long):              # @clrsbl_gcc(unsigned long)
        leaq    (%rdi,%rdi), %rax
        sarq    $63, %rdi
        xorq    %rax, %rdi
        orq     $1, %rdi
        bsrq    %rdi, %rax
        xorl    $63, %eax
        retq
Although the lzcnt codegen is slightly worse:
clrsbl(unsigned long):                  # @clrsbl(unsigned long)
        movq    %rdi, %rax
        sarq    $63, %rax
        xorq    %rdi, %rax
        lzcntq  %rax, %rax
        decl    %eax
        retq

clrsbl_gcc(unsigned long):              # @clrsbl_gcc(unsigned long)
        leaq    (%rdi,%rdi), %rax
        sarq    $63, %rdi
        xorq    %rax, %rdi
        orq     $1, %rdi
        lzcntq  %rdi, %rax
        retq
Godbolt: https://godbolt.org/z/f5bnjGdqY Alive2: https://alive2.llvm.org/ce/z/RyCfKa
Looks like this has been fixed at some point since this bug report was made.
int clrsbl(unsigned long x)
{
    return __builtin_clrsbl(x);
}
gcc 12 avoids __clrsbdi2
clrsbl(unsigned long):
        lea     rax, [rdi+rdi]
        sar     rdi, 63
        xor     rax, rdi
        or      rax, 1
        bsr     rax, rax
        xor     eax, 63
        ret
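For context, __clrsbdi2 is the libgcc fallback routine that older GCC emitted a call to here. A minimal reference implementation of its semantics might look like this (a sketch assuming a 64-bit long; this is not libgcc's actual code and the function name is invented):

/* clrsb: count how many bits directly below the sign bit are equal
 * to it.  Returns 63 for 0 and -1, and 0 for LONG_MIN. */
int clrsb64_ref(long x) {
    unsigned long u = (unsigned long)x;
    unsigned long sign = u >> 63;   /* the sign bit, as 0 or 1 */
    int n = 0;
    for (int i = 62; i >= 0 && ((u >> i) & 1) == sign; --i)
        ++n;
    return n;
}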
Clang trunk's code is not as good as GCC's:
clrsbl(unsigned long):                  # @clrsbl(unsigned long)
        mov     rax, rdi
        sar     rax, 63
        xor     rax, rdi
        je      .LBB0_1
        bsr     rax, rax
        xor     rax, 63
        dec     eax
        ret
.LBB0_1:
        mov     eax, 64
        dec     eax
        ret
Ah, true, I was looking at code generation with -mlzcnt active, so I didn't see it was still doing a branch when using bsr.
Extended Description
When compiling the clrsbl function shown above, Clang generates really bad code on several backends (x64, x86, RISCV-32, RISCV-64); see the Godbolt link below for the x64 and RISCV-32 output.
This bug, however, is not present on certain backends such as WebAssembly or PowerPC 64 (both little-endian and big-endian).
I looked at the emitted LLVM IR, and I think this is probably an optimization problem related specifically to @llvm.ctlz: it looks like ctlz is being expanded after the passes that would eliminate the bad code.
Link to see examples of this across backends : https://godbolt.org/z/pbBKgD
I filed this as an x86 backend bug because I do not know how to file a bug for several backends at the same time.