On targets where there is no direct count-trailing-zeros instruction, some insert a branch on 0 before running the code that calculates the trailing-zeros count. In my code, I want to commandeer this branch for added efficiency, however, LLVM gets confused when there is a bitwise negation present.
Here is some simplified Zig code which reproduces the problem (godbolt link here).
export fn foo(bitstring: u64, sum: u64) u64 {
var acc: u64 = sum;
const inverted_bitstring = ~bitstring;
// Optimization: If @ctz implicitly inserts `if (inverted_bitstring == 0)`, make it more efficient.
if (inverted_bitstring == 0) {
acc += 64;
return acc;
}
const str_len: u8 = @ctz(inverted_bitstring);
acc += str_len;
return acc;
}
On x86_64 (default target) we get:
foo:
cmp rdi, -1
je .LBB0_1
xor rdi, -1
je .LBB0_3
rep bsf rax, rdi
add rax, rsi
ret
.LBB0_1:
add rsi, 64
mov rax, rsi
ret
.LBB0_3:
mov eax, 64
add rax, rsi
ret
On riscv64 (default target) we get the same branch-on-equal instruction twice.
foo:
li a2, -1
beq a0, a2, .LBB0_3
beq a0, a2, .LBB0_4
...
...
.LBB0_3:
addi a0, a1, 64
ret
.LBB0_4:
li a0, 64
add a0, a0, a1
ret
Also does the same thing on s390x, sparc64, and ve (default targets). If you would like to see some less-simplified code where I actually tried to use this idea, here is a godbolt link to a less-trimmed version of my actual code (go to line 461 in the source code).
On targets where there is no direct count-trailing-zeros instruction, some insert a branch on 0 before running the code that calculates the trailing-zeros count. In my code, I want to commandeer this branch for added efficiency, however, LLVM gets confused when there is a bitwise negation present.
Here is some simplified Zig code which reproduces the problem (godbolt link here).
On x86_64 (default target) we get:
On riscv64 (default target) we get the same branch-on-equal instruction twice.
Also does the same thing on s390x, sparc64, and ve (default targets). If you would like to see some less-simplified code where I actually tried to use this idea, here is a godbolt link to a less-trimmed version of my actual code (go to line 461 in the source code).
Thank you to all LLVM contributors!
‒ Validark