Validark opened 2 months ago
@Validark Can you explain what @foo is used for?
I wrote this code because I wanted to produce a mask with the same number of bits as in x, but all concentrated in the front. I was originally going to feed this, perhaps with some transformation, into a pdep instruction to turn off certain bits of x, but I ended up needing to preserve some other bits, which made me shift strategies. I think the idea is still pretty general-purpose, though. The idea is:
-1 << (k + 1) => -2 << k
I think this would benefit code in the wild.
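As a concrete instance of the rewrite (my example, not from the issue), here is a minimal Zig sketch where both sides produce 0b1111_0000 on a u8:

const std = @import("std");

test "-1 << (k + 1) and -2 << k agree for k = 3 on u8" {
    const k: u3 = 3;
    const lhs = ~@as(u8, 0) << (k + 1); // 0xFF << 4 == 0b1111_0000
    const rhs = ~@as(u8, 1) << k; // 0xFE << 3 == 0b1111_0000
    try std.testing.expectEqual(@as(u8, 0b1111_0000), lhs);
    try std.testing.expectEqual(lhs, rhs);
}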
-1 << (k + 1) => -2 << k
LLVM doesn't fold this pattern since it cannot prove k + 1 < bitwidth (i.e., @popCount(~m & (m << 1)) < bitwidth - 1): https://alive2.llvm.org/ce/z/FfYEmg
The pattern @popCount(~m & (m << 1)) does not seem to exist in real-world code :(
-1 << (k + 1) => -2 << k
LLVM doesn't fold this pattern since it cannot prove k + 1 < bitwidth (i.e., @popCount(~m & (m << 1)) < bitwidth - 1): https://alive2.llvm.org/ce/z/FfYEmg
Did you realize you were using i16s? For i16, the maximum popCount of m & ~(m << 1), ~m & (m << 1), and m & (~m << 1), as well as the versions with << turned to >>, is 8. And LLVM is able to prove this: https://alive2.llvm.org/ce/z/oNtS3G
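Since the whole u16 domain is small, that bound can also be brute-forced. A minimal sketch (mine, not from the thread):

const std = @import("std");

test "for u16, the edge-pattern popcounts never exceed 8" {
    for (0..1 << 16) |i| {
        const m: u16 = @intCast(i);
        // Each pattern marks at most one bit per maximal run of 1s in m,
        // and a 16-bit word holds at most 8 such runs.
        try std.testing.expect(@popCount(m & ~(m << 1)) <= 8);
        try std.testing.expect(@popCount(~m & (m << 1)) <= 8);
        try std.testing.expect(@popCount(m & (~m << 1)) <= 8);
    }
}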
The pattern @popCount(~m & (m << 1)) does not seem to exist in real-world code :(
The significance of the pattern is just that all patterns that look like it by definition have a popcount of up to half of the bitstring length. I don't think that this specific pattern is what you should search for, it's just one example.
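A quick randomized spot-check of that half-length bound at 64 bits, as a minimal sketch (mine; it assumes a recent Zig where the default PRNG is std.Random.DefaultPrng):

const std = @import("std");

test "popcount of ~m & (m << 1) stays at or below 32 for u64" {
    var prng = std.Random.DefaultPrng.init(0);
    const random = prng.random();
    for (0..10_000) |_| {
        const m = random.int(u64);
        // At most one marked bit per maximal run of 1s, and a 64-bit word
        // holds at most 32 such runs.
        try std.testing.expect(@popCount(~m & (m << 1)) <= 32);
    }
}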
Also, even setting aside the pattern, when I have -1 << @intCast(k + 1), I am guaranteeing k + 1 is in the range [0, BIT_LEN), or rather [1, BIT_LEN) if we look at that +1. Hence the @intCast() call which is asserting that it is a valid u6 for shifting a u64 bitstring. Therefore this transformation should be safe. (Note that overflow is guaranteed not to occur, both by the possible range from the popcount and by the fact that @popCount returns a u7, which is meant to hold the maximum popcount of 64; adding 1 therefore cannot overflow a u7, and my @intCast asserts the resulting value will be 63 or less. Therefore, we can safely shift the range of k + 1 => [1, 63] to k => [0, 62] to do this optimization, even without the knowledge that the popcount is at most 32 for i64s.)
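That range argument is easy to check exhaustively. A minimal sketch (mine, not from the thread) walking every legal k:

const std = @import("std");

test "-1 << (k + 1) equals -2 << k for k in [0, 62]" {
    for (0..63) |i| {
        const k: u6 = @intCast(i);
        const lhs = ~@as(u64, 0) << (k + 1); // -1 << (k + 1), k + 1 in [1, 63]
        const rhs = ~@as(u64, 1) << k; // -2 << k
        try std.testing.expectEqual(lhs, rhs);
    }
}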
And LLVM is able to prove this: https://alive2.llvm.org/ce/z/oNtS3G
It is proven by Z3, not LLVM.
I don't think that this specific pattern is what you should search for, it's just one example.
As I said before, we can address this issue by proving @popCount(~m & (m << 1)) < bitwidth - 1. But this pattern doesn't exist in real-world code. So I think it is not interesting.
Hence the @intCast() call which is asserting that it is a valid u6 for shifting a u64 bitstring.
FWIW, this is not at all obvious for anyone who is not very familiar with zig. It's also not clear how zig is conveying this information to LLVM, if it is doing so at all.
Please, can you avoid using Zig when reporting LLVM issues, if possible? If you can't write LLVM IR directly, any of C, C++ and Rust is fine to use, as these are languages that are easy to understand for LLVM developers, and we know how to interact with the frontend to obtain LLVM IR at different optimization stages.
As I said before, we can address this issue by proving @popCount(~m & (m << 1)) < bitwidth - 1. But this pattern doesn't exist in real-world code. So I think it is not interesting.
I agree.
I think this issue would be solved if a) zig lowers the @intCast to trunc nuw and b) LLVM elides the zext (trunc nuw) instead of generating an and? Without the and, the shl + add nuw pattern should fold fine.
b) will be fixed when https://github.com/llvm/llvm-project/pull/88609 lands; I don't know what zig does about a), as I don't know how to get the pre-optimization IR for this compiler.
FWIW, this is not at all obvious for anyone who is not very familiar with zig. It's also not clear how zig is conveying this information to LLVM, if it is doing so at all.
Please, can you avoid using Zig when reporting LLVM issues, if possible? If you can't write LLVM IR directly, any of C, C++ and Rust is fine to use, as these are languages that are easy to understand for LLVM developers, and we know how to interact with the frontend to obtain LLVM IR at different optimization stages.
You can get Zig to emit the LLVM IR directly like so:
zig build-obj ./src/llvm_code.zig -O ReleaseFast -target x86_64-linux -mcpu znver4 -femit-llvm-ir -fstrip
Gives:
; ModuleID = 'BitcodeBuffer'
source_filename = "llvm_code"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-musl"
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable
define dso_local i64 @foo(i64 %0) local_unnamed_addr #0 {
%2 = xor i64 %0, -1
%3 = shl i64 %0, 1
%4 = and i64 %3, %2
%5 = tail call i64 @llvm.ctpop.i64(i64 %4), !range !0
%6 = add nuw nsw i64 %5, 1
%7 = and i64 %6, 63
%notmask = shl nsw i64 -1, %7
%8 = xor i64 %notmask, -1
ret i64 %8
}
; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i64 @llvm.ctpop.i64(i64) #1
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable
define dso_local i64 @bar(i64 %0) local_unnamed_addr #0 {
%2 = xor i64 %0, -1
%3 = shl i64 %0, 1
%4 = and i64 %3, %2
%5 = tail call i64 @llvm.ctpop.i64(i64 %4), !range !0
%6 = shl i64 -2, %5
%7 = xor i64 %6, -1
ret i64 %7
}
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable
define dso_local i64 @food(i64 %0) local_unnamed_addr #0 {
%2 = tail call i64 @foo(i64 %0) #0
ret i64 %2
}
attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable "frame-pointer"="none" "target-cpu"="znver4" "target-features"="-16bit-mode,-32bit-mode,-3dnow,-3dnowa,+64bit,+adx,+aes,+allow-light-256-bit,-amx-bf16,-amx-complex,-amx-fp16,-amx-int8,-amx-tile,+avx,-avx10.1-256,-avx10.1-512,+avx2,+avx512bf16,+avx512bitalg,+avx512bw,+avx512cd,+avx512dq,-avx512er,+avx512f,-avx512fp16,+avx512ifma,-avx512pf,+avx512vbmi,+avx512vbmi2,+avx512vl,+avx512vnni,-avx512vp2intersect,+avx512vpopcntdq,-avxifma,-avxneconvert,-avxvnni,-avxvnniint16,-avxvnniint8,+bmi,+bmi2,+branchfusion,-ccmp,-cf,-cldemote,+clflushopt,+clwb,+clzero,+cmov,-cmpccxadd,+crc32,+cx16,+cx8,-egpr,-enqcmd,-ermsb,+evex512,+f16c,-false-deps-getmant,-false-deps-lzcnt-tzcnt,-false-deps-mulc,-false-deps-mullq,-false-deps-perm,-false-deps-popcnt,-false-deps-range,-fast-11bytenop,+fast-15bytenop,-fast-7bytenop,+fast-bextr,-fast-gather,-fast-hops,+fast-lzcnt,+fast-movbe,+fast-scalar-fsqrt,+fast-scalar-shift-masks,-fast-shld-rotate,-fast-variable-crosslane-shuffle,+fast-variable-perlane-shuffle,+fast-vector-fsqrt,-fast-vector-shift-masks,-faster-shift-than-shuffle,+fma,-fma4,+fsgsbase,+fsrm,+fxsr,+gfni,-harden-sls-ijmp,-harden-sls-ret,-hreset,-idivl-to-divb,-idivq-to-divl,+invpcid,-kl,-lea-sp,-lea-uses-ag,-lvi-cfi,-lvi-load-hardening,-lwp,+lzcnt,+macrofusion,+mmx,+movbe,-movdir64b,-movdiri,+mwaitx,-ndd,-no-bypass-delay,-no-bypass-delay-blend,-no-bypass-delay-mov,-no-bypass-delay-shuffle,+nopl,-pad-short-functions,+pclmul,-pconfig,+pku,+popcnt,-ppx,-prefer-128-bit,-prefer-256-bit,-prefer-mask-registers,-prefer-movmsk-over-vtest,-prefer-no-gather,-prefer-no-scatter,-prefetchi,-prefetchwt1,+prfchw,-ptwrite,-push2pop2,-raoint,+rdpid,+rdpru,+rdrnd,+rdseed,-retpoline,-retpoline-external-thunk,-retpoline-indirect-branches,-retpoline-indirect-calls,-rtm,+sahf,+sbb-dep-breaking,-serialize,-seses,-sgx,+sha,-sha512,+shstk,-slow-3ops-lea,-slow-incdec,-slow-lea,-slow-pmaddwd,-slow-pmulld,+slow-shld,-slow-two-mem-ops,-slow-unaligned-mem-16,-slow-unaligned-mem-32,-sm3,-sm4,-soft-float,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+sse4a,-sse-unaligned-mem,+ssse3,-tagged-globals,-tbm,-tsxldtrk,-tuning-fast-imm-vector-shift,-uintr,-use-glm-div-sqrt-costs,-use-slm-arith-costs,-usermsr,+vaes,+vpclmulqdq,+vzeroupper,-waitpkg,+wbnoinvd,-widekl,+x87,-xop,+xsave,+xsavec,+xsaveopt,+xsaves" }
attributes #1 = { mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none) }
!0 = !{i64 0, i64 64}
Let's say I have this function:
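The code block from the original post did not survive; what follows is a hedged reconstruction that matches the @foo IR dump above (the export and the name foo are assumptions):

export fn foo(x: u64) u64 {
    // Mark each 0-bit that sits directly above a 1-bit, count those edges,
    // then build a mask of the low (count + 1) bits.
    const k = @popCount(~x & (x << 1));
    return ~(~@as(u64, 0) << @intCast(k + 1));
}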
LLVM handily optimizes it to the equivalent of this:
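This block was lost as well. Judging from the @foo IR above, where the cast shows up as and i64 %6, 63 and the + 1 survives as a separate add, the optimized form is roughly this sketch (foo_optimized is my name for it):

fn foo_optimized(x: u64) u64 {
    const k: u64 = @popCount(~x & (x << 1));
    // The @intCast became `and %6, 63` (a truncation) in the emitted IR; that
    // masking is what keeps LLVM from proving k + 1 < 64 and blocks the
    // -1 << (k + 1) => -2 << k fold discussed above.
    return ~(~@as(u64, 0) << @as(u6, @truncate(k + 1)));
}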
However, we can do slightly better by moving the +1 into ~@as(u64, 0) by pre-shifting it by 1. Since ~@as(u64, 0) << 1 is ~@as(u64, 1), we get:
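The block that followed was also lost; a hedged reconstruction matching the @bar IR dump above, where the shifted constant is -2, i.e. ~@as(u64, 1):

export fn bar(x: u64) u64 {
    const k = @popCount(~x & (x << 1));
    // ~@as(u64, 1) is -2; ~1 << k == -1 << (k + 1), saving the increment.
    return ~(~@as(u64, 1) << @intCast(k));
}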
Here is the assembly version:
[assembly listing lost in extraction; see the Godbolt link below]
This optimization should be applicable across architectures. (Godbolt link)