Closed nlewycky closed 2 years ago
mentioned in issue llvm/llvm-bugzilla-archive#1792
mentioned in issue llvm/llvm-bugzilla-archive#17101
We are now able to recognize the following pattern. Since popcount recognition is an architecture-dependent optimization, it is enabled only if the underlying architecture has HW support; ScalarTargetTransformInfo::getPopcntHwSupport() serves as the interface between the LoopIdiomRecognize pass and the HW support query.
On x86-64, the popcount optimization is currently triggered if -msse4 is turned on (or other flags that imply SSE4). CodeGen will then emit the POPCNT instruction (introduced with SSE4.2).
According to http://www.strchr.com/crc32_popcnt, the POPCNT instruction is not as fast as the SSE3 implementation.
In the near future, we need to implement popcount HW support using SSE3 instructions.
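A quick way to see the new behavior (a sketch, not the pass's actual test case; the flag spellings assume clang on x86-64):

/* Build with e.g. "clang -O2 -msse4.2 -S" and look for a single
   popcnt instruction in the output instead of a loop.           */
int count_set_bits(unsigned long long a) {
  int c = 0;
  while (a) {
    c++;
    a &= a - 1;   /* clear the lowest set bit */
  }
  return c;
}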
int PopCnt(register BITBOARD a) {
  register int c = 0;
  while (a) {
    c++;
    a &= a - 1;
  }
  return (c);
}
I moved these ideas into lib/Target/README.txt
This requires SSE4 support: http://en.wikipedia.org/wiki/SSE4#SSE4.2
crafty is actually just using a simple loop for popcount on X86; I think it ends up being faster than __builtin_popcount because the popcount is usually small.
I'm not aware of any shipping CPU with SSE4.2; Penryn etc. only have 4.1.
If you're interested in matching these patterns, matching this one to llvm.popcnt would result in a huge speedup for crafty.
For crafty in particular, it actually has asm popcount/ctlz/cttz functions (if you pass in the right flags on X86), and the asm implementations are faster than what LLVM currently generates.
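For reference, a hedged sketch of the style of inline asm being referred to (the helper name and constraints are hypothetical, not copied from crafty's source):

/* Hypothetical x86 cttz-style helper using BSF; the result is
   undefined when word == 0, as with the hardware instruction. */
static inline int first_one(unsigned int word) {
  int index;
  __asm__("bsfl %1, %0" : "=r"(index) : "rm"(word));
  return index;
}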
Bug llvm/llvm-bugzilla-archive#1792 has been marked as a duplicate of this bug.
I just got this on IRC:
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}
with the question, "why doesn't loop-unroll unroll my loop?".
fwiw, 186.crafty contains this loop:
int PopCnt(register BITBOARD a) {
  register int c = 0;
  while (a) {
    c++;
    a &= a - 1;
  }
  return (c);
}
BITBOARD = unsigned long long
If you're interested in matching these patterns, matching this one to llvm.popcnt would result in a huge speedup for crafty.
-Chris
The right way to do this is to recognize the loop and turn it into a call to llvm.ctpop
I forgot to mention that this is also what the llvm.ctpop.* intrinsic computes. We should match the code to it better and, at least on x86, emit it better.
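For concreteness, a small sketch (not from the original report): __builtin_popcount already lowers to this intrinsic, so the remaining work is matching the open-coded loops above to the same form.

/* Compiles to a call to llvm.ctpop.i32 (and then to whatever the
   target does best for population count).                        */
unsigned countbits_builtin(unsigned v) {
  return __builtin_popcount(v);
}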
Extended Description
The common idiom for counting the number of bits set can be optimized.
unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1) {
    c += v & 1;
  }
  return c;
}
This can be rewritten as such:
unsigned countbits_fast(unsigned v) {
  unsigned c;
  for (c = 0; v; c++) {
    v &= v - 1; // clear the least significant bit set
  }
  return c;
}
There are other ways to rewrite the bit-counting code, but they involve other tradeoffs such as constant arrays. Also, the naive method may be written to start with either the high bit or the low bit for performance, if the programmer happens to know the distribution of the bits in advance; countbits_fast will equal or outperform either variant.
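For illustration (not part of the original report), one of the constant-array variants alluded to above, using a 4-bit lookup table:

/* Table-based variant: trades a small constant array for a fixed
   number of iterations, independent of how many bits are set.    */
unsigned countbits_table(unsigned v) {
  static const unsigned char nibble_bits[16] =
      { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 };
  unsigned c = 0;
  for (; v; v >>= 4)
    c += nibble_bits[v & 0xF];
  return c;
}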
With stock GCC, countbits_slow has a 63% longer run-time than countbits_fast in a tight loop, while with llvm-gcc it has a 45% longer run-time. (Sadly, llvm-gcc is slower than GCC on either function on x86 Linux, but that's another bug.)