Closed nlewycky closed 2 years ago
mentioned in issue llvm/llvm-bugzilla-archive#1792
mentioned in issue llvm/llvm-bugzilla-archive#17101
We are now able to recognize the following pattern. Since popcount recognition is an architecture-dependent optimization, it is enabled only if the underlying architecture has HW support; ScalarTargetTransformInfo::getPopcntHwSupport() serves as the interface between the LoopIdiomRecognize pass and the HW support query.
On x86-64, the popcount optimization is currently triggered if -msse4 is turned on (or other flags that imply SSE4). CodeGen will then emit the POPCNT instruction (introduced with SSE4.2).
According to http://www.strchr.com/crc32_popcnt, the POPCNT instruction is not as fast as the SSE3 implementation.
In the near future, we need to implement popcount HW support using SSE3 instructions.
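A quick way to see the new behavior (a sketch, not the pass's actual test case; the flag spellings assume clang on x86-64):

/* Build with e.g. "clang -O2 -msse4.2 -S" and look for a single
   popcnt instruction in the output instead of a loop.           */
int count_set_bits(unsigned long long a) {
  int c = 0;
  while (a) {
    c++;
    a &= a - 1;   /* clear the lowest set bit */
  }
  return c;
}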
int PopCnt(register BITBOARD a) {
  register int c = 0;
  while (a) {
    c++;
    a &= a - 1;
  }
  return (c);
}
I moved these ideas into lib/Target/README.txt
This requires SSE4 support: http://en.wikipedia.org/wiki/SSE4#SSE4.2
crafty is actually just using a simple loop for popcount on X86; I think it ends up being faster than __builtin_popcount because the popcount is usually small.
I'm not aware of any shipping CPU with SSE4.2; Penryn etc. only have 4.1.
If you're interested in matching these patterns, matching this one to llvm.popcnt would result in a huge speedup for crafty.
For crafty in particular, it actually has asm popcount/ctlz/cttz functions (if you pass in the right flags on X86), and the asm implementations are faster than what LLVM currently generates.
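For reference, a hedged sketch of the style of inline asm being referred to (the helper name and constraints are hypothetical, not copied from crafty's source):

/* Hypothetical x86 cttz-style helper using BSF; the result is
   undefined when word == 0, as with the hardware instruction. */
static inline int first_one(unsigned int word) {
  int index;
  __asm__("bsfl %1, %0" : "=r"(index) : "rm"(word));
  return index;
}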
Bug llvm/llvm-bugzilla-archive#1792 has been marked as a duplicate of this bug.
I just got this on IRC:
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}
with the question, "why doesn't loop-unroll unroll my loop?".
fwiw, 186.crafty contains this loop:
int PopCnt(register BITBOARD a) {
  register int c = 0;
  while (a) {
    c++;
    a &= a - 1;
  }
  return (c);
}
BITBOARD = unsigned long long
If you're interested in matching these patterns, matching this one to llvm.popcnt would result in a huge speedup for crafty.
-Chris
The right way to do this is to recognize the loop and turn it into a call to llvm.ctpop
I forgot to mention that this is also what the llvm.ctpop.* intrinsic computes. We should match the code to it better and, at least on x86, emit it better.
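For concreteness, a small sketch (not from the original report): __builtin_popcount already lowers to this intrinsic, so the remaining work is matching the open-coded loops above to the same form.

/* Compiles to a call to llvm.ctpop.i32 (and then to whatever the
   target does best for population count).                        */
unsigned countbits_builtin(unsigned v) {
  return __builtin_popcount(v);
}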
Extended Description
The common idiom for counting the number of bits set can be optimized.
unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1) {
    c += v & 1;
  }
  return c;
}
This can be rewritten as such:
unsigned countbits_fast(unsigned v) {
  unsigned c;
  for (c = 0; v; c++) {
    v &= v - 1; // clear the least significant bit set
  }
  return c;
}
There are other ways to rewrite the bit-counting code, but they involve other tradeoffs such as constant arrays. Also, the naive method may be written to start with either the high bit or the low bit for performance, if the programmer happens to know the distribution of the bits in advance; countbits_fast will equal or outperform either variant.
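For illustration (not part of the original report), one of the constant-array variants alluded to above, using a 4-bit lookup table:

/* Table-based variant: trades a small constant array for a fixed
   number of iterations, independent of how many bits are set.    */
unsigned countbits_table(unsigned v) {
  static const unsigned char nibble_bits[16] =
      { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 };
  unsigned c = 0;
  for (; v; v >>= 4)
    c += nibble_bits[v & 0xF];
  return c;
}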
With stock GCC, countbits_slow has a 63% longer run-time than countbits_fast in a tight loop, while with llvm-gcc it has a 45% longer run-time. (Sadly, llvm-gcc is slower than GCC on either function on x86 Linux, but that's another bug.)