llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

popcount 8-bit optimization opportunity (without HW instructions) #79823

Open Explorer09 opened 7 months ago

Explorer09 commented 7 months ago

When compiling for an x86-64 processor without the POPCNT instruction, Clang generates inline code for __builtin_popcount. For counting bits of an 8-bit-only input, I have a smaller version than what Clang currently generates. Are LLVM/Clang developers interested in incorporating this?

#include <stdint.h>

unsigned int popCount8(uint8_t x) {
    return (unsigned int)__builtin_popcount(x);
}
unsigned int popCount8_b(uint8_t x) {
    // The algorithm is specific to 8-bit input and avoids a 64-bit register, for smaller code.
    uint32_t n = (uint32_t)(x * 0x08040201U);
    n = (uint32_t)(((n >> 3) & 0x11111111U) * 0x11111111U) >> 28;
    return n;
}

Clang's assembly output (-Os -mno-popcnt)

popCount8:
 mov    eax,edi
 shr    al,1
 and    al,0x55
 sub    dil,al
 mov    eax,edi
 and    al,0x33
 shr    dil,0x2
 and    dil,0x33
 add    dil,al
 mov    eax,edi
 shr    al,0x4
 add    al,dil
 and    al,0xf
 movzx  eax,al
 ret
popCount8_b:
 imul   eax,edi,0x8040201
 shr    eax,0x3
 and    eax,0x11111111
 imul   eax,eax,0x11111111
 shr    eax,0x1c
 ret
RKSimon commented 7 months ago

Sure - this could be added as a special case in TargetLowering::expandCTPOP

Explorer09 commented 7 months ago

If you guys are interested, here is the 7-bit variant, adapted from HAKMEM Item 167. Original credit goes to Richard Schroeppel and others.

assert(x <= 127);
// Both work; pick whichever generates smaller assembly:
(uint32_t)(((x * 0x4081) & 0x049249U) * (0x049249U << 11)) >> 29;
(uint32_t)(((x * 0x204081) & 0x11111111U) * 0x11111111U) >> 28;
RKSimon commented 7 months ago

Alive2: https://alive2.llvm.org/ce/z/sEUbYE

Explorer09 commented 7 months ago

Wow. You implemented that so quickly!

Just curious: does LLVM IR support integer bit widths other than the usual ones (i.e. anything besides i8, i16, i32 or i64)?

If not, the 7-bit and 14-bit variants of the algorithm might not be worth including in LLVM. This way of counting bits is not worthwhile for 16-bit integers or above, so I'm okay with only the 8-bit version being included.

RKSimon commented 7 months ago

It's still a draft, but the codegen was pretty straightforward.

Most targets only support the traditional i8/i16/i32/i64 integer types, and smaller integers are extended to these types.

So an i3/i7/whatever variant would be extended to i8 (or wider) and the ctpop performed at that width. What we could do is determine whether the upper bits are known to be zero and use the i7 special case you proposed.
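
For example (an illustrative pattern, not taken from real code): masking before the builtin makes the upper bits provably zero, so the ctpop could be narrowed and lowered with the i7 special case.

#include <stdint.h>

/* Hypothetical example: the "& 0x7F" makes bits 7 and up known-zero,
   so the popcount could be performed at 7-bit width. */
unsigned int popcount_low7(uint8_t x)
{
    return (unsigned int)__builtin_popcount(x & 0x7F);
}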

Explorer09 commented 7 months ago

I'll nickname the method used in the proposed algorithm the "multiply-mask-multiply" method, and the modulo variant "multiply-mask-modulo". (The "MMM" acronym sounds catchy, but I'm not using it yet.)

The method is not totally original; Richard Schroeppel, Sean Eron Anderson and Bruce Dawson (see credits below) each discovered one variant of it.

I took my time and explored other possible variants of the "multiply-mask-multiply" method for counting bits, and I think I have a complete list. Enjoy reading this:

List of optimal bit counting algorithms

Sorted by number of bits of the input (hereafter x).
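
Every variant below can be checked exhaustively against a reference. A minimal checker (my sketch, assuming the variant under test is wrapped to take a uint32_t) would be:

#include <assert.h>
#include <stdint.h>

/* Reference bit count, for exhaustively verifying the variants below. */
static unsigned int popcount_ref(uint32_t x)
{
    unsigned int n = 0;
    for (; x != 0; x >>= 1)
        n += x & 1u;
    return n;
}

/* Check a candidate over all inputs of the given width (bits < 32). */
static void check_popcount(unsigned int (*f)(uint32_t), unsigned int bits)
{
    for (uint32_t x = 0; x < (UINT32_C(1) << bits); x++)
        assert(f(x) == popcount_ref(x));
}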

2-bit integer bit count

Perhaps nothing can beat this (it's too trivial):

x - (x >> 1);
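/* e.g. x = 0b11: 3 - 1 = 2;  x = 0b10: 2 - 1 = 1 */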

3-bit integer bit count

(a.) A 16-bit constant lookup might generate the smallest code:

/* 0xE994U == 0b11'10'10'01'10'01'01'00: the eight 2-bit popcounts
   of 7, 6, 5, ..., 0 packed into one 16-bit constant */
(0xE994U >> (x * 2)) & 0x3;

(b.) (Update 2024-06-06) Credit for the original idea goes to @aqrit (Michael Howard) (see his comment below); I modified his code to produce this 3-bit version. It uses only the lower 4 bits of two registers, and no multiplication.

(x + (x & 0x3) + 0x2) >> 2;
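/* Why this works: writing x = 4a + 2b + c, x + (x & 0x3) = 4a + 4b + 2c,
   and adding 2 then shifting right by 2 yields exactly a + b + c. */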

(c.) The "multiply-mask-multiply" method would use only 8-bit registers but usually results in larger code, and thus not recommended:

(uint8_t)(((x * 0x9) & 0x55) * 0x55) >> 6;

4-bit integer bit count

(a.) If 64-bit registers are available, a 64-bit constant lookup might generate the smallest code:

/* 0x8DA691691448ULL == octal 04332322132212110ULL */
(0x8DA691691448ULL >> (x * 3)) & 0x7;
(0x4332322132212110ULL >> (x * 4)) & 0x7;

(b.) Or, alternatively, use a 16-byte lookup table such as this:

static const unsigned char bit_count_table[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};
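
The lookup itself is then a single indexed load; a trivial wrapper (named here to match the other variants) could be:

unsigned int bit_count_4_bits_b(uint8_t x)
{
    return bit_count_table[x & 0xF];
}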

(c.) Use the 5-bit version of the "multiply-mask-multiply" method (see below).

(d.) The "parallel-addition" method of counting a 4-bit integer can work with only 8-bit registers, but usually results in larger code, and thus not recommended:

unsigned int bit_count_4_bits_d(uint8_t x)
{
    uint8_t tmp = x - ((x >> 1) & 0x5);
    return (tmp & 0x3) + (tmp >> 2);
}

5-bit integer bit count

Both "multiply-mask-multiply" and "multiply-mask-modulo" are available. They use only 16-bit registers. The "multiply" variant is preferred if the modulo operation is expensive for your processor or the compiler would optimize divisions (e.g. "-O2" mode).

/* 0x0421U == (1U | (1U << 5) | (1U << 10))
   0x1249U == octal 011111U */
(uint16_t)(((x * 0x0421U) & 0x1249U) * (0x1249U << 1)) >> 13;
(uint16_t)((x * 0x0421U) & 0x1249U) % 7;
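/* Why the modulo works: the masked value is a sum of bit_k * 8^k, and
   8 ≡ 1 (mod 7), so the remainder equals the bit count (at most 5 < 7). */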

6-bit and 7-bit integer bit counts

The "multiply-mask-multiply" and "-modulo" methods require at least 32-bit registers. All these four variants would work, so pick one that yields the smallest assembly for your processor. The first "modulo 7" variant works with only 6 bits of x; others 7 bits of x.

/* 0x4081U == (1U | (1U << 7) | (1U << 14))
   0x049249UL == octal 01111111UL
   0x204081UL == (1U | (1U << 7) | (1U << 14) | (1UL << 21)) */

/* Works with only 6 bits of x */
(uint32_t)((x * 0x4081U) & 0x049249UL) % 7;
/* Works with 7 bits of x */
(uint32_t)(((x * 0x4081U) & 0x049249UL) * (0x049249UL << 11)) >> 29;
(uint32_t)(((x * 0x204081UL) & 0x11111111UL) * 0x11111111UL) >> 28;
(uint32_t)((x * 0x204081UL) & 0x01111111UL) % 15;

8-bit integer bit count

(a.) The "multiply-mask-multiply" and "-modulo" methods, using only 32-bit registers:

/* 0x08040201UL == (1U | (1U << 9) | (1UL << 18) | (1UL << 27)) */
(uint32_t)((((x * 0x08040201UL) >> 3) & 0x11111111UL) * 0x11111111UL) >> 28;
(uint32_t)(((x * 0x08040201UL) >> 3) & 0x11111111UL) % 15;

(b.) A hybrid approach, which performs bit addition in the first step, then uses "multiply-mask-multiply" (or "-modulo") on the quaternary (base-4) digits. This uses 16-bit registers.

unsigned int bit_count_8_bits_b(uint8_t x)
{
    uint16_t tmp = x - ((x >> 1) & 0x55);
    tmp = ((tmp * 0x0401U) >> 2) & 0x3333U;
    /* Or: return (tmp % 15); */
    return (uint16_t)(tmp * 0x1111U) >> 12;
}

(c.) The traditional "parallel-addition" method. Only 8-bit registers are needed. (Note the final step may be performed with either a "multiply-shift" or a modulo.)

unsigned int bit_count_8_bits_c(uint8_t x)
{
    uint8_t tmp = x - ((x >> 1) & 0x55);
    tmp = (tmp & 0x33) + ((tmp >> 2) & 0x33);
    /* Or: return (tmp % 15); */
    return (uint8_t)(tmp * 0x11) >> 4;
}

9-bit integer bit count

Original credit goes to Richard Schroeppel; from HAKMEM Item 167. This "multiply-mask-modulo" variant requires a 64-bit register.

/* <http://www.inwap.com/pdp10/hbaker/hakmem/hacks.html#item167>
   The first multiplier fits a 32-bit register, otherwise this brings
   no advantage over the 14-bit version below. */
(((uint64_t)x * 0x08040201ULL) & 0x0111111111ULL) % 15;

14-bit and 15-bit integer bit counts

(a.) Credit goes to Sean Eron Anderson and Bruce Dawson. From Bit Twiddling Hacks. The "modulo 15" variant works with only 14 bits of x; the multiply variant works with 15 bits of x.

/* <https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64>
   The version with modulo works with 14 bits of x only.
   0x200040008001ULL == (1U | (1U << 15) | (1UL << 30) | (1ULL << 45))
*/
((x * 0x200040008001ULL) & 0x0111111111111111ULL) % 15;

/* This version works with 15 bits of x */
(uint64_t)(((x * 0x200040008001ULL) & 0x1111111111111111ULL) \
 * 0x1111111111111111ULL) >> 60;

(b.) A hybrid approach, which performs bits addition in the first step, then uses "multiply-mask-multiply" (or "-modulo") on the quaternary (base-4) digits.

unsigned int bit_count_15_bits_b(uint16_t x)
{
    uint32_t tmp = x - ((x >> 1) & 0x5555U);
    tmp = (tmp * 0x4001UL) & 0x33333333UL;
    /* If we use "return (tmp % 15);" then up to 14-bits of x can
       be counted (one less than using the multiply). */
    return (uint32_t)(tmp * 0x11111111UL) >> 28;
}

16-bit integer bit count

(a.) Essentially (x & 1) + bit_count_15_bits(x >> 1):

(x & 1) + ((uint64_t)((((x >> 1) * 0x200040008001ULL) \
 & 0x1111111111111111ULL) * 0x1111111111111111ULL) >> 60);

(b.) (Update 2024-06-06) Credit for the original idea goes to @aqrit (Michael Howard) (see his comment below); I modified his code to produce this 16-bit version. It adds the number of bits in 5-bit groups (pentads), except for the last group, which is 6 bits, before multiplying.

unsigned int bit_count_16_bits_b(uint16_t x)
{
    uint16_t tmp = x - ((x >> 1) & 0x4A52U) + 0x2109U;
    tmp = (tmp & 0x18C6U) + ((tmp >> 3) & 0x18C6U);
    return (uint16_t)(tmp * 0x0421U) >> 11;
}

(c.) The "parallel-addition" method that is well written in many textbooks. (Note the final step may be performed with either a "multiply-shift" or a modulo.)

unsigned int bit_count_16_bits_c(uint16_t x)
{
    uint16_t tmp = x - ((x >> 1) & 0x5555U);
    tmp = (tmp & 0x3333U) + ((tmp >> 2) & 0x3333U);
    tmp = (tmp + (tmp >> 4)) & 0x0F0FU;
    /* Or: return (tmp % 255); */
    return (uint16_t)(tmp * 0x0101U) >> 8;
}

20-bit and 22-bit integer bit counts

These both use the hybrid approach: they perform bit addition in the first step, then use "multiply-mask-multiply" on the quaternary (base-4) digits:

unsigned int bit_count_20_bits(uint32_t x)
{
    /* 0x010000100001ULL == (1U | (1UL << 20) | (1ULL << 40))
       0x00C30C30C30C30C3ULL == octal 00003030303030303030303ULL
       0x0820820820820820ULL == octal 00040404040404040404040ULL */
    uint64_t tmp = x - ((x >> 1) & 0x055555UL);
    tmp = (tmp * 0x010000100001ULL) & 0x00C30C30C30C30C3ULL;
    return (uint64_t)(tmp * 0x0820820820820820ULL) >> 59;
}

unsigned int bit_count_22_bits(uint32_t x)
{
    /* 0x080000200001ULL == (1U | (1UL << 21) | (1ULL << 43)) */
    uint64_t tmp = x - ((x >> 1) & 0x155555UL);
    tmp = (tmp * 0x080000200001ULL) & 0x18618618618C30C3ULL;
    return (uint64_t)(tmp * 0x0820821041041041ULL) >> 59;
}

28-bit integer bit count

(a.) This stacks two applications of the "multiply-mask-modulo" bit_count_14_bits(), i.e. bit_count_14_bits(x & 0x3FFF) + bit_count_14_bits(x >> 14). The Bit Twiddling Hacks page mentions the 24-bit version, but it can be extended to 28 bits.

(((x & 0x3FFF) * 0x200040008001ULL) & 0x0111111111111111ULL) % 15 \
 + (((x >> 14) * 0x200040008001ULL) & 0x0111111111111111ULL) % 15;

(b.) This is a simplification of the "parallel-addition" counting code for a 32-bit integer. Read the 32-bit counting code below before trying to understand this one.

unsigned int bit_count_28_bits_b1(uint32_t x)
{
    uint32_t tmp = x - ((x >> 1) & 0x05555555UL);
    tmp = (tmp & 0x03333333UL) + ((tmp >> 2) & 0x03333333UL);
    tmp = (tmp * 0x0111U) & 0x0F00F00FUL;
    return (uint32_t)(tmp * 0x01001001UL) >> 24;
}

unsigned int bit_count_28_bits_b2(uint32_t x)
{
    uint32_t tmp = x - ((x >> 1) & 0x05555555UL);
    tmp = (tmp & 0x03333333UL) + ((tmp >> 2) & 0x03333333UL);
    tmp = (tmp * 0x11) & 0x0F0F0F0FUL;
    /* Or: return (tmp % 255); */
    return (uint32_t)(tmp * 0x01010101UL) >> 24;
}

30-bit integer bit count

This stacks two applications of the "multiply-mask-multiply" bit_count_15_bits(), i.e. bit_count_15_bits(x & 0x7FFF) + bit_count_15_bits(x >> 15).

((uint64_t)((((x & 0x7FFF) * 0x200040008001ULL) \
 & 0x1111111111111111ULL) * 0x1111111111111111ULL) >> 60) \
 + ((uint64_t)((((x >> 15) * 0x200040008001ULL) \
 & 0x1111111111111111ULL) * 0x1111111111111111ULL) >> 60);

The "multiply-mask-multiply" method will become inefficient for counting bits if the input exceeds 30 bits.

32-bit integer bit count

(a.) (Update 2024-06-06) This algorithm was proposed by @aqrit (Michael Howard) in the comment below. It adds the number of bits in 5-bit groups (pentads), except for the last group, which is 6 bits, before multiplying. The MSB (bit 31) needs to be stored in a separate variable and added to the count in the last step.

(The "+ (x >> 31)" part may be optimized out if you use only 31 bits of x.)

unsigned int bit_count_32_bits_a(uint32_t x)
{
    uint32_t tmp = (uint32_t)(x + (x & 0xB5AD6B5BUL)) + 0x21084212UL;
    tmp = (tmp & 0x18C6318CUL) + ((tmp >> 3) & 0x18C6318CUL);
    return ((uint32_t)(tmp * 0x02108421UL) >> 27) + (x >> 31);
}

(b.) The "parallel-addition" method that is well written in many textbooks:

unsigned int bit_count_32_bits_b(uint32_t x)
{
    uint32_t tmp = x - ((x >> 1) & 0x55555555UL);
    tmp = (tmp & 0x33333333UL) + ((tmp >> 2) & 0x33333333UL);
    tmp = (tmp + (tmp >> 4)) & 0x0F0F0F0FUL;
    /* Or: return (tmp % 255); */
    return (uint32_t)(tmp * 0x01010101UL) >> 24;
}

(c.) If the processor supports 64-bit registers and can load a 64-bit constant efficiently, a simplification can be performed, so the code becomes this:

unsigned int bit_count_32_bits_c(uint32_t x)
{
    uint64_t tmp = x - ((x >> 1) & 0x55555555UL);
    tmp = (tmp & 0x33333333UL) + ((tmp >> 2) & 0x33333333UL);
    tmp = (tmp * 0x0111U) & 0xF00F00F0UL;
    return (uint64_t)(tmp * 0x0010010010000000ULL) >> 56;
}

60-bit integer bit count

(a.) (Update 2024-06-06) This algorithm was proposed by @aqrit (Michael Howard) in the comment below. It adds the number of bits in 6-bit groups (sextets) before multiplying. A clever way of counting 3-bit groups is used.

unsigned int bit_count_60_bits_a(uint64_t x)
{
    /* 0x06DB6DB6DB6DB6DBULL == octal 00033333333333333333333ULL
       0x0492492492492492ULL == octal 00022222222222222222222ULL
       0x030C30C30C30C30CULL == octal 00014141414141414141414ULL
       0x0104104104104104ULL == octal 00004040404040404040404ULL */
    uint64_t mask1 = 0x06DB6DB6DB6DB6DBULL;
    uint64_t c1 = 0x0492492492492492ULL;
    uint64_t mask2 = 0x030C30C30C30C30CULL;
    uint64_t multiplier = 0x0104104104104104ULL;
    uint64_t tmp = x + (x & mask1) + c1;
    tmp = (tmp & mask2) + ((tmp >> 3) & mask2);
    return (uint64_t)(tmp * multiplier) >> 58;
}

(b.) A simplification of the counting code for a 64-bit integer. Read the 64-bit counting code below before trying to understand this one.

unsigned int bit_count_60_bits_b1(uint64_t x)
{
    uint64_t mask1 = 0x0555555555555555ULL;
    uint64_t mask2 = 0x0333333333333333ULL;
    uint64_t mask4 = 0x0F00F00F00F00F00ULL;
    uint64_t multiplier = 0x0001001001001001ULL;
    uint64_t tmp = x - ((x >> 1) & mask1);
    tmp = (tmp & mask2) + ((tmp >> 2) & mask2);
    tmp = (tmp * 0x0111U) & mask4;
    return (uint64_t)(tmp * multiplier) >> 56;
}

unsigned int bit_count_60_bits_b2(uint64_t x)
{
    uint64_t mask1 = 0x0555555555555555ULL;
    uint64_t mask2 = 0x0333333333333333ULL;
    uint64_t mask4 = 0x0F0F0F0F0F0F0F0FULL;
    uint64_t multiplier = 0x0101010101010101ULL;
    uint64_t tmp = x - ((x >> 1) & mask1);
    tmp = (tmp & mask2) + ((tmp >> 2) & mask2);
    tmp = (tmp * 0x11) & mask4;
    /* Or: return (tmp % 255); */
    return (uint64_t)(tmp * multiplier) >> 56;
}

64-bit, 128-bit and 255-bit integer bit counts

The "parallel-addition" method that is well written in many textbooks. Up to 255 bits can count this way.

The 255-bit variant is not shown here (very few compilers support uint128_t already, let alone 256-bit types), but basically you know how to do it (the 255-bit variant cannot use the modulo in the final step).

(256 bits would overflow the storage of 8-bit temporary results, but can workaround with (x & 1) + bit_count_255_bits(x >> 1))

unsigned int bit_count_64_bits(uint64_t x)
{
    uint64_t mask1 = 0x5555555555555555ULL;
    uint64_t mask2 = 0x3333333333333333ULL;
    uint64_t mask4 = 0x0F0F0F0F0F0F0F0FULL;
    uint64_t multiplier = 0x0101010101010101ULL;
    uint64_t tmp = x - ((x >> 1) & mask1);
    tmp = (tmp & mask2) + ((tmp >> 2) & mask2);
    tmp = (tmp + (tmp >> 4)) & mask4;
    /* Or: return (tmp % 255); */
    return (uint64_t)(tmp * multiplier) >> 56;
}

/* uint128_t and UINT128_MAX are not standard C; with GCC or Clang they
   can be provided like this: */
typedef unsigned __int128 uint128_t;
#define UINT128_MAX (~(uint128_t)0)

unsigned int bit_count_128_bits(uint128_t x)
{
    uint128_t mask1 = (UINT128_MAX / 0xFF) * 0x55;
    uint128_t mask2 = (UINT128_MAX / 0xFF) * 0x33;
    uint128_t mask4 = (UINT128_MAX / 0xFF) * 0x0F;
    uint128_t multiplier = (UINT128_MAX / 0xFF);
    uint128_t tmp = x - ((x >> 1) & mask1);
    tmp = (tmp & mask2) + ((tmp >> 2) & mask2);
    tmp = (tmp + (tmp >> 4)) & mask4;
    /* Or: return (tmp % 255); */
    return (uint128_t)(tmp * multiplier) >> 120;
}
RKSimon commented 7 months ago

@Explorer09 Thank you for the awesome breakdown - I think we need to consider what is the likelihood of finding these in real world source code. i8 (and the i16 case) are likely to turn up, but irregular integer widths are trickier to work with.

dtcxzyw commented 7 months ago

@Explorer09 Thank you for the awesome breakdown - I think we need to consider what is the likelihood of finding these in real world source code. i8 (and the i16 case) are likely to turn up, but irregular integer widths are trickier to work with.

llvm.ctpop.i4 is used by grpc/protobuf/qemu/redis.
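
For illustration (a hypothetical pattern, not taken from those projects), one way a narrow ctpop such as llvm.ctpop.i4 can arise is from a nibble mask that the optimizer can see:

unsigned int nibble_popcount(unsigned int x)
{
    /* Hypothetical: the "& 0xF" leaves only 4 unknown bits, so the
       popcount may be narrowed to 4-bit width during optimization. */
    return (unsigned int)__builtin_popcount(x & 0xF);
}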

Explorer09 commented 7 months ago

Just a comment: when I tested this 8-bit popcount algorithm, I expected that 32-bit ARM could benefit from it (when NEON instructions are not available), in addition to x86.

The constant 0x11111111 can be compactly encoded in Operand2 of a "MOV" instruction. But when I tested this in Compiler Explorer (godbolt.org), 32-bit Clang doesn't seem to use it (32-bit ARM GCC does). (Update: the compact encoding requires Thumb instruction support and the "-mthumb" compile option. Thanks to the people who dug up this info for me.)

I wonder if this is just my error, like a missing compile option or something.

; ARM GCC
popCount8_b:
        add     r0, r0, r0, lsl #9
        mov     r3, #286331153
        add     r0, r0, r0, lsl #18
        lsrs    r0, r0, #3
        and     r0, r0, #286331153
        muls    r0, r3, r0
        lsrs    r0, r0, #28
        bx      lr

; armv7-a clang
popCount8_b:
 ldr    r1, [pc, #20]   ; 44 <popCount8_b+0x1c>
 mul    r2, r0, r1
 ldr    r0, [pc, #16]   ; 48 <popCount8_b+0x20>
 and    r1, r0, r2, lsr #3
 mul    r2, r1, r0
 lsr    r0, r2, #28
 bx lr
 .word  0x08040201
 .word  0x11111111

; armv8-a clang
popCount8_b:
 mov    w8, #0x201
 and    w9, w0, #0xff
 movk   w8, #0x804, lsl #16
 mul    w8, w9, w8
 mov    w9, #0x11111111
 lsr    w8, w8, #3
 and    w8, w8, #0x11111111
 mul    w8, w8, w9
 lsr    w0, w8, #28
 ret

EDIT: By the way, (x * 0x08040201) can be converted to { uint32_t tmp = x + (x << 9); tmp += (tmp << 18); } for processors with efficient "shift + add" instructions. I just noticed GCC does this.

RKSimon commented 7 months ago

@john-brawn-arm @davemgreen Can you confirm if we could replace the 0x11111111 load with an imm move?

davemgreen commented 7 months ago

Hi - it may depend on the architecture and on Arm vs Thumb. I would guess that GCC is defaulting to -mthumb and Clang to -marm, but the exact architecture features can be difficult to get right at times too. (The default target without anything else specified might be armv4.)

I see a single instruction with -mthumb, and a movw/movt pair with -marm, as long as the architecture is recent enough. https://godbolt.org/z/4qaM8jhhq

Explorer09 commented 7 months ago

Hi - it may depend on the architecture and on Arm vs Thumb. I would guess that GCC is defaulting to -mthumb and Clang to -marm, but the exact architecture features can be difficult to get right at times too. (The default target without anything else specified might be armv4.)

Acknowledged. It seems like it depends on Thumb support.

aqrit commented 3 months ago

One can popcount 6-bit lanes faster than 8-bit lanes. This factoid is useful when we have fast multiplication and unusual bit widths.

// 6-bit sub-groupings
uint64_t popcount_60bits (uint64_t x) {
    const uint64_t m0 = UINT64_C(0x06DB6DB6DB6DB6DB);
    const uint64_t m1 = UINT64_C(0x0492492492492492);
    const uint64_t m2 = UINT64_C(0x030C30C30C30C30C);
    const uint64_t m3 = UINT64_C(0x0104104104104104);

    x = x + (x & m0) + m1;
    x = (x & m2) + ((x >> 3) & m2);
    x = (x * m3) >> 58;
    return x;
}

// 5-bit sub-groups, except the right-most group, which is 6 bits
// (could be adapted to 32 bits: saves 1 op vs. 8-bit lanes, but would use 1 more temp var)
uint32_t popcount_31bits (uint32_t x) {
    const uint32_t m0 = UINT32_C(0xB5AD6B5B); // 0b1`011`01`011`01`011`01`011`01`011`01`011`011 
    const uint32_t m1 = UINT32_C(0x21084212); // 0b0`010`00`010`00`010`00`010`00`010`00`010`010
    const uint32_t m2 = UINT32_C(0x18C6318C); // 0b00011`00011`00011`00011`00011`00011`00
    const uint32_t m3 = UINT32_C(0x02108421); // 0b00`00001`00001`00001`00001`00001`00001

    x = x + (x & m0) + m1;
    x = (x & m2) + ((x >> 3) & m2);
    x = (x * m3) >> 27;
    return x;
}
Explorer09 commented 3 months ago

@aqrit Wow, I didn't realize this form of addition trick was possible. But I have one comment on your algorithm: the 0xB5AD6B5B constant could just as well be 0x35AD6B5B, since the most significant bit must be zero anyway.

Explorer09 commented 3 months ago

Inspired by the algorithm from @aqrit, here are the 15- and 16-bit versions in case people are interested.

I think these versions can help people understand how the algorithm works.

Does @aqrit claim originality on this algorithm?

unsigned int popcount_15bits (uint16_t x) {
    const uint16_t m0 = 0x35ADU; // 0b001101'01101'01101
    const uint16_t m1 = 0x2108U; // 0b001000'01000'01000
    const uint16_t m2 = 0x18C6U; // 0b000110'00110'00110
    const uint16_t m3 = 0x0421U; // 0b000001'00001'00001
    x = x + (x & m0) + m1;
    x = (x & m2) + ((x >> 3) & m2);
    x = (uint16_t)(x * m3) >> 11;
    return x;
}
unsigned int popcount_16bits (uint16_t x) {
    const uint16_t m0 = 0x4A52U; // 0b010010'10010'10010 (== ~0xB5ADU)
    const uint16_t m1 = 0x2109U; // 0b001000'01000'01001
    const uint16_t m2 = 0x18C6U; // 0b000110'00110'00110
    const uint16_t m3 = 0x0421U; // 0b000001'00001'00001
    x = x - ((x >> 1) & m0) + m1;
    x = (x & m2) + ((x >> 3) & m2);
    x = (uint16_t)(x * m3) >> 11;
    return x;
}
aqrit commented 3 months ago

Does @aqrit claim originality on this algorithm?

I wrote it from scratch.

It is hard to say if I've ever seen it in a blog post over the last 20 years... but also it is not very novel?

HAKMEM ITEM 169 does 6-bit lanes; the difference is whether the 3-bit sum is aligned at the left, right, or center?
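
For reference, the widely circulated C rendering of that HAKMEM-style count (a sketch from memory; HAKMEM itself targets 36-bit PDP-10 words):

unsigned int popcount_hakmem_32(uint32_t n)
{
    /* Per 3-bit group: (4a + 2b + c) - (2a + b) - a == a + b + c. */
    uint32_t tmp = n - ((n >> 1) & 033333333333UL) - ((n >> 2) & 011111111111UL);
    /* Fold adjacent groups into 6-bit lanes, then sum the lanes via mod 63. */
    return ((tmp + (tmp >> 3)) & 030707070707UL) % 63;
}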

0xB5AD6B5B constant could as well be 0x35AD6B5B

This doesn't impact the code size on x86 (not sure about arm). I left it in to clear the MSB if the function were modified to do 32-bit words.

Explorer09 commented 3 months ago

@aqrit

0xB5AD6B5B constant could as well be 0x35AD6B5B

This doesn't impact the code size on x86 (not sure about arm). I left it in to clear the MSB if the function were modified to do 32-bit words.

Do you mean this modification?

unsigned int popcount_32bits(uint32_t x) {
    const uint32_t m0 = 0xB5AD6B5BUL;
    const uint32_t m1 = 0x21084212UL;
    const uint32_t m2 = 0x18C6318CUL;
    const uint32_t m3 = 0x02108421UL;
    unsigned int msb = x >> 31;
    x = (uint32_t)(x + (x & m0)) + m1;
    x = (x & m2) + ((x >> 3) & m2);
    return ((uint32_t)(x * m3) >> 27) + msb;
}

It is indeed one operation fewer than the "traditional" method (shown below), but depending on the hardware instruction set, I cannot say the new code would always be smaller.

unsigned int bit_count_32_bits_a(uint32_t x) {
    const uint32_t m0 = 0x55555555UL;
    const uint32_t m1 = 0x33333333UL;
    const uint32_t m2 = 0x0F0F0F0FUL;
    const uint32_t m3 = 0x01010101UL;
    x = x - ((x >> 1) & m0);
    x = (x & m1) + ((x >> 2) & m1);
    x = (x + (x >> 4)) & m2;
    // Or: return (x % 255);
    return (uint32_t)(x * m3) >> 24;
}
Explorer09 commented 3 months ago

Good news for RISC-V users: a 3-bit popcount using @aqrit's clever addition produces smaller code on RISC-V than the lookup-table approach proposed earlier.

unsigned int popcount_3bits(uint8_t x) {
    return (x + (x & 0x3) + 0x2) >> 2;
}
unsigned int popcount_3bits_lut(uint8_t x) {
    return (0xE994U >> (x * 2)) & 0x3;
}

Example in Godbolt.org for people who want to try it out: https://godbolt.org/z/4WP3bvdaY

aqrit commented 3 months ago

Does @aqrit claim originality on this algorithm?

No. I'm pretty sure it was in a blog post. However, I can't find it.

:frowning:

Explorer09 commented 3 months ago

Does @aqrit claim originality on this algorithm?

No. I'm pretty sure it was in a blog post. However, I can't find it.

The closest I could find are these three articles...

... that mention the x - ((x >> 1) & 033333333333) - ((x >> 2) & 011111111111) trick. However, I found no article that uses the optimized form (x + (x & 033333333333) + 022222222222) >> 2. I think this way of adding bits in 3-bit groups is pretty new.