new scalar umul128 method

aqrit commented 5 years ago

If this project wants to collect all possible methods, here is another one: https://gist.github.com/aqrit/c729815b0165c139d0bac642ab7ee104

Expected to be 37.5% slower than sse2_sad.

lemire commented 5 years ago

Thanks!

mklarqvist commented 5 years ago

Thanks @aqrit. I will add it.

aqrit commented 5 years ago

Ugh! If it were unrolled 2x... the second uint64_t could be processed for ~half-price. 4-bit/12-bit counters instead of 3-bit/12-bit counters.

mklarqvist commented 5 years ago

Ha! Great timing. I am literally computing the performance profile to paste here as we speak.

mklarqvist commented 5 years ago

@aqrit Unrolling seems to reduce performance. Haswell machine:

n = 1000000 
iterations = 10 
array size: 1.907 MB
avx256popcnt                                cycles per 16-bit word:  0.316; ref cycles per 16-bit word: 0.259 
pospopcnt_u16                               cycles per 16-bit word:  0.344; ref cycles per 16-bit word: 0.275 
pospopcnt_u16_scalar_naive                  cycles per 16-bit word:  3.936; ref cycles per 16-bit word: 3.463 
pospopcnt_u16_scalar_naive_nosimd           cycles per 16-bit word:  18.055; ref cycles per 16-bit word: 14.282 
pospopcnt_u16_scalar_partition              cycles per 16-bit word:  3.364; ref cycles per 16-bit word: 2.643 
pospopcnt_u16_scalar_hist1x4                cycles per 16-bit word:  3.124; ref cycles per 16-bit word: 2.456 
pospopcnt_u16_scalar_umul128                cycles per 16-bit word:  2.415; ref cycles per 16-bit word: 1.898 
pospopcnt_u16_scalar_umul128_unroll2        cycles per 16-bit word:  2.969; ref cycles per 16-bit word: 2.333 
pospopcnt_u16_sse_single                    cycles per 16-bit word:  4.308; ref cycles per 16-bit word: 3.385 
pospopcnt_u16_sse_mula                      cycles per 16-bit word:  2.135; ref cycles per 16-bit word: 1.740 
pospopcnt_u16_sse_mula_unroll4              cycles per 16-bit word:  1.736; ref cycles per 16-bit word: 1.428 
pospopcnt_u16_sse_mula_unroll8              cycles per 16-bit word:  1.535; ref cycles per 16-bit word: 1.251 
pospopcnt_u16_sse_mula_unroll16             cycles per 16-bit word:  1.536; ref cycles per 16-bit word: 1.208 
pospopcnt_u16_sse2_sad                      cycles per 16-bit word:  1.384; ref cycles per 16-bit word: 1.116 
pospopcnt_u16_sse2_csa                      cycles per 16-bit word:  0.428; ref cycles per 16-bit word: 0.357 
pospopcnt_u16_avx2_popcnt                   cycles per 16-bit word:  3.133; ref cycles per 16-bit word: 2.469 
pospopcnt_u16_avx2                          cycles per 16-bit word:  4.074; ref cycles per 16-bit word: 3.225 
pospopcnt_u16_avx2_naive_counter            cycles per 16-bit word:  4.002; ref cycles per 16-bit word: 3.144 
pospopcnt_u16_avx2_single                   cycles per 16-bit word:  3.704; ref cycles per 16-bit word: 2.920 
pospopcnt_u16_avx2_lemire                   cycles per 16-bit word:  2.193; ref cycles per 16-bit word: 1.733 
pospopcnt_u16_avx2_lemire2                  cycles per 16-bit word:  1.502; ref cycles per 16-bit word: 1.191 
pospopcnt_u16_avx2_mula                     cycles per 16-bit word:  1.426; ref cycles per 16-bit word: 1.135 
pospopcnt_u16_avx2_mula_unroll4             cycles per 16-bit word:  0.960; ref cycles per 16-bit word: 0.772 
pospopcnt_u16_avx2_mula_unroll8             cycles per 16-bit word:  0.816; ref cycles per 16-bit word: 0.641 
pospopcnt_u16_avx2_mula_unroll16            cycles per 16-bit word:  0.844; ref cycles per 16-bit word: 0.663 
pospopcnt_u16_avx2_mula3                    cycles per 16-bit word:  0.517; ref cycles per 16-bit word: 0.440 
pospopcnt_u16_avx2_csa                      cycles per 16-bit word:  0.329; ref cycles per 16-bit word: 0.275

aqrit commented 5 years ago

unroll + rewrite (which I will do)

the 2nd u64 would cost: 1 load 4 and 5 add 1 mul edit: 1 shr

aqrit commented 5 years ago

updated the gist w/unrolled version https://gist.github.com/aqrit/c729815b0165c139d0bac642ab7ee104

aqrit commented 5 years ago

Same trick could be applied to sse2_sad ... I've got 12 extra psadbw instructions in there. /facepalm

aqrit commented 5 years ago

We can add a third u64 into pospopcnt_u16_scalar_umul128 on the cheap. (sorry for spamming)

mklarqvist commented 5 years ago

Thanks @aqrit . It is indeed slightly faster.

pospopcnt_u16_scalar_umul128                instructions per cycle 3.53, cycles per 16-bit word:  2.414, instructions per 16-bit word 8.528 
min:  2413517 cycles,  8527767 instructions,         248 branch mis.,     8206 cache ref.,        0 cache mis.
avg: 2426353.7 cycles, 8527767.2 instructions,     248.8 branch mis.,  11234.4 cache ref.,      1.0 cache mis.

pospopcnt_u16_scalar_umul128_unroll2        instructions per cycle 2.94, cycles per 16-bit word:  2.389, instructions per 16-bit word 7.027 
min:  2389269 cycles,  7026782 instructions,         248 branch mis.,    10193 cache ref.,        0 cache mis.
avg: 2400136.8 cycles, 7026782.2 instructions,     249.3 branch mis.,  11906.7 cache ref.,      0.8 cache mis.

mklarqvist commented 5 years ago

Now added.

mklarqvist / positional-popcount

new scalar umul128 method #17