Thanks @lemire. Great work @WojciechMula! I'll have to address correctness before merging this PR. I'll also port it to AVX-512.
Isn't the "end game" a Carry-Save-Adder network again? With only the overflow from that going to these functions... which would also mean that the best solution would be one that uses the fewest possible vector registers?
@WojciechMula
@aqrit I was thinking about this, but the problem is the accumulation of the final vector. In a regular popcount we simply mul-add everything together; here we'd need to obtain several sums.
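To make that contrast concrete, here is a hedged sketch (the function name and lane widths are illustrative) of the kind of mul-add reduction a regular vector popcount can finish with; a positional popcount cannot collapse its state this way, because each of the 16 bit positions needs its own running sum.

```c
#include <immintrin.h>
#include <stdint.h>

/* Regular popcount end game: 16-bit per-lane partial counts are multiplied
 * by 1 and added pairwise into 32-bit lanes, then summed into one number. */
static inline uint64_t reduce_total(__m256i counts16) {
    const __m256i pairs = _mm256_madd_epi16(counts16, _mm256_set1_epi16(1));
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i*)lanes, pairs);
    uint64_t total = 0;
    for (int i = 0; i < 8; ++i)
        total += lanes[i];
    return total;
}
```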
I have a new scheme temporarily called purecircuit. It is AVX2-only and seems to beat everything else. As suggested by @aqrit, it is based on a CSA circuit (a rough sketch of the idea follows below).
cc @WojciechMula
pospopcnt_u16_avx512_mula_unroll4 cycles per 16-bit word: 0.584
pospopcnt_u16_avx512_mula_unroll8 cycles per 16-bit word: 0.584
purecircuit cycles per 16-bit word: 0.261
I even find the speed hard to believe... but there you go.
(It is not yet integrated into the main code base; I just appended it in a dirty manner to benchmark/linux/instrumented_benchmark.cpp. The name is obviously not good; it should probably be called csa-something.)
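For illustration only, here is a rough sketch of how a CSA circuit can drive a positional count. It is not the purecircuit code referred to above: the names add_columns and pospopcnt_csa_sketch are made up, the per-column counting is kept scalar for clarity, and it reuses the csa256 step sketched earlier in the thread.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: for every 16-bit lane of v, add `weight` to flags[j]
 * for each set bit j. Kept scalar here purely for clarity. */
static void add_columns(__m256i v, uint32_t* flags, uint32_t weight) {
    uint16_t lanes[16];
    _mm256_storeu_si256((__m256i*)lanes, v);
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j)
            flags[j] += weight * ((lanes[i] >> j) & 1u);
}

/* Adds, into flags[0..15], the number of set bits at each of the 16 bit
 * positions across data[0..n-1]. Pairs of input vectors are pushed through
 * csa256 (the carry-save-adder step sketched above), so the per-column
 * counting only runs on the weight-4 overflow each iteration plus a final
 * flush, instead of on every input word. */
void pospopcnt_csa_sketch(const uint16_t* data, size_t n, uint32_t* flags) {
    __m256i ones = _mm256_setzero_si256();
    __m256i twos = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {  /* two vectors of sixteen 16-bit words */
        const __m256i v0 = _mm256_loadu_si256((const __m256i*)(data + i));
        const __m256i v1 = _mm256_loadu_si256((const __m256i*)(data + i + 16));
        __m256i carry2, carry4;
        csa256(ones, v0, v1, &carry2, &ones);        /* full adder: weight-2 carry */
        csa256(twos, carry2, _mm256_setzero_si256(), /* half adder: weight-4 carry */
               &carry4, &twos);
        add_columns(carry4, flags, 4);
    }
    add_columns(ones, flags, 1);   /* flush the remaining accumulators */
    add_columns(twos, flags, 2);
    for (; i < n; ++i)             /* scalar tail */
        for (int j = 0; j < 16; ++j)
            flags[j] += (data[i] >> j) & 1u;
}
```

A deeper accumulator tree (fours, eights, sixteens) would defer the per-column counting even further, which is presumably where the measured speed comes from.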
Ok, here is a comparison with a popcnt:
purecircuit cycles per 16-bit word: 0.260
avx512popcnt cycles per 16-bit word: 0.087
So the positional popcnt appears to be only about 3 times slower than a pure popcnt.
Note that avx512popcnt uses AVX-512 whereas purecircuit is only AVX2... so it seems possible that an AVX-512 version of purecircuit would close the gap... which would be remarkable.
@lemire @aqrit It is amazing; I thought these partial sums would kill performance. :) BTW, it's worth inspecting the assembly output for these scalar loops. GCC vectorizes them in a very clever way: https://gcc.godbolt.org/z/YMRTYU
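The scalar loops in question presumably look something like the following (a guess at their shape, not the exact code behind the Godbolt link); the fixed-bound inner loop is what GCC can rewrite as shift-and-mask operations over whole vectors.

```c
#include <stdint.h>
#include <stddef.h>

/* Adds, for every 16-bit word in buf, each of its bits to the counter for
 * that bit position. Simple enough for GCC's autovectorizer to handle well. */
void count_columns_scalar(const uint16_t* buf, size_t n, uint32_t* flags) {
    for (size_t i = 0; i < n; ++i)
        for (int j = 0; j < 16; ++j)
            flags[j] += (buf[i] >> j) & 1u;
}
```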
@WojciechMula I wasn't sure it would be fast, but it is.
Note that it is only the first draft. I did not try to optimize it at all. As you can see, this is just a textbook implementation, without any thought.
I am aware that the autovectorizer is pretty good on these problems. We can still kill it in the end.
@lemire purecircuit is very impressive!
@mklarqvist And I bet it can be improved!!!
Updated schemes for mula3 and csa, bringing the new low down to ~0.165 cycles/int.
[marcus@nuc FastFlagStats]$ ./instrumented_benchmark
n = 10000000
pospopcnt_u16_scalar_naive_nosimd cycles per 16-bit word: 17.521
pospopcnt_u16_scalar_naive cycles per 16-bit word: 2.058
pospopcnt_u16_scalar_partition cycles per 16-bit word: 3.074
pospopcnt_u16_hist1x4 cycles per 16-bit word: 2.852
pospopcnt_u16_sse_single cycles per 16-bit word: 3.614
pospopcnt_u16_sse_mula cycles per 16-bit word: 2.084
pospopcnt_u16_sse_mula_unroll4 cycles per 16-bit word: 1.579
pospopcnt_u16_sse_mula_unroll8 cycles per 16-bit word: 1.431
pospopcnt_u16_sse_mula_unroll16 cycles per 16-bit word: 1.384
pospopcnt_u16_avx2_popcnt cycles per 16-bit word: 2.402
pospopcnt_u16_avx2 cycles per 16-bit word: 2.034
pospopcnt_u16_avx2_naive_counter cycles per 16-bit word: 2.034
pospopcnt_u16_avx2_single cycles per 16-bit word: 2.030
pospopcnt_u16_avx2_lemire cycles per 16-bit word: 2.865
pospopcnt_u16_avx2_lemire2 cycles per 16-bit word: 1.701
pospopcnt_u16_avx2_mula cycles per 16-bit word: 1.104
pospopcnt_u16_avx2_mula2 cycles per 16-bit word: 1.468
pospopcnt_u16_avx2_mula3 cycles per 16-bit word: 0.432
pospopcnt_u16_avx2_mula_unroll4 cycles per 16-bit word: 0.845
pospopcnt_u16_avx2_mula_unroll8 cycles per 16-bit word: 0.756
pospopcnt_u16_avx2_mula_unroll16 cycles per 16-bit word: 0.735
pospopcnt_u16_avx512 cycles per 16-bit word: 1.512
pospopcnt_u16_avx512_popcnt32_mask cycles per 16-bit word: 0.821
pospopcnt_u16_avx512_popcnt64_mask cycles per 16-bit word: 0.837
pospopcnt_u16_avx512_popcnt cycles per 16-bit word: 1.677
pospopcnt_u16_avx512_mula cycles per 16-bit word: 0.752
pospopcnt_u16_avx512_mula_unroll4 cycles per 16-bit word: 0.623
pospopcnt_u16_avx512_mula_unroll8 cycles per 16-bit word: 0.555
pospopcnt_u16_avx2_mula3 cycles per 16-bit word: 0.431
pospopcnt_u16_avx512_mula3 cycles per 16-bit word: 0.289
pospopcnt_u16_avx2_csa cycles per 16-bit word: 0.258
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.165
The new pospopcnt_u16_avx2_mula3 is the fastest so far. The only caveat is that it is not quite correct. Also, it is not yet ported to AVX-512. With some luck it should be able to achieve 0.3 cycles per 16-bit word using AVX-512.
cc @WojciechMula