Thanks @lemire. Great work @WojciechMula! I'll have to address correctness before merging this PR. I'll also port it to AVX-512.
Isn't the "end game" a Carry-Save-Adder network again? With only the overflow from that going to these functions... which would also mean that the best solution would be one that uses the fewest possible vector registers?
@WojciechMula
@aqrit I was thinking about this, but the problem is the accumulation of the final vector. In a regular popcount we simply mul-add everything together; here we'd need to obtain several sums.
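To make that contrast concrete, here is a hedged sketch (the function name and lane widths are illustrative) of the kind of mul-add reduction a regular vector popcount can finish with; a positional popcount cannot collapse its state this way, because each of the 16 bit positions needs its own running sum.

```c
#include <immintrin.h>
#include <stdint.h>

/* Regular popcount end game: 16-bit per-lane partial counts are multiplied
 * by 1 and added pairwise into 32-bit lanes, then summed into one number. */
static inline uint64_t reduce_total(__m256i counts16) {
    const __m256i pairs = _mm256_madd_epi16(counts16, _mm256_set1_epi16(1));
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i*)lanes, pairs);
    uint64_t total = 0;
    for (int i = 0; i < 8; ++i)
        total += lanes[i];
    return total;
}
```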
I have a new scheme temporarily called purecircuit. It is AVX2-only and seems to beat everything else. As suggested by @aqrit, it is based on a CSA circuit (a rough sketch of the idea follows below).
cc @WojciechMula
pospopcnt_u16_avx512_mula_unroll4 cycles per 16-bit word: 0.584
pospopcnt_u16_avx512_mula_unroll8 cycles per 16-bit word: 0.584
purecircuit cycles per 16-bit word: 0.261
I even find the speed hard to believe... but there you go.
(It is not yet integrated into the main code base; I just appended it in a dirty manner to benchmark/linux/instrumented_benchmark.cpp. The name is obviously not good; it should probably be called csa-something.)
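For illustration only, here is a rough sketch of how a CSA circuit can drive a positional count. It is not the purecircuit code referred to above: the names add_columns and pospopcnt_csa_sketch are made up, the per-column counting is kept scalar for clarity, and it reuses the csa256 step sketched earlier in the thread.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: for every 16-bit lane of v, add `weight` to flags[j]
 * for each set bit j. Kept scalar here purely for clarity. */
static void add_columns(__m256i v, uint32_t* flags, uint32_t weight) {
    uint16_t lanes[16];
    _mm256_storeu_si256((__m256i*)lanes, v);
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j)
            flags[j] += weight * ((lanes[i] >> j) & 1u);
}

/* Adds, into flags[0..15], the number of set bits at each of the 16 bit
 * positions across data[0..n-1]. Pairs of input vectors are pushed through
 * csa256 (the carry-save-adder step sketched above), so the per-column
 * counting only runs on the weight-4 overflow each iteration plus a final
 * flush, instead of on every input word. */
void pospopcnt_csa_sketch(const uint16_t* data, size_t n, uint32_t* flags) {
    __m256i ones = _mm256_setzero_si256();
    __m256i twos = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {  /* two vectors of sixteen 16-bit words */
        const __m256i v0 = _mm256_loadu_si256((const __m256i*)(data + i));
        const __m256i v1 = _mm256_loadu_si256((const __m256i*)(data + i + 16));
        __m256i carry2, carry4;
        csa256(ones, v0, v1, &carry2, &ones);        /* full adder: weight-2 carry */
        csa256(twos, carry2, _mm256_setzero_si256(), /* half adder: weight-4 carry */
               &carry4, &twos);
        add_columns(carry4, flags, 4);
    }
    add_columns(ones, flags, 1);   /* flush the remaining accumulators */
    add_columns(twos, flags, 2);
    for (; i < n; ++i)             /* scalar tail */
        for (int j = 0; j < 16; ++j)
            flags[j] += (data[i] >> j) & 1u;
}
```

A deeper accumulator tree (fours, eights, sixteens) would defer the per-column counting even further, which is presumably where the measured speed comes from.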
Ok, here is a comparison with a popcnt:
purecircuit cycles per 16-bit word: 0.260
avx512popcnt cycles per 16-bit word: 0.087
So the positional popcnt appears to be only about 3 times slower than a pure popcnt.
Note that avx512popcnt uses AVX-512 whereas purecircuit is only AVX2... so it seems possible that an AVX-512 version of purecircuit would close the gap... which would be remarkable.
@lemire @aqrit It is amazing; I thought these partial sums would kill performance. :) BTW, it's worth inspecting the assembly output for these scalar loops. GCC vectorizes them in a very clever way: https://gcc.godbolt.org/z/YMRTYU
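The scalar loops in question presumably look something like the following (a guess at their shape, not the exact code behind the Godbolt link); the fixed-bound inner loop is what GCC can rewrite as shift-and-mask operations over whole vectors.

```c
#include <stdint.h>
#include <stddef.h>

/* Adds, for every 16-bit word in buf, each of its bits to the counter for
 * that bit position. Simple enough for GCC's autovectorizer to handle well. */
void count_columns_scalar(const uint16_t* buf, size_t n, uint32_t* flags) {
    for (size_t i = 0; i < n; ++i)
        for (int j = 0; j < 16; ++j)
            flags[j] += (buf[i] >> j) & 1u;
}
```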
@WojciechMula I wasn't sure it would be fast, but it is.
Note that it is only the first draft. I did not try to optimize it at all. As you can see, this is just a textbook implementation, without any thought.
I am aware that the autovectorizer is pretty good on these problems. We can still kill it in the end.
@lemire purecircuit is very impressive!
@mklarqvist And I bet it can be improved!!!
Updated schemes for mula3 and csa, bringing the new low down to ~0.165 cycles/int.
[marcus@nuc FastFlagStats]$ ./instrumented_benchmark
n = 10000000
pospopcnt_u16_scalar_naive_nosimd cycles per 16-bit word: 17.521
pospopcnt_u16_scalar_naive cycles per 16-bit word: 2.058
pospopcnt_u16_scalar_partition cycles per 16-bit word: 3.074
pospopcnt_u16_hist1x4 cycles per 16-bit word: 2.852
pospopcnt_u16_sse_single cycles per 16-bit word: 3.614
pospopcnt_u16_sse_mula cycles per 16-bit word: 2.084
pospopcnt_u16_sse_mula_unroll4 cycles per 16-bit word: 1.579
pospopcnt_u16_sse_mula_unroll8 cycles per 16-bit word: 1.431
pospopcnt_u16_sse_mula_unroll16 cycles per 16-bit word: 1.384
pospopcnt_u16_avx2_popcnt cycles per 16-bit word: 2.402
pospopcnt_u16_avx2 cycles per 16-bit word: 2.034
pospopcnt_u16_avx2_naive_counter cycles per 16-bit word: 2.034
pospopcnt_u16_avx2_single cycles per 16-bit word: 2.030
pospopcnt_u16_avx2_lemire cycles per 16-bit word: 2.865
pospopcnt_u16_avx2_lemire2 cycles per 16-bit word: 1.701
pospopcnt_u16_avx2_mula cycles per 16-bit word: 1.104
pospopcnt_u16_avx2_mula2 cycles per 16-bit word: 1.468
pospopcnt_u16_avx2_mula3 cycles per 16-bit word: 0.432
pospopcnt_u16_avx2_mula_unroll4 cycles per 16-bit word: 0.845
pospopcnt_u16_avx2_mula_unroll8 cycles per 16-bit word: 0.756
pospopcnt_u16_avx2_mula_unroll16 cycles per 16-bit word: 0.735
pospopcnt_u16_avx512 cycles per 16-bit word: 1.512
pospopcnt_u16_avx512_popcnt32_mask cycles per 16-bit word: 0.821
pospopcnt_u16_avx512_popcnt64_mask cycles per 16-bit word: 0.837
pospopcnt_u16_avx512_popcnt cycles per 16-bit word: 1.677
pospopcnt_u16_avx512_mula cycles per 16-bit word: 0.752
pospopcnt_u16_avx512_mula_unroll4 cycles per 16-bit word: 0.623
pospopcnt_u16_avx512_mula_unroll8 cycles per 16-bit word: 0.555
pospopcnt_u16_avx2_mula3 cycles per 16-bit word: 0.431
pospopcnt_u16_avx512_mula3 cycles per 16-bit word: 0.289
pospopcnt_u16_avx2_csa cycles per 16-bit word: 0.258
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.165
The new pospopcnt_u16_avx2_mula3 is the fastest so far. The only caveat is that it is not quite correct. Also, it is not yet ported to AVX-512. With some luck it should be able to achieve 0.3 cycles per 16-bit word using AVX-512.
cc @WojciechMula