If you have a Linux box (with kernel support for performance counters) and it has AVX-512 instructions, then with this PR you can run `make itest`. It will run the following benchmark; there are flags as well.
```
$ ./instrumented_benchmark
n = 10000000
pospopcnt_u16_scalar_naive_nosimd cycles per 16-bit word: 17.657
pospopcnt_u16_scalar_naive cycles per 16-bit word: 3.025
pospopcnt_u16_scalar_partition cycles per 16-bit word: 3.120
pospopcnt_u16_hist1x4 cycles per 16-bit word: 2.994
pospopcnt_u16_sse_single cycles per 16-bit word: 4.119
pospopcnt_u16_sse_mula cycles per 16-bit word: 1.595
pospopcnt_u16_sse_mula_unroll4 cycles per 16-bit word: 1.381
pospopcnt_u16_sse_mula_unroll8 cycles per 16-bit word: 1.466
pospopcnt_u16_sse_mula_unroll16 cycles per 16-bit word: 1.395
pospopcnt_u16_avx2_popcnt cycles per 16-bit word: 2.411
pospopcnt_u16_avx2 cycles per 16-bit word: 3.012
pospopcnt_u16_avx2_naive_counter cycles per 16-bit word: 3.013
pospopcnt_u16_avx2_single cycles per 16-bit word: 3.013
pospopcnt_u16_avx2_lemire cycles per 16-bit word: 2.149
pospopcnt_u16_avx2_lemire2 cycles per 16-bit word: 1.096
pospopcnt_u16_avx2_mula cycles per 16-bit word: 0.898
pospopcnt_u16_avx2_mula_unroll4 cycles per 16-bit word: 0.711
pospopcnt_u16_avx2_mula_unroll8 cycles per 16-bit word: 0.739
pospopcnt_u16_avx2_mula_unroll16 cycles per 16-bit word: 0.724
pospopcnt_u16_avx512 cycles per 16-bit word: 1.512
pospopcnt_u16_avx512_popcnt32_mask cycles per 16-bit word: 0.852
pospopcnt_u16_avx512_popcnt64_mask cycles per 16-bit word: 0.828
pospopcnt_u16_avx512_popcnt cycles per 16-bit word: 1.668
pospopcnt_u16_avx512_mula cycles per 16-bit word: 0.607
pospopcnt_u16_avx512_mula_unroll4 cycles per 16-bit word: 0.570
pospopcnt_u16_avx512_mula_unroll8 cycles per 16-bit word: 0.56
```
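For context, here is a minimal sketch of the general pattern for counting cycles with the kernel's performance counters via `perf_event_open`. This is not the PR's `instrumented_benchmark`, just an illustration of what "kernel support for counters" buys you; the `touch()` kernel, buffer size, and output format are placeholders I made up.

```c
/* Minimal sketch: count CPU cycles around a region of interest using
 * perf_event_open, then divide by the number of 16-bit words.
 * Not the PR's harness; touch() is a placeholder for a pospopcnt kernel. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static uint64_t touch(const uint16_t *in, uint32_t n) {
    uint64_t acc = 0;                     /* stand-in for the timed kernel */
    for (uint32_t i = 0; i < n; ++i) acc += in[i];
    return acc;
}

int main(void) {
    const uint32_t n = 10000000;
    uint16_t *data = calloc(n, sizeof(*data));
    if (!data) return 1;

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* Needs kernel counter support and a permissive perf_event_paranoid. */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    uint64_t sink = touch(data, n);       /* region of interest */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long cycles = 0;
    if (read(fd, &cycles, sizeof(cycles)) != (ssize_t)sizeof(cycles)) return 1;

    /* Printing sink keeps the measured loop from being optimized away. */
    printf("cycles per 16-bit word: %.3f (sink=%llu)\n",
           (double)cycles / n, (unsigned long long)sink);

    close(fd);
    free(data);
    return 0;
}
```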
You may notice `pospopcnt_u16_scalar_naive_nosimd`: it is a true scalar implementation. Your scalar implementations get vectorized by some compilers, which throws off your benchmarks, since you are then actually competing against the autovectorizer.
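For reference, a true scalar positional popcount is just a per-bit accumulation loop. One common way to keep such a baseline genuinely scalar under an optimizing compiler (an assumption here, not necessarily what this PR does) is to turn the vectorizer off for that one function:

```c
#include <stdint.h>

/* Per-bit accumulation: flags[b] counts how many of the n input words have
 * bit b set. The GCC-specific attribute disables autovectorization for this
 * function only; with clang one would compile with -fno-vectorize instead.
 * Illustrative sketch, not the PR's exact code. */
__attribute__((optimize("no-tree-vectorize")))
void scalar_pospopcnt_u16(const uint16_t *data, uint32_t n, uint32_t *flags) {
    for (uint32_t i = 0; i < n; ++i)
        for (int b = 0; b < 16; ++b)
            flags[b] += (data[i] >> b) & 1;
}
```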
The default output is not too exciting, but what is more helpful is the `-v` flag:
The first thing to notice is that, importantly, it provides both the mean and the minimum counter values. These should agree; any large disagreement (more than 5%) should be viewed with suspicion.
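A small sketch of the sanity check implied above, given cycle counts from repeated runs (e.g. read from the perf counter as in the earlier sketch): compare the mean against the minimum and flag a gap above roughly 5%, which usually indicates noise (frequency scaling, interrupts, cold caches). Names are illustrative, not from the PR.

```c
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if the mean and minimum cycle counts agree to within ~5%. */
int counters_agree(const uint64_t *cycles, int trials) {
    uint64_t min = cycles[0], sum = 0;
    for (int i = 0; i < trials; ++i) {
        if (cycles[i] < min) min = cycles[i];
        sum += cycles[i];
    }
    double mean = (double)sum / trials;
    double gap  = (mean - (double)min) / (double)min;  /* relative disagreement */
    printf("min=%llu mean=%.1f gap=%.1f%%\n",
           (unsigned long long)min, mean, 100.0 * gap);
    return gap <= 0.05;   /* more than 5% apart: treat the numbers with suspicion */
}
```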
Note also that, to a good approximation, the better speeds are achieved by reducing the instruction count.
My Knights Landing box does not have performance counters, so it is a bit useless for this tool.