If you have a Linux box (with kernel support for performance counters) and it has AVX-512 instructions, then with this PR you can run `make itest`. It will run the following benchmark; there are flags as well.
```
$ ./instrumented_benchmark
n = 10000000
pospopcnt_u16_scalar_naive_nosimd cycles per 16-bit word: 17.657
pospopcnt_u16_scalar_naive cycles per 16-bit word: 3.025
pospopcnt_u16_scalar_partition cycles per 16-bit word: 3.120
pospopcnt_u16_hist1x4 cycles per 16-bit word: 2.994
pospopcnt_u16_sse_single cycles per 16-bit word: 4.119
pospopcnt_u16_sse_mula cycles per 16-bit word: 1.595
pospopcnt_u16_sse_mula_unroll4 cycles per 16-bit word: 1.381
pospopcnt_u16_sse_mula_unroll8 cycles per 16-bit word: 1.466
pospopcnt_u16_sse_mula_unroll16 cycles per 16-bit word: 1.395
pospopcnt_u16_avx2_popcnt cycles per 16-bit word: 2.411
pospopcnt_u16_avx2 cycles per 16-bit word: 3.012
pospopcnt_u16_avx2_naive_counter cycles per 16-bit word: 3.013
pospopcnt_u16_avx2_single cycles per 16-bit word: 3.013
pospopcnt_u16_avx2_lemire cycles per 16-bit word: 2.149
pospopcnt_u16_avx2_lemire2 cycles per 16-bit word: 1.096
pospopcnt_u16_avx2_mula cycles per 16-bit word: 0.898
pospopcnt_u16_avx2_mula_unroll4 cycles per 16-bit word: 0.711
pospopcnt_u16_avx2_mula_unroll8 cycles per 16-bit word: 0.739
pospopcnt_u16_avx2_mula_unroll16 cycles per 16-bit word: 0.724
pospopcnt_u16_avx512 cycles per 16-bit word: 1.512
pospopcnt_u16_avx512_popcnt32_mask cycles per 16-bit word: 0.852
pospopcnt_u16_avx512_popcnt64_mask cycles per 16-bit word: 0.828
pospopcnt_u16_avx512_popcnt cycles per 16-bit word: 1.668
pospopcnt_u16_avx512_mula cycles per 16-bit word: 0.607
pospopcnt_u16_avx512_mula_unroll4 cycles per 16-bit word: 0.570
pospopcnt_u16_avx512_mula_unroll8 cycles per 16-bit word: 0.56
```
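For context, here is a minimal sketch of the general pattern for counting cycles with the kernel's performance counters via `perf_event_open`. This is not the PR's `instrumented_benchmark`, just an illustration of what "kernel support for counters" buys you; the `touch()` kernel, buffer size, and output format are placeholders I made up.

```c
/* Minimal sketch: count CPU cycles around a region of interest using
 * perf_event_open, then divide by the number of 16-bit words.
 * Not the PR's harness; touch() is a placeholder for a pospopcnt kernel. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static uint64_t touch(const uint16_t *in, uint32_t n) {
    uint64_t acc = 0;                     /* stand-in for the timed kernel */
    for (uint32_t i = 0; i < n; ++i) acc += in[i];
    return acc;
}

int main(void) {
    const uint32_t n = 10000000;
    uint16_t *data = calloc(n, sizeof(*data));
    if (!data) return 1;

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* Needs kernel counter support and a permissive perf_event_paranoid. */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    uint64_t sink = touch(data, n);       /* region of interest */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long cycles = 0;
    if (read(fd, &cycles, sizeof(cycles)) != (ssize_t)sizeof(cycles)) return 1;

    /* Printing sink keeps the measured loop from being optimized away. */
    printf("cycles per 16-bit word: %.3f (sink=%llu)\n",
           (double)cycles / n, (unsigned long long)sink);

    close(fd);
    free(data);
    return 0;
}
```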
You may notice `pospopcnt_u16_scalar_naive_nosimd`: it is a true scalar implementation. Your scalar implementations get vectorized by some compilers, which throws off your benchmarks, since you are then actually competing against the autovectorizer.
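For reference, a true scalar positional popcount is just a per-bit accumulation loop. One common way to keep such a baseline genuinely scalar under an optimizing compiler (an assumption here, not necessarily what this PR does) is to turn the vectorizer off for that one function:

```c
#include <stdint.h>

/* Per-bit accumulation: flags[b] counts how many of the n input words have
 * bit b set. The GCC-specific attribute disables autovectorization for this
 * function only; with clang one would compile with -fno-vectorize instead.
 * Illustrative sketch, not the PR's exact code. */
__attribute__((optimize("no-tree-vectorize")))
void scalar_pospopcnt_u16(const uint16_t *data, uint32_t n, uint32_t *flags) {
    for (uint32_t i = 0; i < n; ++i)
        for (int b = 0; b < 16; ++b)
            flags[b] += (data[i] >> b) & 1;
}
```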
The default output is not too exciting, but what is more helpful is the `-v` flag:
The first thing to notice is that, importantly, it provides both the mean and the minimum counter values. These should agree; any large disagreement (more than 5%) should be viewed with suspicion.
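A small sketch of the sanity check implied above, given cycle counts from repeated runs (e.g. read from the perf counter as in the earlier sketch): compare the mean against the minimum and flag a gap above roughly 5%, which usually indicates noise (frequency scaling, interrupts, cold caches). Names are illustrative, not from the PR.

```c
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if the mean and minimum cycle counts agree to within ~5%. */
int counters_agree(const uint64_t *cycles, int trials) {
    uint64_t min = cycles[0], sum = 0;
    for (int i = 0; i < trials; ++i) {
        if (cycles[i] < min) min = cycles[i];
        sum += cycles[i];
    }
    double mean = (double)sum / trials;
    double gap  = (mean - (double)min) / (double)min;  /* relative disagreement */
    printf("min=%llu mean=%.1f gap=%.1f%%\n",
           (unsigned long long)min, mean, 100.0 * gap);
    return gap <= 0.05;   /* more than 5% apart: treat the numbers with suspicion */
}
```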
Note also that, to a good approximation, the better speeds are achieved by reducing the instruction count.
My Knights Landing box does not have performance counters, so it is a bit useless for this tool.