Closed aqrit closed 5 years ago
Thanks!
Thanks @aqrit. I will add it.
Ugh! If it were unrolled 2x... the second uint64_t could be processed for ~half-price. 4-bit/12-bit counters instead of 3-bit/12-bit counters.
Ha! Great timing. I am literally computing the performance profile to paste here as we speak.
@aqrit Unrolling seems to reduce performance. Haswell machine:
n = 1000000
iterations = 10
array size: 1.907 MB
avx256popcnt cycles per 16-bit word: 0.316; ref cycles per 16-bit word: 0.259
pospopcnt_u16 cycles per 16-bit word: 0.344; ref cycles per 16-bit word: 0.275
pospopcnt_u16_scalar_naive cycles per 16-bit word: 3.936; ref cycles per 16-bit word: 3.463
pospopcnt_u16_scalar_naive_nosimd cycles per 16-bit word: 18.055; ref cycles per 16-bit word: 14.282
pospopcnt_u16_scalar_partition cycles per 16-bit word: 3.364; ref cycles per 16-bit word: 2.643
pospopcnt_u16_scalar_hist1x4 cycles per 16-bit word: 3.124; ref cycles per 16-bit word: 2.456
pospopcnt_u16_scalar_umul128 cycles per 16-bit word: 2.415; ref cycles per 16-bit word: 1.898
pospopcnt_u16_scalar_umul128_unroll2 cycles per 16-bit word: 2.969; ref cycles per 16-bit word: 2.333
pospopcnt_u16_sse_single cycles per 16-bit word: 4.308; ref cycles per 16-bit word: 3.385
pospopcnt_u16_sse_mula cycles per 16-bit word: 2.135; ref cycles per 16-bit word: 1.740
pospopcnt_u16_sse_mula_unroll4 cycles per 16-bit word: 1.736; ref cycles per 16-bit word: 1.428
pospopcnt_u16_sse_mula_unroll8 cycles per 16-bit word: 1.535; ref cycles per 16-bit word: 1.251
pospopcnt_u16_sse_mula_unroll16 cycles per 16-bit word: 1.536; ref cycles per 16-bit word: 1.208
pospopcnt_u16_sse2_sad cycles per 16-bit word: 1.384; ref cycles per 16-bit word: 1.116
pospopcnt_u16_sse2_csa cycles per 16-bit word: 0.428; ref cycles per 16-bit word: 0.357
pospopcnt_u16_avx2_popcnt cycles per 16-bit word: 3.133; ref cycles per 16-bit word: 2.469
pospopcnt_u16_avx2 cycles per 16-bit word: 4.074; ref cycles per 16-bit word: 3.225
pospopcnt_u16_avx2_naive_counter cycles per 16-bit word: 4.002; ref cycles per 16-bit word: 3.144
pospopcnt_u16_avx2_single cycles per 16-bit word: 3.704; ref cycles per 16-bit word: 2.920
pospopcnt_u16_avx2_lemire cycles per 16-bit word: 2.193; ref cycles per 16-bit word: 1.733
pospopcnt_u16_avx2_lemire2 cycles per 16-bit word: 1.502; ref cycles per 16-bit word: 1.191
pospopcnt_u16_avx2_mula cycles per 16-bit word: 1.426; ref cycles per 16-bit word: 1.135
pospopcnt_u16_avx2_mula_unroll4 cycles per 16-bit word: 0.960; ref cycles per 16-bit word: 0.772
pospopcnt_u16_avx2_mula_unroll8 cycles per 16-bit word: 0.816; ref cycles per 16-bit word: 0.641
pospopcnt_u16_avx2_mula_unroll16 cycles per 16-bit word: 0.844; ref cycles per 16-bit word: 0.663
pospopcnt_u16_avx2_mula3 cycles per 16-bit word: 0.517; ref cycles per 16-bit word: 0.440
pospopcnt_u16_avx2_csa cycles per 16-bit word: 0.329; ref cycles per 16-bit word: 0.275
unroll + rewrite (which I will do)
the 2nd u64 would cost: 1 load 4 and 5 add 1 mul edit: 1 shr
updated the gist w/unrolled version https://gist.github.com/aqrit/c729815b0165c139d0bac642ab7ee104
Same trick could be applied to sse2_sad ... I've got 12 extra psadbw instructions in there. /facepalm
We can add a third u64 into pospopcnt_u16_scalar_umul128 on the cheap. (sorry for spamming)
Thanks @aqrit . It is indeed slightly faster.
pospopcnt_u16_scalar_umul128 instructions per cycle 3.53, cycles per 16-bit word: 2.414, instructions per 16-bit word 8.528
min: 2413517 cycles, 8527767 instructions, 248 branch mis., 8206 cache ref., 0 cache mis.
avg: 2426353.7 cycles, 8527767.2 instructions, 248.8 branch mis., 11234.4 cache ref., 1.0 cache mis.
pospopcnt_u16_scalar_umul128_unroll2 instructions per cycle 2.94, cycles per 16-bit word: 2.389, instructions per 16-bit word 7.027
min: 2389269 cycles, 7026782 instructions, 248 branch mis., 10193 cache ref., 0 cache mis.
avg: 2400136.8 cycles, 7026782.2 instructions, 249.3 branch mis., 11906.7 cache ref., 0.8 cache mis.
Now added.
If this project wants to collect all possible methods, here is another one: https://gist.github.com/aqrit/c729815b0165c139d0bac642ab7ee104
Expected to be 37.5% slower than sse2_sad.