and putting back the standard popcnt which is important as it is a known lower bound on the cycle count.
We are now running at
pospopcnt_u16_avx512_csa cycles per 16-bit word: 0.166
avx512popcnt cycles per 16-bit word: 0.086
You can do the standard popcnt using about 0.344 cycles per 64-bit word (or 4 x 16 bits). It takes about 0.66 cycles to do the 16-bit pospopcnt of a 64-bit word.
If you try to do the same thing without SIMD instructions, it takes 70 cycles for the same task.
In other words, SIMD instructions make the code run 100 faster.
and putting back the standard popcnt which is important as it is a known lower bound on the cycle count.
We are now running at
You can do the standard popcnt using about 0.344 cycles per 64-bit word (or 4 x 16 bits). It takes about 0.66 cycles to do the 16-bit pospopcnt of a 64-bit word.
If you try to do the same thing without SIMD instructions, it takes 70 cycles for the same task.
In other words, SIMD instructions make the code run 100 faster.
@WojciechMula