mklarqvist / positional-popcount

Fast C functions for the computing the positional popcount (pospopcnt).
Apache License 2.0
53 stars 5 forks source link

Faster 32-bit pospopcnt #37

Closed WojciechMula closed 4 years ago

WojciechMula commented 4 years ago

Our first 32-bit pospopcnt procedures are merely naive translations of 16-bit procedures. While Harley-Seal algorithm cannot be simplified further, updates of vector of MSBs (v16) was done in a simple loop. I come up with something a bit faster. (Maybe this approach can be applied also to 16-bit pospopcnt?)

cc @lemire

Below are benchmark results from Skylake-X machine:

Will test 4096 flags (8 bit proc: 4kB, 16 bit proc: 8kB, 32-bit proc: 16kB) repeated 10 times.
pospopcnt_u32_scalar_naive                           57494.40
pospopcnt_u32_sse_harley_seal                        11480.80
pospopcnt_u32_sse_harley_seal_improved                4878.80
pospopcnt_u32_avx2_harley_seal                        4976.50
pospopcnt_u32_avx2_harley_improved                    4346.60
Will test 8192 flags (8 bit proc: 8kB, 16 bit proc: 16kB, 32-bit proc: 32kB) repeated 10 times.
pospopcnt_u32_scalar_naive                          129178.00
pospopcnt_u32_sse_harley_seal                        22354.50
pospopcnt_u32_sse_harley_seal_improved               13806.20
pospopcnt_u32_avx2_harley_seal                        9729.40
pospopcnt_u32_avx2_harley_improved                    7187.30
Will test 16384 flags (8 bit proc: 16kB, 16 bit proc: 32kB, 32-bit proc: 64kB) repeated 10 times.
pospopcnt_u32_scalar_naive                          261518.70
pospopcnt_u32_sse_harley_seal                        32424.60
pospopcnt_u32_sse_harley_seal_improved               17789.00
pospopcnt_u32_avx2_harley_seal                       14156.50
pospopcnt_u32_avx2_harley_improved                   10143.00
Will test 32768 flags (8 bit proc: 32kB, 16 bit proc: 64kB, 32-bit proc: 128kB) repeated 10 times.
pospopcnt_u32_scalar_naive                          451576.50
pospopcnt_u32_sse_harley_seal                        42470.10
pospopcnt_u32_sse_harley_seal_improved               25285.70
pospopcnt_u32_avx2_harley_seal                       28344.20
pospopcnt_u32_avx2_harley_improved                   15666.20
Will test 65536 flags (8 bit proc: 64kB, 16 bit proc: 128kB, 32-bit proc: 256kB) repeated 10 times.
pospopcnt_u32_scalar_naive                          778506.90
pospopcnt_u32_sse_harley_seal                        75455.20
pospopcnt_u32_sse_harley_seal_improved               48105.10
pospopcnt_u32_avx2_harley_seal                       49102.10
pospopcnt_u32_avx2_harley_improved                   29535.20
WojciechMula commented 4 years ago

There's a detailed description in a comment, but I don't know how to link to specific line in a diff.

mklarqvist commented 4 years ago

Greay work @WojciechMula! The URL to the comment sections is: https://github.com/mklarqvist/positional-popcount/pull/37/files#diff-8faf5f851dc871bd75d3a606351b9b76R2948