mklarqvist / positional-popcount

Fast C functions for computing the positional popcount (pospopcnt).
Apache License 2.0

Accumulate counters using "horizontal reduction" #19

Closed WojciechMula closed 5 years ago

WojciechMula commented 5 years ago

@mklarqvist @lemire I was wondering if it's possible to sum all 16-bit counters solely with SIMD instructions. This PR contains a draft of a solution. It works with my simple test; however, our benchmark fails.

On Cannon Lake I didn't observe any speedup. On Skylake-X there's a decrease from 0.118 to 0.111 cycles per word.
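
A minimal sketch of what such an all-SIMD reduction could look like, assuming a layout where 16-bit counter lanes i and i+16 both track bit position i (the PR's actual layout may differ):

#include <immintrin.h>
#include <stdint.h>

// Sketch: fold 32 x 16-bit partial counters into 16 x 32-bit totals
// using only SIMD instructions, no scalar loop. Assumes lane i and
// lane i+16 both count bit position i.
static inline void reduce_counters_simd(__m512i partial, uint32_t flags[16]) {
    __m256i lo16 = _mm512_castsi512_si256(partial);
    __m256i hi16 = _mm512_extracti64x4_epi64(partial, 1);
    __m512i lo = _mm512_cvtepu16_epi32(lo16);   // widen to 16 x u32
    __m512i hi = _mm512_cvtepu16_epi32(hi16);
    __m512i sum = _mm512_add_epi32(lo, hi);
    // accumulate into the 16 output counters (flags[16] is exactly 64 bytes)
    __m512i acc = _mm512_loadu_si512((const void*)flags);
    _mm512_storeu_si512((void*)flags, _mm512_add_epi32(acc, sum));
}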

lemire commented 5 years ago

Our main inner loop goes like this...

for (size_t i = 0; i < (1 << 16); i += 16) {
   // hot loop body: executed 4096 times per call
}
// the new optimization (horizontal reduction) would go here, executed once

So what is inside the main inner loop gets executed thousands of times more often (4096 times, to be exact) than the part you are trying to optimize.

For short inputs this might matter, but for anything that spans a full inner loop (4096 * 64 bytes) I am skeptical that you'll be able to milk more performance by avoiding our currently naive code... because you are not optimizing hot code.

Maybe I am wrong, but if so I'd like feedback.

WojciechMula commented 5 years ago

Daniel, I'm aware that it's not on the critical path. Since GCC vectorizes the scalar loop anyway, I just wanted to check whether vectorizing the whole step makes sense. So the short answer is: for large data it might help on Skylake-X.
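
For reference, the scalar step in question is roughly of this shape (a sketch with an assumed buffer layout, not the repository's exact code):

#include <immintrin.h>
#include <stdint.h>

// Sketch of the scalar accumulation that GCC auto-vectorizes: spill the
// vector of 16-bit counters to memory, then sum per bit position.
static inline void reduce_counters_scalar(__m512i partial, uint32_t flags[16]) {
    uint16_t buffer[32];
    _mm512_storeu_si512((void*)buffer, partial);
    for (int i = 0; i < 32; i++)
        flags[i % 16] += buffer[i];
}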

lemire commented 5 years ago

@WojciechMula You report 5% gains. That's sizeable and surprising to me. It suggests something is wrong in my mental model. Certainly, 5% is worth having.
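
If I do the arithmetic (and I may be off), the gain implies a costly reduction:

(0.118 - 0.111) cycles/word * (4096 * 64 / 2) words ≈ 0.007 * 131072 ≈ 900 cycles

That is, the measured 5% would mean the old scalar accumulation costs on the order of 900 cycles per block.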

Looking at your pull request, it seems to me that you modified pospopcnt_u16_avx512bw_csa but I think that this function only gets called on processors lacking avx512vbmi. That is, it will get called on a Skylake-X processor, but not on a Cannon Lake.

Thoughts?

mklarqvist commented 5 years ago

> Looking at your pull request, it seems to me that you modified pospopcnt_u16_avx512bw_csa but I think that this function only gets called on processors lacking avx512vbmi.

Both the avx512vbmi and avx512bw subroutines get called separately in the benchmarking programs, but only one of them is called if you're using the higher-level wrapper function.
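
Roughly, the wrapper-level dispatch looks like this (a sketch: the signatures and the *_avx512vbmi/*_scalar names are illustrative; only pospopcnt_u16_avx512bw_csa appears in this thread):

#include <stddef.h>
#include <stdint.h>

// Assumed kernel prototypes, for illustration only.
void pospopcnt_u16_avx512vbmi(const uint16_t*, size_t, uint32_t*);
void pospopcnt_u16_avx512bw_csa(const uint16_t*, size_t, uint32_t*);
void pospopcnt_u16_scalar(const uint16_t*, size_t, uint32_t*);

// Sketch of runtime dispatch: pick the widest kernel the running CPU
// supports (__builtin_cpu_supports is a GCC/Clang builtin).
void pospopcnt_u16_dispatch(const uint16_t* data, size_t len, uint32_t flags[16]) {
    if (__builtin_cpu_supports("avx512vbmi"))       // e.g. Cannon Lake
        pospopcnt_u16_avx512vbmi(data, len, flags);
    else if (__builtin_cpu_supports("avx512bw"))    // e.g. Skylake-X
        pospopcnt_u16_avx512bw_csa(data, len, flags);
    else
        pospopcnt_u16_scalar(data, len, flags);
}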

I'll dig into the code in this PR: 5% is a noticeable increase.

mklarqvist commented 5 years ago

The updated algorithm does not produce correct output:

pospopcnt_u16                               bug:
expected :     99868    100349    100279     99512    100323     99794     99960    100554    100043    100138     99847     99970     99884    100274     99668     99731 
got      :     99704    100167    100117     99354    100163     99640     99802    100414     99886     99962     99696     99812     99723    100109     99505     99590 
Problem detected with pospopcnt_u16.
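
For context, a check like the one that produced the output above compares a kernel against a scalar reference over random input, along these lines (a sketch; the benchmark's actual harness differs in detail):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Scalar reference: for each bit position b, count how many input
// words have bit b set.
static void pospopcnt_u16_ref(const uint16_t* data, size_t len, uint32_t flags[16]) {
    for (size_t i = 0; i < len; i++)
        for (int b = 0; b < 16; b++)
            flags[b] += (data[i] >> b) & 1;
}

// Run a kernel against the reference on random input and compare all
// 16 counters; returns 1 on match, 0 on a mismatch like the one above.
static int check_kernel(void (*kernel)(const uint16_t*, size_t, uint32_t*), size_t len) {
    uint16_t* data = malloc(len * sizeof *data);
    for (size_t i = 0; i < len; i++)
        data[i] = (uint16_t)rand();
    uint32_t expected[16] = {0}, got[16] = {0};
    pospopcnt_u16_ref(data, len, expected);
    kernel(data, len, got);
    free(data);
    return memcmp(expected, got, sizeof expected) == 0;
}
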
lemire commented 5 years ago

@mklarqvist You are correct: my above objection does not hold in the instrumented tests.

mklarqvist commented 5 years ago

@WojciechMula I will close this PR, as it now conflicts with the function names in master and does not produce correct results. Please create another PR with an updated version.

WojciechMula commented 5 years ago

@mklarqvist @lemire I fixed the algorithm, but it appears that the modification does not bring any speedup. I had forgotten to load the initial flags state, and this probably let the compiler remove the initial for-loop. Sorry for the noise.
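
Concretely, the bug was of this shape (an illustrative sketch, not the PR's exact code):

#include <immintrin.h>
#include <stdint.h>

// sums: 16 x 32-bit partial sums produced by the SIMD reduction.
// Starting from zero drops the counts accumulated in flags so far, and
// the loop producing the initial flags state becomes dead code the
// compiler can remove -- likely why the buggy version benchmarked faster.
void flush_buggy(__m512i sums, uint32_t flags[16]) {
    __m512i acc = _mm512_setzero_si512();                 // wrong: ignores flags
    _mm512_storeu_si512((void*)flags, _mm512_add_epi32(acc, sums));
}

void flush_fixed(__m512i sums, uint32_t flags[16]) {
    __m512i acc = _mm512_loadu_si512((const void*)flags); // the missing load
    _mm512_storeu_si512((void*)flags, _mm512_add_epi32(acc, sums));
}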