Closed by WojciechMula 5 years ago
Our main inner loop goes like this:
for (i = 0; i < (1<<16); i += 16) {  // (1<<16)/16 = 4096 iterations per call
    // stuff here
}
// new optimization here (runs once per call)
So what is inside the main inner loop gets executed thousands of times more often ((1<<16)/16 = 4096) than the part you are trying to optimize.
For short inputs this might matter, but I am skeptical that, for anything spanning a full inner loop (4096 * 64 bytes), you'll be able to milk more performance by avoiding our currently naive code... because you are not optimizing hot code.
Maybe I am wrong, but if so, I'd like feedback.
Daniel, I'm aware that it's not on the critical path. Since GCC vectorizes the scalar loop anyway, I just wanted to check whether vectorizing the whole step makes sense. So the short answer is: for large data it might help on Skylake-X.
@WojciechMula You report 5% gains. That's sizeable and surprising to me. It suggests something is wrong in my mental model. Certainly, 5% is worth having.
Looking at your pull request, it seems to me that you modified pospopcnt_u16_avx512bw_csa, but I think that this function only gets called on processors lacking avx512vbmi. That is, it will get called on a Skylake-X processor, but not on a Cannon Lake. Thoughts?
> Looking at your pull request, it seems to me that you modified pospopcnt_u16_avx512bw_csa, but I think that this function only gets called on processors lacking avx512vbmi.
Both the avx512vbmi and avx512bw subroutines get called separately in the benchmarking programs, but only one of them is called if you're using the higher-level wrapper function.
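A minimal sketch of how such a runtime dispatch might look (hypothetical; the wrapper and kernel names other than pospopcnt_u16_avx512bw_csa, the signatures, and the use of the GCC/Clang __builtin_cpu_supports builtin are assumptions for illustration, not the repo's actual code):

#include <stdint.h>
#include <stddef.h>

void pospopcnt_u16_avx512vbmi_csa(const uint16_t* data, size_t len, uint32_t* flags); // assumed name
void pospopcnt_u16_avx512bw_csa(const uint16_t* data, size_t len, uint32_t* flags);
void pospopcnt_u16_scalar_naive(const uint16_t* data, size_t len, uint32_t* flags);   // assumed name

// Higher-level wrapper: picks exactly one kernel at runtime, so the
// avx512bw kernel is only reached on CPUs without avx512vbmi
// (e.g. Skylake-X, but not Cannon Lake).
void pospopcnt_u16(const uint16_t* data, size_t len, uint32_t* flags) {
    if (__builtin_cpu_supports("avx512vbmi"))
        pospopcnt_u16_avx512vbmi_csa(data, len, flags);
    else if (__builtin_cpu_supports("avx512bw"))
        pospopcnt_u16_avx512bw_csa(data, len, flags);
    else
        pospopcnt_u16_scalar_naive(data, len, flags);
}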
I'll dig into the code in this PR: 5% is a noticeable increase.
The updated algorithm does not produce correct output:
pospopcnt_u16 bug:
expected : 99868 100349 100279 99512 100323 99794 99960 100554 100043 100138 99847 99970 99884 100274 99668 99731
got : 99704 100167 100117 99354 100163 99640 99802 100414 99886 99962 99696 99812 99723 100109 99505 99590
Problem detected with pospopcnt_u16.
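For context, a check of this kind boils down to comparing the kernel under test against a trivial scalar reference. A minimal sketch (names and signatures are illustrative, not the repo's actual test code):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

// Scalar reference: for each of the 16 bit positions, count how many
// input words have that bit set.
static void pospopcnt_u16_reference(const uint16_t* data, size_t len, uint32_t flags[16]) {
    for (size_t i = 0; i < len; i++)
        for (int b = 0; b < 16; b++)
            flags[b] += (data[i] >> b) & 1;
}

// Returns 0 if the kernel matches the reference on this input, 1 otherwise.
static int check_kernel(const uint16_t* data, size_t len,
                        void (*kernel)(const uint16_t*, size_t, uint32_t*)) {
    uint32_t expected[16] = {0}, got[16] = {0};
    pospopcnt_u16_reference(data, len, expected);
    kernel(data, len, got);
    for (int b = 0; b < 16; b++)
        if (expected[b] != got[b]) {
            printf("Problem detected with pospopcnt_u16.\n");
            return 1;
        }
    return 0;
}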
@mklarqvist You are correct: my above objection does not hold in the instrumented tests.
@WojciechMula I will close this PR as it now conflicts with the function names in master and does not produce correct results. Please create another PR with an updated version.
@mklarqvist @lemire I fixed this algorithm, but it appears that the modification does not bring any speedup. I forgot to load the initial flags state, and this probably let the compiler remove the initial for-loop. Sorry for the noise.
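A simplified sketch of the kind of bug described (assumed, not the PR's actual code): if the final reduction stores into flags[] instead of accumulating into it, the initial flags state is ignored, and any earlier loop that filled flags[] becomes dead code the compiler may remove, which makes the benchmark look faster than it really is.

#include <stdint.h>

// Buggy: overwrites flags[], so the initial counts (and the loop that
// produced them) are dead code and may be eliminated by the compiler.
static void reduce_buggy(uint32_t flags[16], const uint32_t partial[16]) {
    for (int b = 0; b < 16; b++)
        flags[b] = partial[b];
}

// Fixed: accumulates into flags[], so the initial state is loaded and used.
static void reduce_fixed(uint32_t flags[16], const uint32_t partial[16]) {
    for (int b = 0; b < 16; b++)
        flags[b] += partial[b];
}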
@mklarqvist @lemire I was wondering if it's possible to sum all the 16-bit counters solely with SIMD instructions. This PR contains a draft solution. It works with my simple test; however, our benchmark fails.
On Cannon Lake I didn't observe any speedup. On Skylake-X there's a decrease from 0.118 to 0.111 cycles per word.
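As a rough illustration of the idea, here is a hedged sketch (not the PR's code; it uses AVX2 rather than AVX-512, and the counter layout is an assumption) of flushing a register of sixteen 16-bit per-position counters into the 32-bit flags[] array with widening SIMD adds instead of a scalar loop:

#include <immintrin.h>
#include <stdint.h>

// counters holds sixteen 16-bit partial counts, one per bit position
// (layout assumed). Widen each half to 32-bit lanes and accumulate
// into the existing 32-bit counters; note the add, not a plain store.
static void flush_counters(__m256i counters, uint32_t flags[16]) {
    __m256i lo = _mm256_cvtepu16_epi32(_mm256_castsi256_si128(counters));
    __m256i hi = _mm256_cvtepu16_epi32(_mm256_extracti128_si256(counters, 1));
    __m256i f0 = _mm256_loadu_si256((const __m256i*)&flags[0]);
    __m256i f1 = _mm256_loadu_si256((const __m256i*)&flags[8]);
    _mm256_storeu_si256((__m256i*)&flags[0], _mm256_add_epi32(f0, lo));
    _mm256_storeu_si256((__m256i*)&flags[8], _mm256_add_epi32(f1, hi));
}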