Now that the `volk_8u_conv_k7_r2puppet_8u` kernel has a working test (fixed in #736), we can safely make changes and be confident that the various protokernels are producing identical output. Here I've re-enabled the broken AVX2 convolutional decoder which was commented out in #458. To get identical output to the other protokernels, I made the following changes (each in a separate commit, for easier review):
- Re-normalize the branch metrics on every iteration, to avoid integer overflow. (#736 did the same for the spiral and neonspiral protokernels, which was necessary to get identical output to the generic protokernel.)
- Remove an extraneous permutation that was executed at the beginning of each iteration.
- During re-normalization, compute the minimum branch metric over both AVX2 register lanes.
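
For reviewers, here is a minimal scalar sketch of why the minimum has to cover both lanes. It is not the protokernel code: the function names, the 32-byte "register" and the 16-byte lane width are assumptions made for illustration only. Subtracting the minimum from every 8-bit metric keeps the values small; if the minimum is taken from one lane only, a smaller metric in the other lane wraps around modulo 256.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: minimum of the first n unsigned 8-bit metrics. */
static uint8_t min_u8(const uint8_t *m, size_t n)
{
    assert(n > 0);
    uint8_t best = m[0];
    for (size_t i = 1; i < n; i++) {
        if (m[i] < best) {
            best = m[i];
        }
    }
    return best;
}

/* Re-normalize using the minimum over ALL n metrics (both lanes). */
static void renormalize(uint8_t *metrics, size_t n)
{
    uint8_t m = min_u8(metrics, n);
    for (size_t i = 0; i < n; i++) {
        metrics[i] = (uint8_t)(metrics[i] - m);
    }
}

/* Illustrative faulty variant: the minimum is taken over the first
 * `lane` metrics only, so a smaller value in the remaining metrics
 * wraps around modulo 256 when the minimum is subtracted. */
static void renormalize_one_lane(uint8_t *metrics, size_t n, size_t lane)
{
    uint8_t m = min_u8(metrics, lane);
    for (size_t i = 0; i < n; i++) {
        metrics[i] = (uint8_t)(metrics[i] - m);
    }
}
```

With metrics of 210 in the first 16 bytes, 205 in the last 16, and a single 200 at index 20, `renormalize` maps index 20 to 0, whereas `renormalize_one_lane` (with `lane` set to 16) subtracts 210 and wraps it to 246.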
I tested with many different vector lengths (`for v in {0..2048}; do echo $v; apps/volk_profile -n -R k7_r2puppet -v $v -i 1 2>&1 | grep fail; done`), and did not observe any test failures.
Performance of the AVX2 protokernel is slightly worse than the spiral protokernel at the default 131071 vector length, but better at shorter vector lengths (e.g. 16384). I suspect that with some further tweaks, AVX2 performance could be improved, but I'll leave that for a future PR. In particular, increasing the metric shift (as was done in #475 but reverted in #736) may reduce the number of expensive re-normalizations that need to be performed. And perhaps the minimum calculation (which is not SIMD-friendly) could be removed from the re-normalization as well.
Reverts #458.
/cc @Aang23