Speed Up SADNxNx4D

More than 15% of libvpx encoding time on POWER is spent in the SADNxNx4D functions.

% of encoding time	Function
10.63%	vpx_sad16x16x4d_vsx
3.60%	vpx_sad32x32x4d_vsx
3.22%	vpx_sad64x64x4d_vsx
1.12%	vpx_sad8x8x4d_c

Current VSX SAD implementations can be further optimized for considerable performance improvements. Doubling the speed of the SADNxNx4D functions would reduce encoding time by 5 to 8%.
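For orientation, here is a minimal scalar sketch of what a 16x16 x4d SAD computes: the sum of absolute differences between one source block and four candidate reference blocks in a single call. The names `sad_block` and `sad16x16x4d_ref` are hypothetical, and the parameter layout is only modelled on the usual libvpx convention; this is not the VSX code under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: SAD of one width x height block. */
uint32_t sad_block(const uint8_t *src, int src_stride,
                   const uint8_t *ref, int ref_stride,
                   int width, int height) {
  uint32_t sad = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      /* uint8_t operands are promoted to int, so the difference can be negative. */
      sad += (uint32_t)abs(src[x] - ref[x]);
    }
    src += src_stride;
    ref += ref_stride;
  }
  return sad;
}

/* Hypothetical scalar reference: one source block against four references. */
void sad16x16x4d_ref(const uint8_t *src, int src_stride,
                     const uint8_t *const ref_array[4], int ref_stride,
                     uint32_t sad_array[4]) {
  assert(src != NULL && ref_array != NULL && sad_array != NULL);
  for (int i = 0; i < 4; ++i) {
    sad_array[i] = sad_block(src, src_stride, ref_array[i], ref_stride, 16, 16);
  }
}
```

The scalar version reads each source row four times, once per reference. A vectorized version can load each source row once and reuse it across all four references, which is one plausible place to look for the speedup.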

This includes the following functions and helper macros:

[ ] vpx_sad16x16x4d_vsx
[ ] vpx_sad32x32x4d_vsx
[ ] vpx_sad64x64x4d_vsx
[ ] vpx_sad8x8x4d_vsx
[ ] PROCESS16_4D
[ ] SAD8_4D
[ ] SAD16_4D
[ ] SAD32_4D
[ ] SAD64_4D

Testing:

[ ] Must pass the SADx4Test suite
[ ] Refactor SADx4Test to use the AbstractBench
[ ] Report performance in the commit message (compared to C version)
[ ] Show significant speedup over C version
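
As a rough illustration of the last two items, the sketch below checks a candidate implementation against a scalar reference on random data and estimates the speedup with `clock()`. It is a standalone sketch, not the SADx4Test suite or AbstractBench from the libvpx tree; `sad_x4d_fn`, `sad16x16x4d_scalar`, `matches_reference`, and `speedup_over_reference` are hypothetical names introduced here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical function-pointer type for a 16x16 x4d SAD routine. */
typedef void (*sad_x4d_fn)(const uint8_t *src, int src_stride,
                           const uint8_t *const ref_array[4], int ref_stride,
                           uint32_t sad_array[4]);

enum { kBlock = 16, kStride = 64, kBufSize = kStride * kBlock };

/* Scalar reference used as the baseline (assumed stand-in for the C version). */
void sad16x16x4d_scalar(const uint8_t *src, int src_stride,
                        const uint8_t *const ref_array[4], int ref_stride,
                        uint32_t sad_array[4]) {
  for (int i = 0; i < 4; ++i) {
    const uint8_t *s = src, *r = ref_array[i];
    uint32_t sad = 0;
    for (int y = 0; y < kBlock; ++y, s += src_stride, r += ref_stride)
      for (int x = 0; x < kBlock; ++x) sad += (uint32_t)abs(s[x] - r[x]);
    sad_array[i] = sad;
  }
}

static void fill_random(uint8_t *buf, size_t n) {
  for (size_t i = 0; i < n; ++i) buf[i] = (uint8_t)(rand() & 0xff);
}

/* Returns 1 if candidate and reference agree on `trials` random inputs, else 0. */
int matches_reference(sad_x4d_fn candidate, sad_x4d_fn reference, int trials) {
  static uint8_t src[kBufSize], refs[4][kBufSize];
  for (int t = 0; t < trials; ++t) {
    fill_random(src, kBufSize);
    for (int i = 0; i < 4; ++i) fill_random(refs[i], kBufSize);
    const uint8_t *const ref_array[4] = { refs[0], refs[1], refs[2], refs[3] };
    uint32_t expected[4], actual[4];
    reference(src, kStride, ref_array, kStride, expected);
    candidate(src, kStride, ref_array, kStride, actual);
    for (int i = 0; i < 4; ++i)
      if (expected[i] != actual[i]) return 0;
  }
  return 1;
}

/* CPU seconds spent on `iterations` calls of fn over fixed random buffers. */
static double time_calls(sad_x4d_fn fn, int iterations) {
  static uint8_t src[kBufSize], refs[4][kBufSize];
  fill_random(src, kBufSize);
  for (int i = 0; i < 4; ++i) fill_random(refs[i], kBufSize);
  const uint8_t *const ref_array[4] = { refs[0], refs[1], refs[2], refs[3] };
  volatile uint32_t sink = 0; /* keeps the calls from being optimized away */
  uint32_t out[4];
  clock_t start = clock();
  for (int n = 0; n < iterations; ++n) {
    fn(src, kStride, ref_array, kStride, out);
    sink += out[0] + out[1] + out[2] + out[3];
  }
  (void)sink;
  return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Reference time divided by candidate time; 0.0 if the candidate time is too small to measure. */
double speedup_over_reference(sad_x4d_fn candidate, sad_x4d_fn reference,
                              int iterations) {
  double ref_seconds = time_calls(reference, iterations);
  double cand_seconds = time_calls(candidate, iterations);
  return cand_seconds > 0.0 ? ref_seconds / cand_seconds : 0.0;
}
```

A ratio of this kind, measured with enough iterations to get a stable reading, is the sort of figure that could go into the commit message.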

lu-zero / libvpx, issue #26