Speed up VSX convolution code

vpx_convolve8_vsx is the most time-consuming function in libVPX on POWER. On POWER8, 24% of the runtime is spent in vpx_convolve8_vsx; on POWER9 that share rises to 30%. Optimizing this function further will have a considerable impact on libVPX encoding speed on POWER.

This makes it the best place to spend optimization effort in libVPX on POWER. Doubling the speed of vpx_convolve8_vsx would cut encoding time by roughly 12 to 15%, that is, half of its 24 to 30% share of the runtime.
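
The estimate above is an application of Amdahl's law. As a minimal sketch (the helper name `runtime_saved` is hypothetical and not part of libVPX), the fraction of total runtime saved can be computed from the share of time spent in the function and the speedup achieved on it:

```c
#include <assert.h>

/* Fraction of total runtime saved when a part that accounts for `fraction`
 * of the runtime is made `speedup` times faster (Amdahl's law).
 * Hypothetical helper, for illustration only. */
static double runtime_saved(double fraction, double speedup) {
  assert(fraction >= 0.0 && fraction <= 1.0);
  assert(speedup > 0.0);
  /* The optimized part shrinks from `fraction` to `fraction / speedup`. */
  return fraction * (1.0 - 1.0 / speedup);
}
```

With the shares quoted above, `runtime_saved(0.24, 2.0)` and `runtime_saved(0.30, 2.0)` give about 0.12 and 0.15, in line with the estimate. The same formula shows the ceiling: even an infinitely fast convolution would save no more than 24 to 30% of the runtime.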

The work covers the following functions:

[ ] convolve
[ ] convolve_horiz
[ ] convolve_line_h
[ ] convolve_vert
[ ] convolve_line_v
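
For orientation, here is a minimal scalar sketch of what a horizontal 8-tap line convolution computes. It is not the VSX code from the repository. The name `ref_convolve_line_h`, its signature, and the local `clip_pixel` are hypothetical. The sketch also assumes 8-tap filters whose coefficients sum to 128 (7 fractional bits) and a source row with 3 readable pixels to the left and 4 to the right.

```c
#include <assert.h>
#include <stdint.h>

enum { TAPS = 8, FILTER_BITS = 7 }; /* assumed: taps sum to 1 << FILTER_BITS */

/* Clamp an intermediate value to the 8-bit pixel range. */
static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar reference: filter one row of `w` pixels horizontally.
 * `src` must have 3 valid pixels before src[0] and 4 after src[w - 1]. */
static void ref_convolve_line_h(const uint8_t *src, uint8_t *dst, int w,
                                const int16_t filter[TAPS]) {
  assert(w > 0);
  for (int x = 0; x < w; ++x) {
    int sum = 0;
    /* The 8 taps are centered so that filter[3] lines up with src[x]. */
    for (int k = 0; k < TAPS; ++k) sum += src[x + k - 3] * filter[k];
    /* Round to nearest, drop the fractional bits, then clamp.
     * Assumes arithmetic right shift for negative sums, as on common
     * compilers. */
    dst[x] = clip_pixel((sum + (1 << (FILTER_BITS - 1))) >> FILTER_BITS);
  }
}
```

A vectorized version would compute several output pixels per iteration, but it should produce the same results as a scalar loop of this shape, which makes such a reference useful when comparing outputs.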

Testing:

[ ] Must pass the ConvolveTestSuite suite
[ ] Refactor ConvolveTestSuite to use the AbstractBench
[ ] Report performance in the commit message (compared to the C version)
[ ] Show significant speedup over C version
