lu-zero / libvpx

Local libvpx changes (POWER8 Altivec/VSX support)
BSD 3-Clause "New" or "Revised" License

VSX Version of vpx_sad8x8 #22

Closed luctrudeau closed 6 years ago

luctrudeau commented 6 years ago

Implement a VSX version of vpx_sad8x8

Each function must:

luctrudeau commented 6 years ago

The speed tests for the SADTest suite have landed. https://chromium.googlesource.com/webm/libvpx/+/f950248b9b357b21e974e3ace94359d7ee8c7b29

The sad8x8 implementation is currently in review.

By changing how the sum of absolute differences is computed, we can stay in 8-bit lanes. This optimization was also applied to all other SAD block sizes and is currently in review, so expect speedups for all block sizes.
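For readers unfamiliar with the trick, here is a minimal scalar sketch of one way to keep the absolute difference in 8 bits. It assumes the difference is taken as max(a, b) - min(a, b), which never overflows an unsigned byte, so no widening to 16 bits is needed before accumulation. The thread does not show the actual VSX code; `absdiff_u8` and `sad8xn_sketch` are hypothetical names used only for illustration and are not libvpx functions.

```c
#include <assert.h>
#include <stdint.h>

/* |a - b| for unsigned bytes without widening: max(a, b) - min(a, b).
 * The result always fits in 8 bits, which is what lets a vector version
 * stay in 8-bit lanes. Hypothetical helper, not part of libvpx. */
static inline uint8_t absdiff_u8(uint8_t a, uint8_t b) {
  const uint8_t hi = a > b ? a : b;
  const uint8_t lo = a > b ? b : a;
  return (uint8_t)(hi - lo);
}

/* Scalar stand-in for an 8-wide SAD over `rows` rows.
 * Name and signature are assumptions made for this sketch. */
static inline unsigned int sad8xn_sketch(const uint8_t *src, int src_stride,
                                         const uint8_t *ref, int ref_stride,
                                         int rows) {
  unsigned int sad = 0;
  assert(rows > 0);
  for (int y = 0; y < rows; ++y) {
    for (int x = 0; x < 8; ++x) {
      /* Only the running sum needs a wider type. */
      sad += absdiff_u8(src[x], ref[x]);
    }
    src += src_stride;
    ref += ref_stride;
  }
  return sad;
}
```

In a vector version the same idea applies per lane: the byte-wise differences are formed first, and widening happens only when the lanes are summed.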

luctrudeau commented 6 years ago

The VSX version of SAD8xN is now upstream: https://chromium.googlesource.com/webm/libvpx/+/e3ce12cfc1c2d2cc245e1a6d49eaf3ff18538547

Speedups compared to C are as follows:

8x4: C time = 68.7 ms (±0.3 ms), VSX time = 31.8 ms (±0.1 ms) [2.2x]
8x8: C time = 55.6 ms (±0.3 ms), VSX time = 18.3 ms (±0.1 ms) [3.0x]
8x16: C time = 46.5 ms (±0.1 ms), VSX time = 15.6 ms (±0.1 ms) [3.0x]

luctrudeau commented 6 years ago

The PROCESS16 macro now uses 8-bit lanes instead of 16-bit lanes. https://chromium.googlesource.com/webm/libvpx/+/f9dc411d89eed99d7def7de1e9dddba782c1212c

This results in speedups for all other block sizes when compared to the previous VSX code:

16x8: old VSX time = 16.7 ms, new VSX time = 9.1 ms [1.8x]
16x16: old VSX time = 15.7 ms, new VSX time = 7.9 ms [2.0x]
16x32: old VSX time = 14.4 ms, new VSX time = 7.2 ms [2.0x]
32x16: old VSX time = 14.0 ms, new VSX time = 7.4 ms [1.9x]
32x32: old VSX time = 13.4 ms, new VSX time = 6.5 ms [2.0x]
32x64: old VSX time = 12.7 ms, new VSX time = 6.3 ms [2.0x]
64x32: old VSX time = 12.6 ms, new VSX time = 6.3 ms [2.0x]
64x64: old VSX time = 12.7 ms, new VSX time = 6.2 ms [2.0x]