AVX2 optimization for ARGBToI420

GoogleCodeExporter commented 9 years ago

Since the ToI420 conversions are more important, we propose an AVX2 
optimization patch for ARGBToI420. It moves from 128-bit xmm registers to 256-bit 
ymm registers on platforms that support AVX2. The performance gain on one of 
the Haswell platforms is listed below:

Was (r556 on Haswell)
ARGBToI420_Any (634 ms)
ARGBToI420_Unaligned (692 ms)
ARGBToI420_Invert (606 ms)
ARGBToI420_Opt (590 ms)

Now (on the same Haswell)
ARGBToI420_Any (589 ms)
ARGBToI420_Unaligned (527 ms)
ARGBToI420_Invert (531 ms)
ARGBToI420_Opt (505 ms)
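
To make the register widening concrete, here is a minimal scalar sketch of the luma (Y) half of the conversion, plus the arithmetic for how many ARGB pixels fit in one register. The function names are hypothetical, and the coefficients are the usual BT.601 studio-range weights in 8-bit fixed point with B, G, R, A byte order in memory; both are assumptions for illustration, not taken from the attached patch.

```python
# Assumption: BT.601 studio-range luma weights in 8-bit fixed point.
# Illustrative constants only; the attached patch is not quoted here.
Y_R, Y_G, Y_B, Y_BIAS = 66, 129, 25, 0x1080

BYTES_PER_ARGB_PIXEL = 4


def pixels_per_register(register_bits):
    """Number of whole ARGB pixels that fit in one SIMD register."""
    return (register_bits // 8) // BYTES_PER_ARGB_PIXEL


def argb_row_to_y(argb_row):
    """Scalar reference: produce one Y byte per 4-byte pixel.

    Assumes each pixel is stored as B, G, R, A in memory.
    """
    if len(argb_row) % BYTES_PER_ARGB_PIXEL:
        raise ValueError("row length must be a multiple of 4 bytes")
    y_row = bytearray()
    for i in range(0, len(argb_row), BYTES_PER_ARGB_PIXEL):
        b, g, r = argb_row[i], argb_row[i + 1], argb_row[i + 2]
        y_row.append((Y_R * r + Y_G * g + Y_B * b + Y_BIAS) >> 8)
    return bytes(y_row)


print(pixels_per_register(128), pixels_per_register(256))
```

A 128-bit xmm register holds 4 ARGB pixels and a 256-bit ymm register holds 8, so a widened loop needs half as many loads for the same row. The measured gain above is well under 2x, so register width is clearly not the only cost in the loop.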

Original issue reported on code.google.com by changjun...@intel.com on 5 Feb 2013 at 7:13

Attachments:

ARGBToI420_AVX2.patch

GoogleCodeExporter commented 9 years ago

Sweet!  Could you do I420ToARGB?

Original comment by fbarch...@chromium.org on 5 Feb 2013 at 9:32

GoogleCodeExporter commented 9 years ago

Sure. I will try that one if this patch looks good.

Original comment by changjun...@intel.com on 6 Feb 2013 at 1:31

GoogleCodeExporter commented 9 years ago

Put up for review: https://webrtc-codereview.appspot.com/1090005
It's generally good in form, but minor changes are preferred:
use a macro for the ifdef, not the compiler version.
conditionally define the macro.
unconditionally prototype.
types should be lower case.

Original comment by fbarch...@chromium.org on 6 Feb 2013 at 9:49

Changed state: Started

GoogleCodeExporter commented 9 years ago

In the code review I have some changes and questions, if you don't mind.

For the SSSE3 version, because you didn't clear the upper halves of the vector 
registers, your timings may be wrong.
I get:

d:\src\libyuv\trunk>out\release\libyuv_unittest --gtest_filter=*ARGBToI420*   | 
sed "s/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g"   |
c:\cygwin\bin\sort -rn   | grep ms
424 - [       OK ] libyuvTest.ARGBToI420_Unaligned (424 ms)
406 - [       OK ] libyuvTest.ARGBToI420_Any (406 ms)
383 - [       OK ] libyuvTest.ARGBToI420_Opt (383 ms)
380 - [       OK ] libyuvTest.ARGBToI420_Invert (380 ms)
[==========] 4 tests from 1 test case ran. (1593 ms total)

On an HP Z620 with an E5-2690 (Sandy Bridge).  So I'd hope the Haswell can beat 
that.
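
For anyone without sed and cygwin sort on hand, the pipeline above can be approximated in a few lines of Python. This is a sketch under the assumption that the per-test result lines look exactly like the ones pasted here; the function name is hypothetical, and unlike the shell version it drops the summary line instead of leaving it at the bottom.

```python
import re

# Matches gtest result lines such as
# "[       OK ] libyuvTest.ARGBToI420_Opt (383 ms)".
RESULT_RE = re.compile(r"\[\s+OK\s+\]\s+(\S+)\s+\((\d+) ms\)")


def slowest_first(gtest_output):
    """Return (milliseconds, test name) pairs, slowest test first."""
    results = []
    for line in gtest_output.splitlines():
        match = RESULT_RE.search(line)
        if match:
            results.append((int(match.group(2)), match.group(1)))
    return sorted(results, reverse=True)


sample = """\
[       OK ] libyuvTest.ARGBToI420_Opt (383 ms)
[       OK ] libyuvTest.ARGBToI420_Unaligned (424 ms)
[==========] 4 tests from 1 test case ran. (1593 ms total)
"""

for ms, name in slowest_first(sample):
    print(f"{ms} - {name}")
```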

Original comment by fbarch...@chromium.org on 6 Feb 2013 at 10:23

GoogleCodeExporter commented 9 years ago

before
d:\src\libyuv\trunk>more noavx.txt
[       OK ] libyuvTest.ARGBToI420_Any (515 ms)
[       OK ] libyuvTest.ARGBToI420_Unaligned (530 ms)
[       OK ] libyuvTest.ARGBToI420_Invert (499 ms)
[       OK ] libyuvTest.ARGBToI420_Opt (500 ms)
[----------] 4 tests from libyuvTest (2044 ms total)

after
[       OK ] libyuvTest.ARGBToI420_Any (468 ms)
[       OK ] libyuvTest.ARGBToI420_Unaligned (421 ms)
[       OK ] libyuvTest.ARGBToI420_Invert (406 ms)
[       OK ] libyuvTest.ARGBToI420_Opt (405 ms)
[----------] 4 tests from libyuvTest (1700 ms total)

20% faster overall.  Not a big win?
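
The 20% figure can be checked from the timings above. A small sketch (helper names are hypothetical) that separates "how many times faster" from "how much less time", since the two are easy to mix up:

```python
def speedup(before_ms, after_ms):
    """How many times faster the new run is (throughput ratio)."""
    return before_ms / after_ms


def time_saved(before_ms, after_ms):
    """Fraction of the original run time that was eliminated."""
    return (before_ms - after_ms) / before_ms


# Per-test and total timings copied from the before/after runs above.
before = {"Any": 515, "Unaligned": 530, "Invert": 499, "Opt": 500, "total": 2044}
after = {"Any": 468, "Unaligned": 421, "Invert": 406, "Opt": 405, "total": 1700}

for name in before:
    print(f"{name:10s} {speedup(before[name], after[name]):.2f}x faster, "
          f"{time_saved(before[name], after[name]):.1%} less time")
```

On the totals this gives about 1.20x, which matches "20% faster" and corresponds to roughly 17% less wall time.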

Original comment by fbarch...@chromium.org on 7 Feb 2013 at 7:01

GoogleCodeExporter commented 9 years ago

Try bots say Android breaks with this patch.  Needs a little more work.

Original comment by fbarch...@chromium.org on 8 Feb 2013 at 6:19

GoogleCodeExporter commented 9 years ago

r566 checks in the initial code.
ARGBToY is complete.  ARGBToUV needs more work.

Original comment by fbarch...@chromium.org on 8 Feb 2013 at 11:05

GoogleCodeExporter commented 9 years ago

r567 removes vmovdqa from ARGBToUV.

Original comment by fbarch...@chromium.org on 8 Feb 2013 at 11:27

GoogleCodeExporter commented 9 years ago

r575 removes excess vpermq's.  5% faster.
SSSE3 4212 ms
AVX2 2964 ms

Original comment by fbarch...@google.com on 15 Feb 2013 at 6:59

Changed state: Fixed
