Conversion between ARGB and ABGR - Optimize for SSSE3

GoogleCodeExporter commented 9 years ago

The unaligned cases for the conversion between ARGB and ABGR are currently slow 
on the x86 Windows platform. The attached patch improves them.

Was:
ARGBToABGR_Unaligned (2094 ms)
ABGRToARGB_Unaligned (2094 ms)

Now:
ARGBToABGR_Unaligned (704 ms)
ABGRToARGB_Unaligned (688 ms)

Original issue reported on code.google.com by changjun...@intel.com on 6 Mar 2013 at 9:01

Attachments:

ABGRARGB.patch

GoogleCodeExporter commented 9 years ago

Thanks.  Most functions are optimized for the aligned case, as it helps Core 
and Atom performance, and there's not much good reason to allocate memory 
unaligned.
ARGB is often used for screencasting, however, so odd sizes are more common.
Is this coming up in practice, or is it just something noticed in the tests?
The downside of doing it is that the additional alignment checks slightly hurt 
the performance of the aligned case.
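
To make the cost concrete, here is a minimal sketch of the kind of per-call alignment check being discussed. The helper names `IsAligned16` and `CanUseAlignedRow` are hypothetical and are not taken from libyuv; the 16-byte boundary is assumed because that is the width of an SSE register.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero when p sits on a 16-byte boundary. */
static int IsAligned16(const void* p) {
  return ((uintptr_t)p & 15) == 0;
}

/* Hypothetical dispatcher check: the aligned row function is only safe
 * when both pointers and both strides keep every row on a 16-byte
 * boundary. Returns 1 for the aligned path, 0 for the unaligned one. */
static int CanUseAlignedRow(const uint8_t* src, int src_stride,
                            const uint8_t* dst, int dst_stride) {
  return IsAligned16(src) && IsAligned16(dst) &&
         (src_stride % 16) == 0 && (dst_stride % 16) == 0;
}
```

These four tests run once per conversion call, not per pixel, which is why the penalty on the aligned case is small but not zero.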

I'd prefer to do all ARGB functions at once.
Off the top of my head, there are 4 functions for completeness of ARGB to/from 
BGRA, ABGR and RGBA.  They are
ARGBToBGRA
ARGBToABGR
ARGBToRGBA
RGBAToARGB
And there are posix versions.  Overall there are 9 core RGB formats.  Besides ARGB, BGRA, ABGR and RGBA, they are
RGB24
RAW
RGB565
ARGB1555
ARGB4444
Once a function has an Unaligned version, it makes sense to add Any variations 
to handle odd widths.  Odd widths may have aligned rows via stride, but 
typically they don't.
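
A rough illustration of the Any idea, assuming a fast row function that only accepts widths that are a multiple of 4 pixels. All names here (`SwapRB_Row4`, `SwapRB_Row_C`, `SwapRB_Row_Any`) are hypothetical, and the "fast" function is plain C standing in for a SIMD routine.

```c
#include <stdint.h>

/* Plain C fallback: swaps bytes 0 and 2 of each 4-byte pixel, any width. */
static void SwapRB_Row_C(const uint8_t* src, uint8_t* dst, int width) {
  int x;
  for (x = 0; x < width; ++x) {
    uint8_t b0 = src[0], b1 = src[1], b2 = src[2], b3 = src[3];
    dst[0] = b2;
    dst[1] = b1;
    dst[2] = b0;
    dst[3] = b3;
    src += 4;
    dst += 4;
  }
}

/* Stand-in for a SIMD row function that handles 4 pixels (16 bytes)
 * per iteration; width must be a multiple of 4. */
static void SwapRB_Row4(const uint8_t* src, uint8_t* dst, int width) {
  int x;
  for (x = 0; x < width; x += 4) {
    SwapRB_Row_C(src + x * 4, dst + x * 4, 4);
  }
}

/* Any variant: fast path for the largest multiple of 4 pixels,
 * C fallback for the 0 to 3 leftover pixels. */
static void SwapRB_Row_Any(const uint8_t* src, uint8_t* dst, int width) {
  int n = width & ~3;
  if (n > 0) {
    SwapRB_Row4(src, dst, n);
  }
  if (width > n) {
    SwapRB_Row_C(src + n * 4, dst + n * 4, width - n);
  }
}
```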
I'd prefer to add row coalescing at the same time.  If width = stride, you can 
treat the image as a single row, which tends to be aligned.
I'd prefer to do AVX2 versions of these as well, which have free unaligned access.
The code for all of these is identical, aside from the constant.  Seems like a 
common function or macro would help.
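
As a sketch of what "identical aside from the constant" could look like in portable C: one row function parameterized by a 4-entry byte-index table. The function name `ShuffleRow_C` and the tables `kSwapRB` and `kReverse` are made up for illustration and are not the names used in libyuv. On SSSE3 the same idea maps onto the `pshufb` byte shuffle, with the table expanded into a 16-byte mask.

```c
#include <stdint.h>

/* shuffler[i] is the index (0..3) of the source byte, within each
 * 4-byte pixel, that is written to output byte i. */
static void ShuffleRow_C(const uint8_t* src, uint8_t* dst,
                         const uint8_t shuffler[4], int width) {
  int x;
  for (x = 0; x < width; ++x) {
    /* Read all four bytes first so in-place conversion also works. */
    uint8_t b0 = src[shuffler[0]];
    uint8_t b1 = src[shuffler[1]];
    uint8_t b2 = src[shuffler[2]];
    uint8_t b3 = src[shuffler[3]];
    dst[0] = b0;
    dst[1] = b1;
    dst[2] = b2;
    dst[3] = b3;
    src += 4;
    dst += 4;
  }
}

/* Swap bytes 0 and 2, as in ARGB <-> ABGR. */
static const uint8_t kSwapRB[4] = {2, 1, 0, 3};
/* Reverse all four bytes, as in ARGB <-> BGRA. */
static const uint8_t kReverse[4] = {3, 2, 1, 0};
```

With this shape, each named conversion becomes a one-line wrapper that passes its own table.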

Original comment by fbarch...@chromium.org on 7 Mar 2013 at 7:54

GoogleCodeExporter commented 9 years ago

Done in r595.

Before
>out\release\libyuv_unittest --gtest_filter=*ABGRToARGB* | sed "s/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g" | c:\cygwin\bin\sort -rn | grep ms
1881 - [       OK ] libyuvTest.ABGRToARGB_Unaligned (1881 ms)
293 - [       OK ] libyuvTest.ABGRToARGB_Any (293 ms)
290 - [       OK ] libyuvTest.ABGRToARGB_Invert (290 ms)
289 - [       OK ] libyuvTest.ABGRToARGB_Opt (289 ms)
281 - [       OK ] libyuvTest.ABGRToARGB_Random (281 ms)
[==========] 5 tests from 1 test case ran. (3034 ms total)

After
>out\release\libyuv_unittest --gtest_filter=*ABGRToARGB* | sed "s/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g" | c:\cygwin\bin\sort -rn | grep ms
306 - [       OK ] libyuvTest.ABGRToARGB_Unaligned (306 ms)
295 - [       OK ] libyuvTest.ABGRToARGB_Invert (295 ms)
291 - [       OK ] libyuvTest.ABGRToARGB_Any (291 ms)
290 - [       OK ] libyuvTest.ABGRToARGB_Opt (290 ms)
273 - [       OK ] libyuvTest.ABGRToARGB_Random (273 ms)
[==========] 5 tests from 1 test case ran. (1455 ms total)

Original comment by fbarch...@chromium.org on 8 Mar 2013 at 1:40

Changed state: Started

GoogleCodeExporter commented 9 years ago

Thanks for the quick merge.
The motivation came from the tests: the ARGB/ABGR unaligned cases were 
far slower than the others.
AVX2 would be the next step.

Original comment by changjun...@intel.com on 8 Mar 2013 at 1:53

GoogleCodeExporter commented 9 years ago

I've written a more general ARGBShuffler for AVX2
 https://webrtc-codereview.appspot.com/1171006

Original comment by fbarch...@chromium.org on 8 Mar 2013 at 3:51

GoogleCodeExporter commented 9 years ago

Fixed in r596.
Rewrote BGRAToARGB, ABGRToARGB, RGBAToARGB and ARGBToRGBA to use ARGBShuffle - 
less code, more variations.
Added AVX2
Added Unaligned_SSSE3
Any variations for SSSE3, AVX2 and Neon
Row coalescing - treat contiguous rows as a single row (width * height, 1).
Unrolled to do 2 at a time.
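
For readers unfamiliar with row coalescing, a minimal sketch follows. It assumes 4 bytes per pixel and uses hypothetical names (`SwapRB_Row`, `SwapRBPlane`); the return value, the number of row calls made, exists only to make the effect visible and is not part of any real API.

```c
#include <stdint.h>

/* Simple row function: swaps bytes 0 and 2 of each 4-byte pixel. */
static void SwapRB_Row(const uint8_t* src, uint8_t* dst, int width) {
  int x;
  for (x = 0; x < width; ++x) {
    uint8_t b0 = src[0], b1 = src[1], b2 = src[2], b3 = src[3];
    dst[0] = b2;
    dst[1] = b1;
    dst[2] = b0;
    dst[3] = b3;
    src += 4;
    dst += 4;
  }
}

/* Converts a width x height image. Strides are in bytes.
 * Returns how many times the row function was called. */
static int SwapRBPlane(const uint8_t* src, int src_stride,
                       uint8_t* dst, int dst_stride,
                       int width, int height) {
  int y;
  int calls = 0;
  /* Coalesce: with no padding between rows, the whole image is one
   * contiguous run of width * height pixels. */
  if (src_stride == width * 4 && dst_stride == width * 4) {
    width *= height;
    height = 1;
    src_stride = dst_stride = 0;
  }
  for (y = 0; y < height; ++y) {
    SwapRB_Row(src, dst, width);
    src += src_stride;
    dst += dst_stride;
    ++calls;
  }
  return calls;
}
```

Besides saving per-row call overhead, the coalesced case only has to deal with alignment and leftover pixels once, at the start and end of the buffer.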

Sandy Bridge performance
BGRAToARGB_Any (272 ms)
BGRAToARGB_Unaligned (281 ms)
BGRAToARGB_Invert (283 ms)
BGRAToARGB_Opt (272 ms)
BGRAToARGB_Random (279 ms)
ABGRToARGB_Any (264 ms)
ABGRToARGB_Unaligned (268 ms)
ABGRToARGB_Invert (265 ms)
ABGRToARGB_Opt (252 ms)
ABGRToARGB_Random (264 ms)
RGBAToARGB_Any (268 ms)
RGBAToARGB_Unaligned (280 ms)
RGBAToARGB_Invert (275 ms)
RGBAToARGB_Opt (260 ms)
RGBAToARGB_Random (257 ms)
ARGBToRGBA_Any (258 ms)
ARGBToRGBA_Unaligned (265 ms)
ARGBToRGBA_Invert (267 ms)
ARGBToRGBA_Opt (260 ms)
ARGBToRGBA_Random (265 ms)

Original comment by fbarch...@chromium.org on 8 Mar 2013 at 11:37

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

doh!  ios error:
row_neon.cc:1190:3: error: expected string literal
     : "cc", "memory", "q0", "d2" // Clobber List
     ^
    1 error generated.

Original comment by fbarch...@chromium.org on 9 Mar 2013 at 12:27

Changed state: Started

GoogleCodeExporter commented 9 years ago

Fixed in r597.
It was ARGBToBayer, which previously declared the shuffler wrong.

Original comment by fbarch...@chromium.org on 9 Mar 2013 at 12:33

Changed state: Fixed

Conversion between ARGB and ABGR - Optimize for SSSE3 #196