watery01 / libyuv

Automatically exported from code.google.com/p/libyuv

YUV420ToRGB565 function optimized for NEON #103

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
http://stackoverflow.com/questions/12184504/does-libyuv-have-a-yuv420torgb565-function-optimized-for-neon

Does libyuv have a YUV420ToRGB565 function optimized for NEON?

From what I see in libyuv sources there's a function I420ToRGB565 but it first 
converts to ARGB and only then to RGB565 and that last conversion is not 
NEON-optimized. 

Original issue reported on code.google.com by fbarch...@chromium.org on 29 Sep 2012 at 6:51

GoogleCodeExporter commented 9 years ago
The complete list of RGB formats supported at this time is:
ARGB, BGRA, ABGR, RGBA - I420/I422 convert directly to these.
RGB24, RAW - 2 step: I420ToARGB, then ARGBToRGB24.
RGB565, ARGB1555, ARGB4444 - 2 step: I420ToARGB, then ARGBToRGB565 etc.

On Neon it would be easy to do I420ToRGB24.  On SSSE3 it's not at all easy: to 
be fully efficient it needs to process 16 pixels at a time, reading 64 bytes of 
source to produce 48 bytes of destination.  A compromise might be a single row 
function that does 2 steps on SSSE3 and 1 step on Neon.
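
To make the 64-bytes-in, 48-bytes-out arithmetic concrete, here is a minimal scalar sketch of the ARGB to RGB24 step. It is not libyuv's code: the name argb_to_rgb24_row is hypothetical, and it assumes ARGB is laid out in memory as B, G, R, A and RGB24 as B, G, R.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar helper, not libyuv's implementation.
 * Assumption: ARGB is stored in memory as B, G, R, A (4 bytes per pixel)
 * and RGB24 as B, G, R (3 bytes per pixel). */
static void argb_to_rgb24_row(const uint8_t* src_argb, uint8_t* dst_rgb24,
                              int width) {
  assert(width >= 0);
  for (int x = 0; x < width; ++x) {
    dst_rgb24[0] = src_argb[0]; /* B */
    dst_rgb24[1] = src_argb[1]; /* G */
    dst_rgb24[2] = src_argb[2]; /* R */
    /* The alpha byte src_argb[3] is dropped. */
    src_argb += 4;
    dst_rgb24 += 3;
  }
}
```

With width = 16 this reads 16 * 4 = 64 bytes and writes 16 * 3 = 48 bytes. A SIMD version has to shuffle the 3-byte pixels across register boundaries, which is the part that is awkward on SSSE3.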

The RGB565 conversion is difficult for both SSSE3 and Neon.  The internals of 
I420ToARGB use SOA style: 3 registers, R, G and B.  An optimized ARGBToRGB565 
prefers AOS: one register containing 4 pixels of ARGB.  On Intel, the packing 
converts SOA to AOS, so the RGB565 conversion could be done there, except that 
it would run out of registers in 32 bit.  The performance would be roughly the 
same, but it would avoid some calling overhead, which is not significant on x86 
but more so on Arm.
For Neon, the 565 conversion will need an AOS conversion and new 565 packing 
(mask, shift and combine), which Neon won't do especially well.
The first step is ARGBToRGB565_NEON.  The second step is an I420ToRGB565 row 
function that calls both Neon functions.  The third step is to add RGB565 
output to the I420ToRGB Neon code.
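
For reference, the "mask, shift and combine" packing can be sketched in scalar C as below. This is an illustration only, not the Neon code: the names pack_rgb565 and argb_to_rgb565_row are hypothetical, and it assumes ARGB bytes in B, G, R, A order and a little-endian 16 bit RGB565 pixel with R in the top 5 bits, G in the middle 6 and B in the low 5.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: keep the top 5/6/5 bits of each channel and
 * combine them into one 16 bit value (assumed layout: RRRRRGGG GGGBBBBB). */
static uint16_t pack_rgb565(uint8_t b, uint8_t g, uint8_t r) {
  return (uint16_t)((b >> 3) | ((g >> 2) << 5) | ((r >> 3) << 11));
}

/* Hypothetical row function: converts `width` ARGB pixels (assumed B, G, R, A
 * in memory) to RGB565, written out as two little-endian bytes per pixel. */
static void argb_to_rgb565_row(const uint8_t* src_argb, uint8_t* dst_rgb565,
                               int width) {
  assert(width >= 0);
  for (int x = 0; x < width; ++x) {
    uint16_t p = pack_rgb565(src_argb[0], src_argb[1], src_argb[2]);
    dst_rgb565[0] = (uint8_t)(p & 0xff); /* low byte first */
    dst_rgb565[1] = (uint8_t)(p >> 8);
    src_argb += 4;
    dst_rgb565 += 2;
  }
}
```

The scalar form works on one AOS pixel at a time. A vector version that starts from separate R, G and B registers (SOA) has to do the same shifts per channel and then merge them into 16 bit lanes.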

Original comment by fbarch...@chromium.org on 30 Sep 2012 at 7:10

GoogleCodeExporter commented 9 years ago
r397 does RGB24/RAW in 1 step for Neon.

Original comment by fbarch...@google.com on 9 Oct 2012 at 1:22

GoogleCodeExporter commented 9 years ago
r438 fixes the test and improves performance with 2 step Neon: NV12ToARGB_NEON 
and ARGBToRGB565_NEON.

NV12ToRGB565_Opt (6249 ms)

Original comment by fbarch...@google.com on 23 Oct 2012 at 10:49

GoogleCodeExporter commented 9 years ago
6199 - [       OK ] libyuvTest.I420ToRGB565_Any (6199 ms)
6160 - [       OK ] libyuvTest.I420ToRGB565_Unaligned (6160 ms)
6126 - [       OK ] libyuvTest.I420ToRGB565_Opt (6126 ms)
6117 - [       OK ] libyuvTest.I420ToRGB565_Invert (6117 ms)

Original comment by fbarch...@google.com on 28 Oct 2012 at 9:56

GoogleCodeExporter commented 9 years ago
On Arm the 1 step Neon conversion is
I420ToRGB565_Any (3979 ms)
I420ToRGB565_Unaligned (3914 ms)
I420ToRGB565_Invert (3885 ms)
I420ToRGB565_Opt (3884 ms)

On x86 the 2 step function is
I420ToRGB565_Any (1722 ms)
I420ToRGB565_Unaligned (1501 ms)
I420ToRGB565_Invert (1478 ms)
I420ToRGB565_Opt (1479 ms)

x86 1 step unoptimized
I420ToRGB565_Any (32512 ms)
I420ToRGB565_Unaligned (32808 ms)
I420ToRGB565_Invert (32666 ms)
I420ToRGB565_Opt (32469 ms)

x86 1 step optimized
I420ToRGB565_Any (1623 ms)
I420ToRGB565_Unaligned (1519 ms)
I420ToRGB565_Invert (1436 ms)
I420ToRGB565_Opt (1428 ms)

x86 does not benefit much from 1 step conversion, and the low level code is 
more complex.
But 1 step conversion does not require a row buffer.
The mips version's performance will regress unless it is optimized.
A row function could be done that calls I422ToARGB and then ARGBToRGB565.
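
A rough scalar sketch of such a 2 step row wrapper follows, to show where the row buffer comes in. Nothing here is libyuv's actual code: all function names, the kMaxRowWidth limit, and the BT.601 limited-range fixed point coefficients are assumptions made for illustration, as are the B, G, R, A byte order and the little-endian RGB565 layout.

```c
#include <assert.h>
#include <stdint.h>

enum { kMaxRowWidth = 2048 }; /* hypothetical limit for the stack row buffer */

static uint8_t clamp255(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Step 1 (hypothetical): one I422 row to ARGB, one U/V sample per 2 pixels.
 * Assumes BT.601 limited-range coefficients in 8 bit fixed point. */
static void i422_to_argb_row(const uint8_t* y, const uint8_t* u,
                             const uint8_t* v, uint8_t* argb, int width) {
  for (int x = 0; x < width; ++x) {
    int c = 298 * (y[x] - 16);
    int d = u[x / 2] - 128;
    int e = v[x / 2] - 128;
    argb[0] = clamp255((c + 516 * d + 128) / 256);           /* B */
    argb[1] = clamp255((c - 100 * d - 208 * e + 128) / 256); /* G */
    argb[2] = clamp255((c + 409 * e + 128) / 256);           /* R */
    argb[3] = 255;                                           /* A */
    argb += 4;
  }
}

/* Step 2 (hypothetical): ARGB row to little-endian RGB565. */
static void argb_to_rgb565_row(const uint8_t* argb, uint8_t* dst, int width) {
  for (int x = 0; x < width; ++x) {
    uint16_t p = (uint16_t)((argb[0] >> 3) | ((argb[1] >> 2) << 5) |
                            ((argb[2] >> 3) << 11));
    dst[0] = (uint8_t)(p & 0xff);
    dst[1] = (uint8_t)(p >> 8);
    argb += 4;
    dst += 2;
  }
}

/* Wrapper: runs both steps through a temporary ARGB row buffer.
 * Returns 0 on success, -1 if the row does not fit the buffer. */
static int i422_to_rgb565_row(const uint8_t* y, const uint8_t* u,
                              const uint8_t* v, uint8_t* dst_rgb565,
                              int width) {
  uint8_t row[kMaxRowWidth * 4]; /* the row buffer a 1 step version avoids */
  assert(y && u && v && dst_rgb565);
  if (width < 0 || width > kMaxRowWidth) return -1;
  i422_to_argb_row(y, u, v, row, width);
  argb_to_rgb565_row(row, dst_rgb565, width);
  return 0;
}
```

The wrapper keeps the two existing low level steps untouched, at the cost of writing and re-reading width * 4 bytes of intermediate ARGB per row.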

Original comment by fbarch...@google.com on 28 Oct 2012 at 5:51

GoogleCodeExporter commented 9 years ago
x86 done as row wrapper function that calls I422ToARGB then ARGBToRGB565.
I420ToRGB565_Any (1626 ms)
I420ToRGB565_Unaligned (1523 ms)
I420ToRGB565_Invert (1506 ms)
I420ToRGB565_Opt (1515 ms)

Original comment by fbarch...@google.com on 28 Oct 2012 at 6:09

GoogleCodeExporter commented 9 years ago
Fixed in r452

Linux
I420ToRGB565_Any (1600 ms)
I420ToRGB565_Unaligned (1641 ms)
I420ToRGB565_Invert (1468 ms)
I420ToRGB565_Opt (1465 ms)

Windows
I420ToRGB565_Any (1624 ms)
I420ToRGB565_Unaligned (1523 ms)
I420ToRGB565_Invert (1476 ms)
I420ToRGB565_Opt (1446 ms)

Mac
I420ToRGB565_Opt (2166 ms)
I420ToRGB565_Unaligned (2363 ms)
I420ToRGB565_Invert (2153 ms)
I420ToRGB565_Any (2519 ms)

Arm
I420ToRGB565_Any (4125 ms)
I420ToRGB565_Unaligned (4101 ms)
I420ToRGB565_Invert (4071 ms)
I420ToRGB565_Opt (4066 ms)

Original comment by fbarch...@chromium.org on 29 Oct 2012 at 6:17