Optimization for ARGBToRAW_Unaligned

GoogleCodeExporter commented 9 years ago

We observed that ARGBToRAW_Unaligned performs poorly; here we propose a 
patch to improve it.
Test results on the Windows platform:

Before(r695):
ARGBToRAW_Any (539 ms)
ARGBToRAW_Unaligned (1660 ms)
ARGBToRAW_Invert (561 ms)
ARGBToRAW_Opt (537 ms)
ARGBToRAW_Random (63 ms)

After:
ARGBToRAW_Any (536 ms)
ARGBToRAW_Unaligned (545 ms)
ARGBToRAW_Invert (564 ms)
ARGBToRAW_Opt (541 ms)
ARGBToRAW_Random (59 ms)

Original issue reported on code.google.com by changjun...@intel.com on 17 May 2013 at 1:38

Attachments:

ARGBToRAW_Unaligned.patch

GoogleCodeExporter commented 9 years ago

Thanks for the patch!

RAW is pretty rare, but the code should be 90% identical to RGB24.  I'd prefer 
to do both at once.
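To illustrate why the two formats can share most of their code, here is a minimal scalar sketch. It assumes ARGB is stored in memory as B, G, R, A, that RGB24 keeps the B, G, R order, and that RAW reverses it to R, G, B. The name `argb_to_24bit_row` and the `swap_rb` flag are hypothetical and are not libyuv API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shared row converter: drops alpha and optionally swaps R and B.
 * Assumes ARGB bytes in memory are B, G, R, A (an assumption of this sketch). */
static void argb_to_24bit_row(const uint8_t* src_argb, uint8_t* dst,
                              int width, int swap_rb) {
  assert(width >= 0);
  for (int x = 0; x < width; ++x) {
    uint8_t b = src_argb[0];
    uint8_t g = src_argb[1];
    uint8_t r = src_argb[2];
    dst[0] = swap_rb ? r : b; /* RAW starts with R, RGB24 starts with B */
    dst[1] = g;
    dst[2] = swap_rb ? b : r;
    src_argb += 4;            /* step over B, G, R and the unused alpha */
    dst += 3;
  }
}

/* Thin wrappers: the two formats differ only in the flag. */
static void argb_to_raw_row(const uint8_t* src, uint8_t* dst, int width) {
  argb_to_24bit_row(src, dst, width, 1);
}

static void argb_to_rgb24_row(const uint8_t* src, uint8_t* dst, int width) {
  argb_to_24bit_row(src, dst, width, 0);
}
```

In a pshufb-based version the same idea would likely reduce to using a different shuffle mask for each format while keeping the rest of the loop unchanged.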

RAWToARGB performance matters for something like screencasting, but the reverse 
direction is less important, except perhaps in rendering.

Opt is performance critical, but Unaligned is not.  We can almost always 
guarantee buffers are aligned.  But for ARGB it's occasionally going to come up 
on clipping.

This patch does SSSE3 but not AVX2, so it hasn't improved performance over 
ARGBToRAW_Opt, but it has doubled the number of low-level functions.  I assume 
the slight slowdown on Opt is due to the extra overhead of checking for 
unaligned vs aligned. We could just do unaligned all the time, without 
introducing new code, which would hurt performance on Core2 and Atom, but not 
on Sandy Bridge.  But it might be more constructive to do AVX or AVX2, which 
guarantees the CPU does not slow down on unaligned moves?
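For reference, the aligned-vs-unaligned check mentioned here can be as small as a mask test on the pointers and strides. The sketch below only shows the selection logic. `is_aligned16`, `RowKind`, and `choose_row` are hypothetical names, and the 16-byte requirement is an assumption based on SSE aligned moves.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if p is a multiple of 16 bytes, as aligned SSE moves require. */
static int is_aligned16(const void* p) {
  return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical tags for the two low-level row variants. */
typedef enum { ROW_ALIGNED, ROW_UNALIGNED } RowKind;

/* Row y starts at base + y * stride, so every row is aligned only when both
 * the base pointer and the stride are multiples of 16. */
static RowKind choose_row(const uint8_t* src, int src_stride,
                          const uint8_t* dst, int dst_stride) {
  int aligned = is_aligned16(src) && (src_stride % 16 == 0) &&
                is_aligned16(dst) && (dst_stride % 16 == 0);
  return aligned ? ROW_ALIGNED : ROW_UNALIGNED;
}
```

The check runs once per image rather than once per pixel, so its cost is small but not zero, and it is paid on every call, including the aligned Opt case.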

The patch only does Visual C and would need a gcc port.

While you were looking, did you see any potential for optimization?
The pshufb packs 16 bytes to 12 and endian-swaps, so that's reasonably 
effective, but I think it doesn't pipeline and could be re-ordered?
But I'm not happy about the shifts used to combine.  Perhaps put the first 12 
in the upper 12, then for the 2nd 12 use the low 4 and high 8, and use a 
palignr to combine the first 12 and the next 4.
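As a rough illustration of the pack-then-combine step being discussed, here is a sketch using SSSE3 intrinsics. It shows the shift-and-OR style of combine, not the palignr alternative, and it is not the code from the patch. It assumes ARGB bytes are B, G, R, A in memory and RAW is R, G, B. The function names are hypothetical, and the SIMD path is only compiled when the compiler defines `__SSSE3__` (for example with `-mssse3` on GCC or Clang).

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */
#endif

/* Scalar reference: ARGB (B, G, R, A in memory) to RAW (R, G, B). */
static void argb_to_raw_c(const uint8_t* src, uint8_t* dst, int width) {
  for (int x = 0; x < width; ++x) {
    dst[3 * x + 0] = src[4 * x + 2];
    dst[3 * x + 1] = src[4 * x + 1];
    dst[3 * x + 2] = src[4 * x + 0];
  }
}

/* Converts 16 pixels (64 bytes in, 48 bytes out) per iteration. */
static void argb_to_raw_16(const uint8_t* src, uint8_t* dst, int width) {
  assert(width >= 0 && width % 16 == 0);
#if defined(__SSSE3__)
  /* Keep 12 of 16 bytes, reversed within each pixel; 0x80 zeroes the top 4. */
  static const uint8_t kMask[16] = {2, 1, 0, 6, 5, 4, 10, 9, 8, 14, 13, 12,
                                    0x80, 0x80, 0x80, 0x80};
  const __m128i mask = _mm_loadu_si128((const __m128i*)kMask);
  for (int x = 0; x < width; x += 16) {
    __m128i p0 = _mm_loadu_si128((const __m128i*)(src + 0));
    __m128i p1 = _mm_loadu_si128((const __m128i*)(src + 16));
    __m128i p2 = _mm_loadu_si128((const __m128i*)(src + 32));
    __m128i p3 = _mm_loadu_si128((const __m128i*)(src + 48));
    p0 = _mm_shuffle_epi8(p0, mask); /* 12 valid bytes, top 4 zero */
    p1 = _mm_shuffle_epi8(p1, mask);
    p2 = _mm_shuffle_epi8(p2, mask);
    p3 = _mm_shuffle_epi8(p3, mask);
    /* Combine 4 x 12 bytes into 3 x 16 bytes with byte shifts and ORs. */
    __m128i o0 = _mm_or_si128(p0, _mm_slli_si128(p1, 12));
    __m128i o1 = _mm_or_si128(_mm_srli_si128(p1, 4), _mm_slli_si128(p2, 8));
    __m128i o2 = _mm_or_si128(_mm_srli_si128(p2, 8), _mm_slli_si128(p3, 4));
    _mm_storeu_si128((__m128i*)(dst + 0), o0);
    _mm_storeu_si128((__m128i*)(dst + 16), o1);
    _mm_storeu_si128((__m128i*)(dst + 32), o2);
    src += 64;
    dst += 48;
  }
#else
  argb_to_raw_c(src, dst, width); /* fallback when SSSE3 is not enabled */
#endif
}
```

Each output register needs two shifts and an OR on top of the shuffle, which is the part a palignr-based combine would aim to reduce.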

Another issue with this function is that there is no SSE2 version.  A way to 
avoid pshufb is to unpack and use pshufw.

Or an ugly trick is to read ints and write ints unaligned, overwriting the 
alpha.  It has no special CPU requirements, and it's faster than a byte at a 
time, except that memory subsystems may be slow or cause exceptions: cache line 
splits, page splits, etc.
movbe could be used to do the endian swap on Atom.
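A minimal sketch of that int-at-a-time trick, assuming a little-endian CPU and ARGB stored as B, G, R, A in memory. The function name is hypothetical. `memcpy` stands in for the unaligned 32-bit loads and stores so the C stays well defined, and the last pixel is written byte by byte so nothing lands past the end of the destination.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Little-endian only (an assumption of this sketch). */
static void argb_to_raw_int_trick(const uint8_t* src, uint8_t* dst, int width) {
  assert(width >= 0);
  int x = 0;
  for (; x < width - 1; ++x) {
    uint32_t px;
    memcpy(&px, src + 4 * x, 4); /* unaligned read: px = A<<24|R<<16|G<<8|B */
    /* Reorder to R | G<<8 | B<<16, i.e. bytes R, G, B, 0 when stored. */
    uint32_t out = ((px >> 16) & 0xFFu) | (px & 0xFF00u) | ((px & 0xFFu) << 16);
    /* 4-byte write; the 4th byte is overwritten by the next pixel. */
    memcpy(dst + 3 * x, &out, 4);
  }
  if (width > 0) {
    /* Last pixel: plain byte stores, so the write stays inside dst. */
    dst[3 * x + 0] = src[4 * x + 2];
    dst[3 * x + 1] = src[4 * x + 1];
    dst[3 * x + 2] = src[4 * x + 0];
  }
}
```

Three of every four stores are misaligned, which is where the cache line and page split concerns come from.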

Original comment by fbarch...@chromium.org on 17 May 2013 at 3:49

GoogleCodeExporter commented 9 years ago

First, here is the proposed "unaligned all the time" patch, for both Windows 
and POSIX. AVX2 and further optimization would be the next step.

Original comment by changjun...@intel.com on 20 May 2013 at 10:37

Attachments:

unaligned.patch

GoogleCodeExporter commented 9 years ago

Okay, that's better, but it'll hurt performance on Core2 and not help on Sandy 
Bridge for Opt.
If we have AVX2 code for Haswell, wouldn't it be best to keep the old code as 
optimized for Core2 aligned?

Original comment by fbarch...@google.com on 21 May 2013 at 1:21

GoogleCodeExporter commented 9 years ago

Here is the code review
https://webrtc-codereview.appspot.com/1519004

Original comment by fbarch...@google.com on 21 May 2013 at 1:27

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Patch updated for:
ARGBToRAW
ARGBToRGB24
ARGBToRGB565
ARGBToARGB1555
ARGBToARGB4444

Please help to attach. Thanks!

Original comment by changjun...@intel.com on 21 May 2013 at 2:26

Attachments:

unaligned_all.patch

GoogleCodeExporter commented 9 years ago

Updated for AVX version of ARGBToRAW on Windows.

Original comment by changjun...@intel.com on 28 May 2013 at 9:04

Attachments:

ARGBToRAW_AVX.patch

GoogleCodeExporter commented 9 years ago

Is there a performance improvement in the Opt case?

Original comment by fbarch...@google.com on 28 May 2013 at 4:43

GoogleCodeExporter commented 9 years ago

Uploaded for review https://webrtc-codereview.appspot.com/1578004

Original comment by fbarch...@google.com on 28 May 2013 at 6:21

GoogleCodeExporter commented 9 years ago

Fixed in r785
Was
libyuvTest.ARGBToRAW_Unaligned (1096 ms)
libyuvTest.ARGBToRAW_Any (338 ms)
libyuvTest.ARGBToRAW_Invert (333 ms)
libyuvTest.ARGBToRAW_Opt (329 ms)
libyuvTest.ARGBToRGB24_Unaligned (1107 ms)
libyuvTest.ARGBToRGB24_Invert (340 ms)
libyuvTest.ARGBToRGB24_Opt (333 ms)
libyuvTest.ARGBToRGB24_Any (329 ms)

Now
ARGBToRAW_Unaligned (335 ms)
ARGBToRAW_Any (335 ms)
ARGBToRAW_Invert (329 ms)
ARGBToRAW_Opt (326 ms)
ARGBToRGB24_Unaligned (338 ms)
ARGBToRGB24_Any (335 ms)
ARGBToRGB24_Invert (334 ms)
ARGBToRGB24_Opt (328 ms)

Original comment by fbarch...@google.com on 11 Sep 2013 at 1:51

Changed state: Fixed

watery01 / libyuv

Optimization for ARGBToRAW_Unaligned #230