Closed GoogleCodeExporter closed 9 years ago
Thanks for the patch!
RAW is pretty rare, but the code should be 90% identical to RGB24. I'd prefer
to do both at once.
RAWToARGB is performance-critical for something like screencasting, but the
reverse direction is less important. Perhaps it matters in rendering.
Opt is performance-critical, but Unaligned is not. We can almost always
guarantee buffers are aligned. But for ARGB it's occasionally going to come up
on clipping.
This patch does SSSE3, but not AVX2, so it hasn't improved performance over
ARGBToOpt, but it has doubled the number of low-level functions. I assume the
slight slowdown on Opt is due to the extra overhead of checking for unaligned
vs aligned. We could just do unaligned all the time, without introducing new
code, which would hurt performance on Core2 and Atom, but not on Sandy Bridge.
But it might be more constructive to do AVX or AVX2, which guarantees the CPU
does not slow down on unaligned moves?
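To make the aligned vs unaligned check concrete, here is a minimal sketch of the kind of dispatch being discussed. The names `RowKind` and `ChooseRowKind` are hypothetical, not libyuv's API; the three kinds only mirror the `_Any`, `_Unaligned` and `_Opt` cases in the benchmark names, and the 16-pixel and 16-byte thresholds are assumptions for an SSE-width kernel.

```c
#include <assert.h>
#include <stdint.h>

/* True when pointer p sits on a 16-byte boundary. */
#define IS_ALIGNED16(p) ((((uintptr_t)(p)) & 15) == 0)

/* Hypothetical labels for the three row-function flavours. */
typedef enum { kRowAny, kRowUnaligned, kRowAligned } RowKind;

/* Pick a flavour once per image, so the per-row cost of the check is zero.
   Strides must also be multiples of 16, or later rows lose alignment. */
static RowKind ChooseRowKind(const uint8_t* src, int src_stride,
                             const uint8_t* dst, int dst_stride, int width) {
  assert(width > 0);
  if (width % 16 != 0) {
    return kRowAny; /* odd width: needs a kernel that handles the remainder */
  }
  if (IS_ALIGNED16(src) && src_stride % 16 == 0 &&
      IS_ALIGNED16(dst) && dst_stride % 16 == 0) {
    return kRowAligned; /* safe to use aligned loads and stores */
  }
  return kRowUnaligned; /* full-width SIMD, but with unaligned moves */
}
```

Doing unaligned all the time amounts to deleting the `kRowAligned` branch, which removes the check but gives up the aligned moves that older cores benefit from.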
The patch only does Visual C and would need a gcc port.
While you were looking, did you see any potential for optimization?
The pshufb packs 16 bytes to 12 and endian-swaps, so that's reasonably
effective, but I think it doesn't pipeline and could be reordered?
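For readers who have not used pshufb, the following is a scalar model of what that one instruction does here, not the patch itself. It assumes ARGB is stored as B,G,R,A in memory and RAW as R,G,B; the mask values and the function names `Pshufb16` and `ARGBToRAWRow_Sketch` are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of SSSE3 pshufb on one 16-byte register: each output byte is
   picked by index from the input, or zeroed when the mask's top bit is set. */
static void Pshufb16(const uint8_t* in, const uint8_t* mask, uint8_t* out) {
  for (int i = 0; i < 16; ++i) {
    out[i] = (mask[i] & 0x80) ? 0 : in[mask[i] & 0x0f];
  }
}

/* Assumed mask: for each of 4 pixels drop A and reverse B,G,R to R,G,B,
   leaving 12 packed bytes in the low end and 4 zero bytes on top. */
static const uint8_t kShuffleARGBToRAW[16] = {
    2, 1, 0, 6, 5, 4, 10, 9, 8, 14, 13, 12, 128, 128, 128, 128};

/* Convert a row 4 pixels at a time; width must be a multiple of 4. */
static void ARGBToRAWRow_Sketch(const uint8_t* src_argb, uint8_t* dst_raw,
                                int width) {
  assert(width % 4 == 0);
  for (int x = 0; x < width; x += 4) {
    uint8_t packed[16];
    Pshufb16(src_argb + 4 * x, kShuffleARGBToRAW, packed);
    memcpy(dst_raw + 3 * x, packed, 12); /* only the low 12 bytes are pixels */
  }
}
```

The pack and the swap come for free in the same shuffle; the cost that remains is stitching the 12-byte results into full 16-byte stores.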
But I'm not happy about the shifts used to combine. Perhaps put the first 12
in the upper 12, then for the 2nd 12 use the low 4 and high 8. Use a palignr
to combine the first 12 and the next 4.
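One possible reading of that layout, as a scalar sketch rather than real SIMD: each pshufb places its 12 bytes so that a single palignr between neighbouring registers yields a full 16-byte output, with no shifts or ORs. The helpers `Pshufb16`, `Palignr16`, `BuildMask` and `ARGBToRAW16_Sketch` are hypothetical models, and the B,G,R,A to R,G,B byte order is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of pshufb: select bytes by index, zero if mask bit 7 is set. */
static void Pshufb16(const uint8_t* in, const uint8_t* mask, uint8_t* out) {
  for (int i = 0; i < 16; ++i) {
    out[i] = (mask[i] & 0x80) ? 0 : in[mask[i] & 0x0f];
  }
}

/* Scalar model of palignr: concatenate hi:lo (32 bytes), shift right by imm
   bytes, keep the low 16 bytes. */
static void Palignr16(const uint8_t* hi, const uint8_t* lo, int imm,
                      uint8_t* out) {
  assert(imm >= 0 && imm <= 16);
  for (int i = 0; i < 16; ++i) {
    int j = i + imm;
    out[i] = (j < 16) ? lo[j] : hi[j - 16];
  }
}

/* Mask for register k (0..3). Packed byte j lands at j, or at j + 4 once
   j >= 4 * k. That gives: k=0 upper 12; k=1 low 4 + high 8;
   k=2 low 8 + high 4; k=3 low 12. */
static void BuildMask(int k, uint8_t* mask) {
  memset(mask, 0x80, 16);
  for (int j = 0; j < 12; ++j) {
    int pos = j + (j >= 4 * k ? 4 : 0);
    mask[pos] = (uint8_t)(4 * (j / 3) + (2 - j % 3)); /* drop A, swap to RGB */
  }
}

/* 16 ARGB pixels (64 bytes) in, 16 RAW pixels (48 bytes) out. */
static void ARGBToRAW16_Sketch(const uint8_t* src_argb, uint8_t* dst_raw) {
  uint8_t r[4][16];
  uint8_t mask[16];
  for (int k = 0; k < 4; ++k) {
    BuildMask(k, mask);
    Pshufb16(src_argb + 16 * k, mask, r[k]);
  }
  for (int k = 0; k < 3; ++k) { /* three palignr produce three full stores */
    Palignr16(r[k + 1], r[k], 4 * (k + 1), dst_raw + 16 * k);
  }
}
```

In this arrangement the first output is the 12 bytes from register 0 followed by the low 4 of register 1, which matches "combine the first 12 and the next 4"; whether it is faster than the shift version would need measuring.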
Another issue with this function is that there is no SSE2 version. A way to
avoid pshufb is to unpack and use pshufw.
Or an ugly trick is to read ints and write ints unaligned, overwriting the
alpha. There are no special requirements on the CPU, and it's faster than a
byte at a time, except that memory subsystems may be slow or cause exceptions:
cache line splits, page splits, etc.
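A minimal sketch of the int trick, for the RGB24 case where no byte swap is needed (the RAW case would also need an endian swap per pixel). It assumes ARGB is B,G,R,A in memory; the function name `ARGBToRGB24Row_IntTrick` is hypothetical, and `memcpy` stands in for the unaligned 32-bit moves so the C stays well defined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy 4 bytes per pixel but advance the destination by only 3, so each
   pixel's alpha byte is overwritten by the first byte of the next pixel. */
static void ARGBToRGB24Row_IntTrick(const uint8_t* src_argb,
                                    uint8_t* dst_rgb24, int width) {
  assert(width >= 0);
  int x = 0;
  for (; x < width - 1; ++x) {
    uint32_t pixel;
    memcpy(&pixel, src_argb + 4 * x, 4);  /* unaligned 32-bit load */
    memcpy(dst_rgb24 + 3 * x, &pixel, 4); /* unaligned 32-bit store */
  }
  if (x < width) {
    /* Last pixel: store 3 bytes only, so nothing is written past the row. */
    dst_rgb24[3 * x + 0] = src_argb[4 * x + 0];
    dst_rgb24[3 * x + 1] = src_argb[4 * x + 1];
    dst_rgb24[3 * x + 2] = src_argb[4 * x + 2];
  }
}
```

Three of every four stores here are misaligned by construction, which is where the cache line and page split concerns come from.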
movbe could be used to do the endian swap on Atom.
Original comment by fbarch...@chromium.org
on 17 May 2013 at 3:49
As a first step, I propose the unaligned-all-the-time patch for both Windows
and POSIX. AVX2 and further optimization would be the next step.
Original comment by changjun...@intel.com
on 20 May 2013 at 10:37
Attachments:
Okay, that's better, but it'll hurt performance on Core2 and not help on Sandy
Bridge for Opt.
If we have AVX2 code for Haswell, wouldn't it be best to keep the old code as
optimized for Core2 aligned?
Original comment by fbarch...@google.com
on 21 May 2013 at 1:21
Here is the code review
https://webrtc-codereview.appspot.com/1519004
Original comment by fbarch...@google.com
on 21 May 2013 at 1:27
patch updated for:
ARGBToRAW
ARGBToRGB24
ARGBToRGB565
ARGBToARGB1555
ARGBToARGB4444
Please help to attach. Thanks!
Original comment by changjun...@intel.com
on 21 May 2013 at 2:26
Attachments:
Updated for AVX version of ARGBToRAW on Windows.
Original comment by changjun...@intel.com
on 28 May 2013 at 9:04
Attachments:
Is there a performance improvement in the Opt case?
Original comment by fbarch...@google.com
on 28 May 2013 at 4:43
Uploaded for review https://webrtc-codereview.appspot.com/1578004
Original comment by fbarch...@google.com
on 28 May 2013 at 6:21
Fixed in r785
Was
libyuvTest.ARGBToRAW_Unaligned (1096 ms)
libyuvTest.ARGBToRAW_Any (338 ms)
libyuvTest.ARGBToRAW_Invert (333 ms)
libyuvTest.ARGBToRAW_Opt (329 ms)
libyuvTest.ARGBToRGB24_Unaligned (1107 ms)
libyuvTest.ARGBToRGB24_Invert (340 ms)
libyuvTest.ARGBToRGB24_Opt (333 ms)
libyuvTest.ARGBToRGB24_Any (329 ms)
Now
ARGBToRAW_Unaligned (335 ms)
ARGBToRAW_Any (335 ms)
ARGBToRAW_Invert (329 ms)
ARGBToRAW_Opt (326 ms)
ARGBToRGB24_Unaligned (338 ms)
ARGBToRGB24_Any (335 ms)
ARGBToRGB24_Invert (334 ms)
ARGBToRGB24_Opt (328 ms)
Original comment by fbarch...@google.com
on 11 Sep 2013 at 1:51
Original issue reported on code.google.com by
changjun...@intel.com
on 17 May 2013 at 1:38
Attachments: