watery01 / libyuv

Automatically exported from code.google.com/p/libyuv

ARGBToI444 is slow #148

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
build\release\libyuv_unittest --gtest_filter=*444*

SSSE3
ARGBToI444_Opt (18268 ms)

NEON
ARGBToI444_Opt (9570 ms)

Original issue reported on code.google.com by fbarch...@chromium.org on 4 Nov 2012 at 4:45

GoogleCodeExporter commented 9 years ago
Improved for next release

SSSE3
ARGBToI444_Unaligned (4765 ms)
ARGBToI444_Any (4444 ms)
ARGBToI444_Invert (4336 ms)
ARGBToI444_Opt (4277 ms)

NEON
ARGBToI444_Invert (9654 ms)
ARGBToI444_Opt (9544 ms)
ARGBToI444_Any (9525 ms)
ARGBToI444_Unaligned (9516 ms)

Original comment by fbarch...@chromium.org on 4 Nov 2012 at 5:04

GoogleCodeExporter commented 9 years ago
Windows version still slow.  More work needed:
ARGBToI444_Opt (18710 ms)
ARGBToI411_Opt (5147 ms)
ARGBToI422_Opt (861 ms)
ARGBToI420_Opt (650 ms)
ARGBToI400_Opt (451 ms)

Original comment by fbarch...@chromium.org on 4 Nov 2012 at 5:30

GoogleCodeExporter commented 9 years ago
Started
3671 - [  FAILED  ] libyuvTest.ARGBToI444_Opt (3671 ms)

Original comment by fbarch...@google.com on 6 Nov 2012 at 9:47

GoogleCodeExporter commented 9 years ago
With Y channel in Neon (r474)
ARGBToI444_Opt (9511 ms)

With UV channel also in Neon
ARGBToI444_Opt (3652 ms)

Original comment by fbarch...@google.com on 6 Nov 2012 at 6:45

GoogleCodeExporter commented 9 years ago
Any (odd width) versions fixed in r480

Was (r479)
ARGBToI444_Any (9656 ms)
ARGBToI422_Any (7639 ms)
ARGBToI411_Any (5598 ms)
ARGBToI420_Any (5713 ms)

Now
ARGBToI444_Any (3674 ms)
ARGBToI422_Any (3475 ms)
ARGBToI420_Any (3129 ms)
ARGBToI411_Any (3120 ms)
ARGBToI400_Any (1851 ms)

Original comment by fbarch...@chromium.org on 9 Nov 2012 at 10:45

GoogleCodeExporter commented 9 years ago
This issue does not look fully "fixed": ARGBToI444 still seems slow on x86 
platforms. I have attached a patch for your reference. The performance gain 
on one of the Ivy Bridge platforms after applying this patch is listed 
below:

Was (r537)
ARGBToI444_Any (2980 ms)
ARGBToI444_Unaligned (2979 ms)
ARGBToI444_Invert (3044 ms)
ARGBToI444_Opt (2956 ms)

Now
ARGBToI444_Any (692 ms)
ARGBToI444_Unaligned (716 ms)
ARGBToI444_Invert (763 ms)
ARGBToI444_Opt (665 ms)

Original comment by changjun...@intel.com on 15 Jan 2013 at 9:24

Attachments:

GoogleCodeExporter commented 9 years ago
Good catch.  On quick review the CL looks right.
There will need to be a posix port, but that's fine.

Original comment by fbarch...@google.com on 15 Jan 2013 at 10:27

GoogleCodeExporter commented 9 years ago
A quick benchmark (other things running) shows ARGBToI444 as one of the slower 
functions. See below

c:\src\libyuv\trunk>set LIBYUV_REPEAT=1000
c:\src\libyuv\trunk>out\release\libyuv_unittest --gtest_filter=*   | sed "s/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g"   | \cygwin\bin\sort -rn   | grep ms
16997 - [       OK ] libyuvTest.BenchmarkSsim_Opt (16997 ms)
7407 - [       OK ] libyuvTest.ARGBScaleTo1366x768_Bilinear (7407 ms)
7175 - [       OK ] libyuvTest.ARGBRotate90 (7175 ms)
7132 - [       OK ] libyuvTest.ARGBRotate270 (7132 ms)
4754 - [       OK ] libyuvTest.ARGBRotate270_Odd (4754 ms)
4717 - [       OK ] libyuvTest.ARGBRotate90_Odd (4717 ms)
3828 - [       OK ] libyuvTest.ARGBToI444_Opt (3828 ms)
3826 - [       OK ] libyuvTest.ARGBToI444_Invert (3826 ms)
3824 - [       OK ] libyuvTest.ARGBToI444_Any (3824 ms)
3812 - [       OK ] libyuvTest.ARGBToI444_Unaligned (3812 ms)
3809 - [       OK ] libyuvTest.ARGBScaleDownBy34_Bilinear (3809 ms)
3155 - [       OK ] libyuvTest.ARGBInterpolate255_Any (3155 ms)
3151 - [       OK ] libyuvTest.ARGBInterpolate0_Any (3151 ms)
3150 - [       OK ] libyuvTest.ARGBInterpolate192_Any (3150 ms)
3143 - [       OK ] libyuvTest.ARGBInterpolate64_Any (3143 ms)
3135 - [       OK ] libyuvTest.ARGBInterpolate128_Any (3135 ms)
3117 - [       OK ] libyuvTest.ARGBInterpolate0_Unaligned (3117 ms)
3108 - [       OK ] libyuvTest.ARGBInterpolate255_Unaligned (3108 ms)
3100 - [       OK ] libyuvTest.ARGBInterpolate192_Unaligned (3100 ms)
3084 - [       OK ] libyuvTest.ARGBInterpolate64_Unaligned (3084 ms)
3069 - [       OK ] libyuvTest.ARGBInterpolate128_Unaligned (3069 ms)
3049 - [       OK ] libyuvTest.ARGBAttenuate_Unaligned (3049 ms)
2970 - [       OK ] libyuvTest.ARGBScaleTo853x480_Bilinear (2970 ms)
2314 - [       OK ] libyuvTest.ScaleTo1366x768_Bilinear (2314 ms)
2312 - [       OK ] libyuvTest.ScaleTo1366x768_Box (2312 ms)
2168 - [       OK ] libyuvTest.TestARGBColorTable (2168 ms)
2002 - [       OK ] libyuvTest.ARGBToRGBA_Unaligned (2002 ms)
1947 - [       OK ] libyuvTest.BGRAToARGB_Unaligned (1947 ms)
1946 - [       OK ] libyuvTest.ARGBToBGRA_Unaligned (1946 ms)
1944 - [       OK ] libyuvTest.RGBAToARGB_Unaligned (1944 ms)
1944 - [       OK ] libyuvTest.ARGBToABGR_Unaligned (1944 ms)
1943 - [       OK ] libyuvTest.ABGRToARGB_Unaligned (1943 ms)
1837 - [       OK ] libyuvTest.BayerGRBGToI420_Invert (1837 ms)
1835 - [       OK ] libyuvTest.BayerRGGBToI420_Invert (1835 ms)
1834 - [       OK ] libyuvTest.BayerGRBGToI420_Opt (1834 ms)
1834 - [       OK ] libyuvTest.BayerBGGRToI420_Any (1834 ms)
1831 - [       OK ] libyuvTest.BayerRGGBToI420_Any (1831 ms)
1830 - [       OK ] libyuvTest.BayerRGGBToI420_Unaligned (1830 ms)
1830 - [       OK ] libyuvTest.BayerGRBGToI420_Unaligned (1830 ms)
1825 - [       OK ] libyuvTest.BayerGRBGToI420_Any (1825 ms)
1822 - [       OK ] libyuvTest.BayerBGGRToI420_Unaligned (1822 ms)
1817 - [       OK ] libyuvTest.BayerBGGRToI420_Opt (1817 ms)
1815 - [       OK ] libyuvTest.BayerBGGRToI420_Invert (1815 ms)
1814 - [       OK ] libyuvTest.BayerRGGBToI420_Opt (1814 ms)
1813 - [       OK ] libyuvTest.BayerGBRGToI420_Invert (1813 ms)
1805 - [       OK ] libyuvTest.BayerGBRGToI420_Opt (1805 ms)
1803 - [       OK ] libyuvTest.BayerGBRGToI420_Unaligned (1803 ms)
1793 - [       OK ] libyuvTest.BayerGBRGToI420_Any (1793 ms)
1757 - [       OK ] libyuvTest.I420ToI444_Unaligned (1757 ms)
1749 - [       OK ] libyuvTest.I420ToI444_Invert (1749 ms)
1745 - [       OK ] libyuvTest.I420ToI444_Opt (1745 ms)
1740 - [       OK ] libyuvTest.BayerRGGBToARGB_Invert (1740 ms)
1738 - [       OK ] libyuvTest.BayerGBRGToARGB_Invert (1738 ms)
1736 - [       OK ] libyuvTest.BayerRGGBToARGB_Any (1736 ms)
1735 - [       OK ] libyuvTest.BayerRGGBToARGB_Opt (1735 ms)
1734 - [       OK ] libyuvTest.BayerRGGBToARGB_Unaligned (1734 ms)
1734 - [       OK ] libyuvTest.BayerGBRGToARGB_Opt (1734 ms)
1733 - [       OK ] libyuvTest.BayerGBRGToARGB_Unaligned (1733 ms)
1731 - [       OK ] libyuvTest.BayerGBRGToARGB_Any (1731 ms)
1706 - [       OK ] libyuvTest.I420ToI444_Any (1706 ms)
1672 - [       OK ] libyuvTest.BayerGRBGToARGB_Invert (1672 ms)
1672 - [       OK ] libyuvTest.BayerBGGRToARGB_Invert (1672 ms)
1669 - [       OK ] libyuvTest.BayerGRBGToARGB_Opt (1669 ms)
1668 - [       OK ] libyuvTest.BayerGRBGToARGB_Unaligned (1668 ms)
1667 - [       OK ] libyuvTest.BayerBGGRToARGB_Unaligned (1667 ms)
1667 - [       OK ] libyuvTest.BayerBGGRToARGB_Opt (1667 ms)
1667 - [       OK ] libyuvTest.BayerBGGRToARGB_Any (1667 ms)
1665 - [       OK ] libyuvTest.BayerGRBGToARGB_Any (1665 ms)
1554 - [       OK ] libyuvTest.ARGBToI411_Unaligned (1554 ms)
1535 - [       OK ] libyuvTest.ARGBToI411_Any (1535 ms)
1526 - [       OK ] libyuvTest.ARGBToI411_Opt (1526 ms)
1496 - [       OK ] libyuvTest.ARGBToI411_Invert (1496 ms)
1496 - [       OK ] libyuvTest.ARGBToARGB1555_Unaligned (1496 ms)
1486 - [       OK ] libyuvTest.ARGBToARGB4444_Unaligned (1486 ms)
1452 - [       OK ] libyuvTest.ScaleTo1366x768_None (1452 ms)
1272 - [       OK ] libyuvTest.ARGBToUYVY_Any (1272 ms)
1263 - [       OK ] libyuvTest.ARGBToYUY2_Any (1263 ms)
1181 - [       OK ] libyuvTest.ARGBToRGB24_Unaligned (1181 ms)
1178 - [       OK ] libyuvTest.ARGBToRGB24_Any (1178 ms)
1161 - [       OK ] libyuvTest.ARGBToRAW_Unaligned (1161 ms)
1152 - [       OK ] libyuvTest.ARGBToRAW_Any (1152 ms)
1140 - [       OK ] libyuvTest.ARGBToRGB565_Unaligned (1140 ms)
1072 - [       OK ] libyuvTest.I420ToARGB1555_Any (1072 ms)
1072 - [       OK ] libyuvTest.ARGBScaleDownBy38_Bilinear (1072 ms)
1049 - [       OK ] libyuvTest.I420ToARGB1555_Unaligned (1049 ms)
1036 - [       OK ] libyuvTest.I420ToARGB1555_Opt (1036 ms)
1035 - [       OK ] libyuvTest.I420ToARGB1555_Invert (1035 ms)
1003 - [       OK ] libyuvTest.NV21ToRGB565_Any (1003 ms)
1000 - [       OK ] libyuvTest.NV12ToRGB565_Any (1000 ms)
988 - [       OK ] libyuvTest.I420ToRGB565_Any (988 ms)

Original comment by fbarch...@google.com on 15 Jan 2013 at 11:47

GoogleCodeExporter commented 9 years ago
Was the above benchmark data collected before or after merging the patch? And 
are there any further actions I need to take?

Original comment by changjun...@intel.com on 16 Jan 2013 at 2:05

GoogleCodeExporter commented 9 years ago
That was before.  ARGBToI444 was near the top.  This is after
d:\src\libyuv\trunk>out\release\libyuv_unittest --gtest_filter=*   | sed "s/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g"   | c:\cygwin\bin\sort -rn   | grep ms
17524 - [       OK ] libyuvTest.BenchmarkSsim_Opt (17524 ms)
7375 - [       OK ] libyuvTest.ARGBScaleTo1366x768_Bilinear (7375 ms)
6882 - [       OK ] libyuvTest.ARGBRotate270 (6882 ms)
6807 - [       OK ] libyuvTest.ARGBRotate90 (6807 ms)
4666 - [       OK ] libyuvTest.ARGBRotate270_Odd (4666 ms)
4652 - [       OK ] libyuvTest.ARGBRotate90_Odd (4652 ms)
3781 - [       OK ] libyuvTest.ARGBScaleDownBy34_Bilinear (3781 ms)
3048 - [       OK ] libyuvTest.ARGBInterpolate64_Unaligned (3048 ms)
3001 - [       OK ] libyuvTest.ARGBInterpolate128_Any (3001 ms)
2960 - [       OK ] libyuvTest.ARGBInterpolate192_Any (2960 ms)
2951 - [       OK ] libyuvTest.ARGBInterpolate64_Any (2951 ms)
2947 - [       OK ] libyuvTest.ARGBInterpolate255_Any (2947 ms)
2937 - [       OK ] libyuvTest.ARGBScaleTo853x480_Bilinear (2937 ms)
2924 - [       OK ] libyuvTest.ARGBInterpolate128_Unaligned (2924 ms)
2914 - [       OK ] libyuvTest.ARGBInterpolate192_Unaligned (2914 ms)
2911 - [       OK ] libyuvTest.ARGBInterpolate255_Unaligned (2911 ms)
2817 - [       OK ] libyuvTest.ARGBInterpolate0_Any (2817 ms)
2806 - [       OK ] libyuvTest.ARGBInterpolate0_Unaligned (2806 ms)
2718 - [       OK ] libyuvTest.ARGBAttenuate_Unaligned (2718 ms)
2300 - [       OK ] libyuvTest.ScaleTo1366x768_Box (2300 ms)
2297 - [       OK ] libyuvTest.ScaleTo1366x768_Bilinear (2297 ms)
1984 - [       OK ] libyuvTest.ARGBToRGBA_Unaligned (1984 ms)
1959 - [       OK ] libyuvTest.TestARGBColorTable (1959 ms)
1939 - [       OK ] libyuvTest.RGBAToARGB_Unaligned (1939 ms)
1937 - [       OK ] libyuvTest.BGRAToARGB_Unaligned (1937 ms)
1937 - [       OK ] libyuvTest.ABGRToARGB_Unaligned (1937 ms)
1936 - [       OK ] libyuvTest.ARGBToABGR_Unaligned (1936 ms)
1931 - [       OK ] libyuvTest.ARGBToBGRA_Unaligned (1931 ms)
1858 - [       OK ] libyuvTest.BayerGRBGToI420_Any (1858 ms)
1850 - [       OK ] libyuvTest.BayerGRBGToI420_Unaligned (1850 ms)
1845 - [       OK ] libyuvTest.BayerBGGRToI420_Any (1845 ms)
1842 - [       OK ] libyuvTest.BayerRGGBToI420_Invert (1842 ms)
1840 - [       OK ] libyuvTest.BayerRGGBToI420_Any (1840 ms)
1837 - [       OK ] libyuvTest.BayerRGGBToI420_Opt (1837 ms)
1836 - [       OK ] libyuvTest.BayerBGGRToI420_Invert (1836 ms)
1833 - [       OK ] libyuvTest.BayerRGGBToI420_Unaligned (1833 ms)
1833 - [       OK ] libyuvTest.BayerGRBGToI420_Invert (1833 ms)
1832 - [       OK ] libyuvTest.BayerGBRGToI420_Invert (1832 ms)
1832 - [       OK ] libyuvTest.BayerBGGRToI420_Unaligned (1832 ms)
1831 - [       OK ] libyuvTest.BayerGRBGToI420_Opt (1831 ms)
1831 - [       OK ] libyuvTest.BayerGBRGToI420_Any (1831 ms)
1830 - [       OK ] libyuvTest.BayerGBRGToI420_Opt (1830 ms)
1826 - [       OK ] libyuvTest.BayerGBRGToI420_Unaligned (1826 ms)
1826 - [       OK ] libyuvTest.BayerBGGRToI420_Opt (1826 ms)
1798 - [       OK ] libyuvTest.I420ToI444_Unaligned (1798 ms)
1759 - [       OK ] libyuvTest.I420ToI444_Invert (1759 ms)
1746 - [       OK ] libyuvTest.I420ToI444_Opt (1746 ms)
1735 - [       OK ] libyuvTest.I420ToI444_Any (1735 ms)
1712 - [       OK ] libyuvTest.BayerRGGBToARGB_Unaligned (1712 ms)
1708 - [       OK ] libyuvTest.BayerRGGBToARGB_Any (1708 ms)
1663 - [       OK ] libyuvTest.BayerRGGBToARGB_Invert (1663 ms)
1657 - [       OK ] libyuvTest.BayerBGGRToARGB_Invert (1657 ms)
1653 - [       OK ] libyuvTest.BayerBGGRToARGB_Unaligned (1653 ms)
1646 - [       OK ] libyuvTest.BayerBGGRToARGB_Any (1646 ms)
1642 - [       OK ] libyuvTest.BayerBGGRToARGB_Opt (1642 ms)
1614 - [       OK ] libyuvTest.BayerRGGBToARGB_Opt (1614 ms)
1596 - [       OK ] libyuvTest.BayerGBRGToARGB_Any (1596 ms)
1561 - [       OK ] libyuvTest.BayerGBRGToARGB_Unaligned (1561 ms)
1561 - [       OK ] libyuvTest.BayerGBRGToARGB_Invert (1561 ms)
1558 - [       OK ] libyuvTest.BayerGBRGToARGB_Opt (1558 ms)
1537 - [       OK ] libyuvTest.ARGBToI411_Unaligned (1537 ms)
1519 - [       OK ] libyuvTest.ARGBToI411_Any (1519 ms)
1514 - [       OK ] libyuvTest.ARGBToI411_Invert (1514 ms)
1510 - [       OK ] libyuvTest.ARGBToI411_Opt (1510 ms)
1497 - [       OK ] libyuvTest.BayerGRBGToARGB_Unaligned (1497 ms)
1494 - [       OK ] libyuvTest.BayerGRBGToARGB_Invert (1494 ms)
1490 - [       OK ] libyuvTest.ARGBToARGB1555_Unaligned (1490 ms)
1489 - [       OK ] libyuvTest.BayerGRBGToARGB_Opt (1489 ms)
1489 - [       OK ] libyuvTest.BayerGRBGToARGB_Any (1489 ms)
1481 - [       OK ] libyuvTest.ARGBToARGB4444_Unaligned (1481 ms)
1440 - [       OK ] libyuvTest.ScaleTo1366x768_None (1440 ms)
1274 - [       OK ] libyuvTest.ARGBToUYVY_Any (1274 ms)
1269 - [       OK ] libyuvTest.ARGBToYUY2_Any (1269 ms)
1178 - [       OK ] libyuvTest.ARGBToRGB24_Unaligned (1178 ms)
1174 - [       OK ] libyuvTest.ARGBToRGB24_Any (1174 ms)
1174 - [       OK ] libyuvTest.ARGBToRAW_Unaligned (1174 ms)
1173 - [       OK ] libyuvTest.ARGBToRAW_Any (1173 ms)
1139 - [       OK ] libyuvTest.ARGBToRGB565_Unaligned (1139 ms)
1077 - [       OK ] libyuvTest.ARGBScaleDownBy38_Bilinear (1077 ms)
1063 - [       OK ] libyuvTest.I420ToARGB1555_Any (1063 ms)
1046 - [       OK ] libyuvTest.NV21ToRGB565_Unaligned (1046 ms)
1039 - [       OK ] libyuvTest.I420ToARGB1555_Unaligned (1039 ms)
1027 - [       OK ] libyuvTest.I420ToARGB1555_Opt (1027 ms)
1015 - [       OK ] libyuvTest.I420ToARGB1555_Invert (1015 ms)
983 - [       OK ] libyuvTest.I420ToRGB565_Any (983 ms)
... with those that take 1 ms or more.

Note that ARGBToI411 is still slow.  But 444,422,420,400 are all fast.
Conversions ToI420 are more important, since everything needs to be I420 before 
encoding with VP8.
I420ToARGB comes up in rendering and effects.
One way to catch issues like ARGBToI444 is to run the _Opt versions with a 
profiler, and look for _C functions being called.
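
Why a _C function can show up under an _Opt test: a minimal sketch, assuming row functions are chosen through a function pointer that starts at the portable C version and is upgraded only when a SIMD variant is available. All names here (RowFn, DemoRow_C, DemoRow_SIMD, PickRow) are hypothetical stand-ins, not the library's API.

```cpp
#include <cstdint>

// Hypothetical row-function type and variants. Both variants copy the blue
// channel so the sketch stays runnable without SIMD hardware.
using RowFn = void (*)(const uint8_t* src_argb, uint8_t* dst, int width);

void DemoRow_C(const uint8_t* src_argb, uint8_t* dst, int width) {
  for (int x = 0; x < width; ++x) dst[x] = src_argb[x * 4];
}

void DemoRow_SIMD(const uint8_t* src_argb, uint8_t* dst, int width) {
  // A real variant would use SSSE3 or NEON; its output must match the C one.
  for (int x = 0; x < width; ++x) dst[x] = src_argb[x * 4];
}

// Start from the C row and upgrade only when a faster variant is both
// compiled in for this platform and supported by the CPU.
RowFn PickRow(bool cpu_has_simd, bool simd_compiled_in) {
  RowFn fn = DemoRow_C;
  if (simd_compiled_in && cpu_has_simd) fn = DemoRow_SIMD;
  return fn;
}
```

If the upgrade branch is missing for one platform, the conversion still produces correct output through the C row, so only a benchmark or a profiler reveals the gap.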

Original comment by fbarch...@google.com on 16 Jan 2013 at 3:52

GoogleCodeExporter commented 9 years ago
Confirmed it's still slow on Linux

fbarchard@g36:/usr/local/google/libyuv/trunk$ runyuv10 ARGBToI*
LIBYUV_REPEAT=1000 out/Release/libyuv_unittest --gtest_filter=*ARGBToI* | sed 's/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g' | sort -rn | grep ms
2430 - [       OK ] libyuvTest.ARGBToI444_Any (2430 ms)
2357 - [       OK ] libyuvTest.ARGBToI444_Unaligned (2357 ms)
2331 - [       OK ] libyuvTest.ARGBToI444_Invert (2331 ms)
2299 - [       OK ] libyuvTest.ARGBToI444_Opt (2299 ms)
1158 - [       OK ] libyuvTest.ARGBToI411_Unaligned (1158 ms)
1138 - [       OK ] libyuvTest.ARGBToI411_Invert (1138 ms)
1126 - [       OK ] libyuvTest.ARGBToI411_Opt (1126 ms)
1121 - [       OK ] libyuvTest.ARGBToI411_Any (1121 ms)
537 - [       OK ] libyuvTest.ARGBToI422_Opt (537 ms)
507 - [       OK ] libyuvTest.ARGBToI422_Invert (507 ms)
484 - [       OK ] libyuvTest.ARGBToI422_Unaligned (484 ms)
434 - [       OK ] libyuvTest.ARGBToI422_Any (434 ms)
382 - [       OK ] libyuvTest.ARGBToI420_Unaligned (382 ms)
347 - [       OK ] libyuvTest.ARGBToI420_Any (347 ms)
344 - [       OK ] libyuvTest.ARGBToI420_Opt (344 ms)
336 - [       OK ] libyuvTest.ARGBToI420_Invert (336 ms)
265 - [       OK ] libyuvTest.ARGBToI400_Unaligned (265 ms)
242 - [       OK ] libyuvTest.ARGBToI400_Any (242 ms)
240 - [       OK ] libyuvTest.ARGBToI400_Invert (240 ms)
236 - [       OK ] libyuvTest.ARGBToI400_Opt (236 ms)
18 - [       OK ] libyuvTest.ARGBToI400_Random (18 ms)
[----------] 21 tests from libyuvTest (18332 ms total)
[==========] 21 tests from 1 test case ran. (18332 ms total)
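
For readers without sed at hand, the sorting done by the pipeline above can be sketched in C++. This is an illustrative helper (SortByTime is a hypothetical name), assuming gtest result lines end in "(N ms)". Unlike the sed and grep pipeline, it drops summary lines such as "(18332 ms total)".

```cpp
#include <algorithm>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Extracts N from "(N ms)" in each gtest result line and returns
// (N, line) pairs sorted slowest first. Lines without a "(N ms)" timing,
// including "(N ms total)" summaries, are dropped.
std::vector<std::pair<int, std::string>> SortByTime(
    const std::vector<std::string>& lines) {
  static const std::regex kTiming(R"(\(([0-9]+) ms\))");
  std::vector<std::pair<int, std::string>> timed;
  for (const std::string& line : lines) {
    std::smatch m;
    if (std::regex_search(line, m, kTiming)) {
      timed.emplace_back(std::stoi(m[1].str()), line);
    }
  }
  // Descending by time; stable so equal timings keep their input order.
  std::stable_sort(
      timed.begin(), timed.end(),
      [](const std::pair<int, std::string>& a,
         const std::pair<int, std::string>& b) { return a.first > b.first; });
  return timed;
}
```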

Original comment by fbarch...@google.com on 16 Jan 2013 at 7:10

GoogleCodeExporter commented 9 years ago
r557 ports to Linux.  ARGBToI444 now optimized for SSSE3 and Neon.

Went from 2222ms to 465ms

Before:
LIBYUV_REPEAT=1000 out/Release/libyuv_unittest --gtest_filter=*ARGBToI4* | sed 's/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g' | sort -rn | grep ms
2255 - [       OK ] libyuvTest.ARGBToI444_Unaligned (2255 ms)
2240 - [       OK ] libyuvTest.ARGBToI444_Any (2240 ms)
2222 - [       OK ] libyuvTest.ARGBToI444_Opt (2222 ms)
2221 - [       OK ] libyuvTest.ARGBToI444_Invert (2221 ms)
1092 - [       OK ] libyuvTest.ARGBToI411_Unaligned (1092 ms)
1076 - [       OK ] libyuvTest.ARGBToI411_Invert (1076 ms)
1072 - [       OK ] libyuvTest.ARGBToI411_Opt (1072 ms)
1072 - [       OK ] libyuvTest.ARGBToI411_Any (1072 ms)
462 - [       OK ] libyuvTest.ARGBToI422_Unaligned (462 ms)
418 - [       OK ] libyuvTest.ARGBToI422_Any (418 ms)
408 - [       OK ] libyuvTest.ARGBToI422_Opt (408 ms)
406 - [       OK ] libyuvTest.ARGBToI422_Invert (406 ms)
391 - [       OK ] libyuvTest.ARGBToI420_Any (391 ms)
365 - [       OK ] libyuvTest.ARGBToI420_Unaligned (365 ms)
320 - [       OK ] libyuvTest.ARGBToI420_Invert (320 ms)
317 - [       OK ] libyuvTest.ARGBToI420_Opt (317 ms)
252 - [       OK ] libyuvTest.ARGBToI400_Unaligned (252 ms)
231 - [       OK ] libyuvTest.ARGBToI400_Opt (231 ms)
229 - [       OK ] libyuvTest.ARGBToI400_Invert (229 ms)
228 - [       OK ] libyuvTest.ARGBToI400_Any (228 ms)
18 - [       OK ] libyuvTest.ARGBToI400_Random (18 ms)
[----------] 21 tests from libyuvTest (17296 ms total)

After:
LIBYUV_REPEAT=1000 out/Release/libyuv_unittest --gtest_filter=*ARGBToI4* | sed 's/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g' | sort -rn | grep ms
1097 - [       OK ] libyuvTest.ARGBToI411_Opt (1097 ms)
1095 - [       OK ] libyuvTest.ARGBToI411_Unaligned (1095 ms)
1091 - [       OK ] libyuvTest.ARGBToI411_Any (1091 ms)
1084 - [       OK ] libyuvTest.ARGBToI411_Invert (1084 ms)
524 - [       OK ] libyuvTest.ARGBToI444_Unaligned (524 ms)
472 - [       OK ] libyuvTest.ARGBToI444_Any (472 ms)
470 - [       OK ] libyuvTest.ARGBToI422_Unaligned (470 ms)
465 - [       OK ] libyuvTest.ARGBToI444_Opt (465 ms)
460 - [       OK ] libyuvTest.ARGBToI444_Invert (460 ms)
418 - [       OK ] libyuvTest.ARGBToI422_Any (418 ms)
408 - [       OK ] libyuvTest.ARGBToI422_Opt (408 ms)
405 - [       OK ] libyuvTest.ARGBToI422_Invert (405 ms)
398 - [       OK ] libyuvTest.ARGBToI420_Any (398 ms)
361 - [       OK ] libyuvTest.ARGBToI420_Unaligned (361 ms)
318 - [       OK ] libyuvTest.ARGBToI420_Invert (318 ms)
314 - [       OK ] libyuvTest.ARGBToI420_Opt (314 ms)
253 - [       OK ] libyuvTest.ARGBToI400_Unaligned (253 ms)
232 - [       OK ] libyuvTest.ARGBToI400_Any (232 ms)
231 - [       OK ] libyuvTest.ARGBToI400_Opt (231 ms)
230 - [       OK ] libyuvTest.ARGBToI400_Invert (230 ms)
19 - [       OK ] libyuvTest.ARGBToI400_Random (19 ms)
[----------] 21 tests from libyuvTest (10345 ms total)
[==========] 21 tests from 1 test case ran. (10345 ms total)

Original comment by fbarch...@google.com on 4 Feb 2013 at 7:11

GoogleCodeExporter commented 9 years ago
Reopening to consider an alternative implementation.
The current code mimics ARGBToI420 etc.  Do the Y plane, and then do subsampled 
U and V planes together.  The rational is the subsampling and source memory 
savings outweight the poor destination memory behavior.
Whereas the Y plane benefits from a single pass that does only the Y plane - 
its more efficient than a single function that writes to 3 destinations.
Since 444 is not subsampled, there is not savings.
The math for the U and V planes is much like the Y plane.  It may actually be 
possible to use the same code - ARGBToY, with different coefficients.
UV are signed, so probably it would be ARGBToU and ARGBToV with the current UV 
code broken into 2 functions.
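
The "same code, different coefficients" idea can be sketched in scalar C++. The coefficient values are an assumption (common BT.601 studio-range fixed-point constants) and should be checked against the library's own tables. Coeffs and ARGBToPlaneRow are hypothetical names.

```cpp
#include <cstdint>

// Fixed-point weights for B, G, R plus a bias that folds in the offset
// (+16 for Y, +128 for U and V) and rounding. Assumed values, see above.
struct Coeffs {
  int b, g, r, bias;
};
constexpr Coeffs kToY = {25, 129, 66, 0x1080};
constexpr Coeffs kToU = {112, -74, -38, 0x8080};  // signed weights
constexpr Coeffs kToV = {-18, -94, 112, 0x8080};  // signed weights

// One row kernel reused for every plane by swapping the coefficient set.
// ARGB is stored in memory as B, G, R, A.
void ARGBToPlaneRow(const uint8_t* src_argb, uint8_t* dst, int width,
                    const Coeffs& c) {
  for (int x = 0; x < width; ++x) {
    const uint8_t* p = src_argb + x * 4;
    dst[x] = static_cast<uint8_t>(
        (c.b * p[0] + c.g * p[1] + c.r * p[2] + c.bias) >> 8);
  }
}
```

Calling it three times per row with kToY, kToU and kToV gives the three-plane layout. The only difference between planes is that the U and V weights are signed, which is what a SIMD version has to handle separately.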
Non-subsampled also lends itself to the 'last16' method of handling odd widths: do 
the first N*16 pixels with SSSE3, and then the last 16 with unaligned SSSE3, 
overlapping on some, so no C code is needed.
Since the code is pretty fast, and not commonly used, I've reduced the priority.
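
A minimal sketch of the 'last16' idea, assuming the width is at least 16 and the source and destination do not alias. Kernel16 is a hypothetical scalar stand-in for a 16-pixel unaligned SSSE3 kernel, and its Y coefficients are assumed for illustration.

```cpp
#include <cstdint>

// Hypothetical fixed-size kernel: converts exactly 16 ARGB pixels (memory
// order B, G, R, A) to 16 Y values.
void Kernel16(const uint8_t* src_argb, uint8_t* dst) {
  for (int x = 0; x < 16; ++x) {
    const uint8_t* p = src_argb + x * 4;
    dst[x] = static_cast<uint8_t>(
        (25 * p[0] + 129 * p[1] + 66 * p[2] + 0x1080) >> 8);
  }
}

// Handles any width >= 16 with the 16-pixel kernel only. Returns the number
// of kernel calls made.
int RowLast16(const uint8_t* src_argb, uint8_t* dst, int width) {
  int calls = 0;
  int x = 0;
  for (; x + 16 <= width; x += 16, ++calls) {
    Kernel16(src_argb + x * 4, dst + x);  // Full blocks of 16.
  }
  if (x < width) {
    // Remainder: redo the final 16 pixels, overlapping the previous block.
    Kernel16(src_argb + (width - 16) * 4, dst + (width - 16));
    ++calls;
  }
  return calls;
}
```

The overlap is harmless here because each output pixel depends only on its own input pixel, so recomputing a few of them writes the same values again. That property holds for 444 but not for subsampled chroma.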

Original comment by fbarch...@chromium.org on 8 Feb 2013 at 9:25

GoogleCodeExporter commented 9 years ago
Implemented the 3 plane approach

  for (int y = 0; y < height; ++y) {
//    ARGBToUV444Row(src_argb, dst_u, dst_v, width);
    ARGBToU444Row_SSSE3(src_argb, dst_u, width);
    ARGBToV444Row_SSSE3(src_argb, dst_v, width);
    ARGBToYRow(src_argb, dst_y, width);
    src_argb += src_stride_argb;
    dst_y += dst_stride_y;
    dst_u += dst_stride_u;
    dst_v += dst_stride_v;
  }

Performance is worse
Was ARGBToI444_Opt (515 ms)
Now ARGBToI444_Opt (543 ms)

Not a win, so will close the bug, but here is the code, in case it's useful in 
the future

// Convert 16 ARGB pixels (64 bytes) to 16 U values.
__declspec(naked) __declspec(align(16))
void ARGBToU444Row_SSSE3(const uint8* src_argb, uint8* dst_u, int width) {
  __asm {
    mov        eax, [esp + 4]   /* src_argb */
    mov        edx, [esp + 8]   /* dst_u */
    mov        ecx, [esp + 12]  /* width */
    movdqa     xmm5, kAddUV128
    movdqa     xmm4, kARGBToU

    align      16
 convertloop:
    movdqa     xmm0, [eax]
    movdqa     xmm1, [eax + 16]
    movdqa     xmm2, [eax + 32]
    movdqa     xmm3, [eax + 48]
    pmaddubsw  xmm0, xmm4
    pmaddubsw  xmm1, xmm4
    pmaddubsw  xmm2, xmm4
    pmaddubsw  xmm3, xmm4
    lea        eax, [eax + 64]
    phaddw     xmm0, xmm1
    phaddw     xmm2, xmm3
    psraw      xmm0, 8
    psraw      xmm2, 8
    packsswb   xmm0, xmm2
    paddb      xmm0, xmm5
    sub        ecx, 16
    movdqa     [edx], xmm0
    lea        edx, [edx + 16]
    jg         convertloop
    ret
  }
}

// Convert 16 ARGB pixels (64 bytes) to 16 V values.
__declspec(naked) __declspec(align(16))
void ARGBToV444Row_SSSE3(const uint8* src_argb, uint8* dst_v, int width) {
  __asm {
    mov        eax, [esp + 4]   /* src_argb */
    mov        edx, [esp + 8]   /* dst_v */
    mov        ecx, [esp + 12]  /* width */
    movdqa     xmm5, kAddUV128
    movdqa     xmm4, kARGBToV

    align      16
 convertloop:
    movdqa     xmm0, [eax]
    movdqa     xmm1, [eax + 16]
    movdqa     xmm2, [eax + 32]
    movdqa     xmm3, [eax + 48]
    pmaddubsw  xmm0, xmm4
    pmaddubsw  xmm1, xmm4
    pmaddubsw  xmm2, xmm4
    pmaddubsw  xmm3, xmm4
    lea        eax, [eax + 64]
    phaddw     xmm0, xmm1
    phaddw     xmm2, xmm3
    psraw      xmm0, 8
    psraw      xmm2, 8
    packsswb   xmm0, xmm2
    paddb      xmm0, xmm5
    sub        ecx, 16
    movdqa     [edx], xmm0
    lea        edx, [edx + 16]
    jg         convertloop
    ret
  }
}

I didn't try it, but glue code could be written to let Neon use the combined UV 
function while SSSE3 stays planar.
void ARGBToUV444Row_SSSE3(const uint8* src_argb, uint8* dst_u, uint8* dst_v,
                          int width) {
  ARGBToU444Row_SSSE3(src_argb, dst_u, width);
  ARGBToV444Row_SSSE3(src_argb, dst_v, width);
}

Original comment by fbarch...@google.com on 15 Feb 2013 at 7:29

GoogleCodeExporter commented 9 years ago
In the planar version I fixed a bug.  Since that won't be used, the current UV 
function needs the fix.  Use signed shift and pack:
    psraw      xmm0, 8
    psraw      xmm2, 8
    packsswb   xmm0, xmm2
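
The thread does not say which instruction was wrong before the fix, so the following is only an illustration of why the shift and the pack should agree in signedness. It models one 16-bit lane in scalar C++. The function names are hypothetical, and the sample value -9690 is an assumed weighted sum for a strongly red pixel.

```cpp
#include <cstdint>

// Scalar models of single-lane SSE operations.
// psraw: arithmetic shift right by 8. Right-shifting a negative value is
// arithmetic on mainstream compilers and guaranteed since C++20.
int16_t ShiftArith8(int16_t v) { return static_cast<int16_t>(v >> 8); }

// psrlw: logical shift right by 8.
int16_t ShiftLogic8(int16_t v) {
  return static_cast<int16_t>(static_cast<uint16_t>(v) >> 8);
}

// packsswb: saturate to [-128, 127], then keep the low byte.
uint8_t PackSigned(int16_t v) {
  int c = v < -128 ? -128 : (v > 127 ? 127 : v);
  return static_cast<uint8_t>(c);
}

// packuswb: saturate to [0, 255].
uint8_t PackUnsigned(int16_t v) {
  int c = v < 0 ? 0 : (v > 255 ? 255 : v);
  return static_cast<uint8_t>(c);
}

// paddb with 128: wrapping byte add that recentres the signed result.
uint8_t AddBias128(uint8_t v) { return static_cast<uint8_t>(v + 128); }
```

With the signed pair, -9690 becomes -38 after the shift, survives the pack, and ends at 90 after the bias. Mixing an arithmetic shift with an unsigned pack clamps the negative value to 0 (giving 128), and mixing a logical shift with a signed pack clamps 218 to 127 (giving 255).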

Original comment by fbarch...@google.com on 15 Feb 2013 at 7:36

GoogleCodeExporter commented 9 years ago
Fixed in r575

Original comment by fbarch...@google.com on 15 Feb 2013 at 6:58