Closed GoogleCodeExporter closed 9 years ago
As a rule of thumb, a libyuv function takes the fast path when the width is a
multiple of 16 and the image pointer and stride are also aligned to 16.
The pointer/stride alignment is less of a concern on Neon and AVX2, but most
functions are optimized for aligned widths.
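A minimal sketch of that rule of thumb, in case it helps callers decide whether
to pad. The helper names IsAligned16 and LikelyFastPath are made up for this
example and are not libyuv API; the real dispatch is per function and per CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not libyuv API): true when v is a multiple of 16. */
static bool IsAligned16(uintptr_t v) {
  return (v & 15) == 0;
}

/* Rule-of-thumb check only: width, pointer and stride are all multiples of
 * 16. Passing this check does not guarantee which code path libyuv picks. */
static bool LikelyFastPath(const uint8_t* ptr, int stride, int width) {
  return IsAligned16((uintptr_t)width) &&
         IsAligned16((uintptr_t)ptr) &&
         IsAligned16((uintptr_t)stride);
}
```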
Your suggested solution is overread/overwrite. libyuv will allow you to do
that, and it's a good solution.
Allocate extra pad bytes for rows and/or images.
Conversion functions check whether width == stride and, if so, treat the image
as one large row. You can do that yourself, and pad the total size out to
(width * height + 15) & ~15.
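Roughly, row coalescing plus the padded total looks like the sketch below.
CopyPlaneSketch and PaddedPlaneSize are illustrative names, not libyuv
functions, and memcpy stands in for whatever row function would be called.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical plane copy showing row coalescing: when both planes are
 * contiguous (stride == width), handle the image as one row of
 * width * height bytes. */
static void CopyPlaneSketch(const uint8_t* src, int src_stride,
                            uint8_t* dst, int dst_stride,
                            int width, int height) {
  if (src_stride == width && dst_stride == width) {
    width *= height;
    height = 1;
    src_stride = dst_stride = 0;
  }
  for (int y = 0; y < height; ++y) {
    memcpy(dst, src, (size_t)width); /* stand-in for a row function */
    src += src_stride;
    dst += dst_stride;
  }
}

/* Allocation size for a contiguous plane, rounded up to a multiple of 16. */
static size_t PaddedPlaneSize(int width, int height) {
  return ((size_t)width * (size_t)height + 15) & ~(size_t)15;
}
```

With the padded size, a coalesced row can be processed in whole 16-byte chunks
without leaving the allocation.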
Scaling can't be row coalesced, but you can allocate aligned rows.
Allocate buffers with stride = (width + 15) & ~15; and image_size = stride *
height;
In the case of scaling to 1/2, the destination needs to be a multiple of 16, so
source would be a multiple of 32.
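The stride arithmetic, as a small sketch. AlignedStride and AlignedImageSize
are names invented for this example; align is assumed to be a power of two
(16 for the general case, 32 for the source of a 1/2 scale).

```c
#include <stddef.h>

/* Round width up to the next multiple of align (align must be a power of 2). */
static int AlignedStride(int width, int align) {
  return (width + align - 1) & ~(align - 1);
}

/* Bytes needed for a plane whose rows are padded to the aligned stride. */
static size_t AlignedImageSize(int width, int height, int align) {
  return (size_t)AlignedStride(width, align) * (size_t)height;
}
```

For a 1276 wide source, AlignedStride(1276, 16) and AlignedStride(1276, 32)
both give 1280, so the half size destination row of 640 is still a multiple
of 16.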
I did experiment with overreads/overwrites, but it was deemed unsafe. So the 2
solutions I've come up with are 'any' functions, and row coalescing.
The 'any' functions on Intel still prefer an aligned pointer, but handle any
width by processing the largest multiple of 16 and then handling the remainder.
Most handle the remainder in C code, but some functions redo work on the last
16 pixels ('last16'), which is an overread/overwrite of data already processed,
but within the row.
This is supported for conversions, but not scaling.
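A sketch of the two remainder strategies. The row functions here are
placeholders written in plain C (InvertRow_16 stands in for a SIMD kernel that
requires a multiple of 16); none of these names are libyuv functions.

```c
#include <stdint.h>

/* Placeholder for a SIMD row kernel: assumes width is a multiple of 16. */
static void InvertRow_16(const uint8_t* src, uint8_t* dst, int width) {
  for (int x = 0; x < width; ++x) dst[x] = (uint8_t)(255 - src[x]);
}

/* Plain C row function that accepts any width. */
static void InvertRow_C(const uint8_t* src, uint8_t* dst, int width) {
  for (int x = 0; x < width; ++x) dst[x] = (uint8_t)(255 - src[x]);
}

/* 'any' wrapper, variant 1: multiple of 16 in the fast kernel, rest in C. */
static void InvertRow_Any(const uint8_t* src, uint8_t* dst, int width) {
  int n = width & ~15;
  if (n > 0) InvertRow_16(src, dst, n);
  if (width > n) InvertRow_C(src + n, dst + n, width - n);
}

/* 'any' wrapper, variant 2 ('last16'): redo the final 16 pixels with the
 * fast kernel. Stays inside the row, but src and dst must not alias,
 * because some pixels are computed twice. */
static void InvertRow_Last16(const uint8_t* src, uint8_t* dst, int width) {
  if (width < 16) {
    InvertRow_C(src, dst, width);
    return;
  }
  int n = width & ~15;
  InvertRow_16(src, dst, n);
  if (width > n) InvertRow_16(src + width - 16, dst + width - 16, 16);
}
```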
The unit tests check for overreads/overwrites by allocating images at the end
of a page, and are run through valgrind.
So the action item here is to implement scale_any.cc which has a wrapper for
each scale row function that handles odd sizes.
It's not hard, and it may even exist already for 1/2 size, since that comes up
in conversions/effects.
Original comment by fbarch...@chromium.org
on 21 Feb 2014 at 11:27
The best long-term solution will be to allow pointers to be unaligned, albeit
slower, and to allow width and/or stride to be 'any'.
Another user suggested an 'overread' mode, which was tried in the past. It's
efficient but dangerous, so the 'any' approach is preferred.
Also row coalescing was added to allow contiguous images to be handled
efficiently.
Changing nature of this bug to efficient odd width scaling support.
Original comment by fbarch...@google.com
on 28 Jul 2014 at 10:01
ScalePlaneDown2 also ranks highest in profiles of the scaler, and should be
AVX2 optimized.
Original comment by fbarch...@google.com
on 27 Nov 2014 at 1:49
r1345 supports odd width ScalePlaneDown2
set LIBYUV_WIDTH=1276
set LIBYUV_HEIGHT=720
set LIBYUV_REPEAT=3999
set LIBYUV_FLAGS=-1
out\release\libyuv_unittest_old --gtest_filter=*.ScaleDownBy2* | findstr ms
[ OK ] libyuvTest.ScaleDownBy2_None (1157 ms)
[ OK ] libyuvTest.ScaleDownBy2_Linear (2422 ms)
[ OK ] libyuvTest.ScaleDownBy2_Bilinear (2891 ms)
[ OK ] libyuvTest.ScaleDownBy2_Box (2891 ms)
out\release\libyuv_unittest --gtest_filter=*.ScaleDownBy2* | findstr ms
[ OK ] libyuvTest.ScaleDownBy2_None (422 ms)
[ OK ] libyuvTest.ScaleDownBy2_Linear (484 ms)
[ OK ] libyuvTest.ScaleDownBy2_Bilinear (625 ms)
[ OK ] libyuvTest.ScaleDownBy2_Box (625 ms)
set LIBYUV_WIDTH=1280
set LIBYUV_HEIGHT=720
set LIBYUV_REPEAT=3999
set LIBYUV_FLAGS=-1
out\release\libyuv_unittest --gtest_filter=*.ScaleDownBy2* | findstr ms
[ OK ] libyuvTest.ScaleDownBy2_None (343 ms)
[ OK ] libyuvTest.ScaleDownBy2_Linear (407 ms)
[ OK ] libyuvTest.ScaleDownBy2_Bilinear (500 ms)
[ OK ] libyuvTest.ScaleDownBy2_Box (500 ms)
Original comment by fbarch...@chromium.org
on 26 Mar 2015 at 6:11
ScalePlaneDown2 ported to AVX2
Was ScaleDownBy2_Box (500 ms)
Now ScaleDownBy2_Box (437 ms)
and for odd widths
Was ScaleDownBy2_Box (2890 ms)
Now ScaleDownBy2_Box (625 ms)
The known case where half size is slow is when it's not exactly half.
set LIBYUV_WIDTH=1276
ScaleDownBy2_Box (752 ms)
ScaleDownBy2_Bilinear (741 ms)
ScaleDownBy2_None (666 ms)
ScaleDownBy2_Linear (628 ms)
set LIBYUV_WIDTH=1278
ScaleDownBy2_Bilinear (1712 ms)
ScaleDownBy2_None_16 (1510 ms)
ScaleDownBy2_Linear (1395 ms)
ScaleDownBy2_None (1086 ms)
This is because the chroma channel is an odd width, and the half size version
of it produces a scale factor of 1.996875.
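The arithmetic behind that factor, as a sketch. It assumes I420-style chroma
planes whose width is the luma width halved and rounded up; the function names
are invented for this example.

```c
/* Chroma plane width for a given luma width, assuming (w + 1) / 2. */
static int ChromaWidth(int luma_width) {
  return (luma_width + 1) / 2;
}

/* Horizontal scale factor seen by the chroma planes when the luma plane is
 * scaled to half width. */
static double ChromaScaleFactor(int src_luma_width) {
  int dst_luma_width = src_luma_width / 2;
  int src_chroma = ChromaWidth(src_luma_width);
  int dst_chroma = ChromaWidth(dst_luma_width);
  return (double)src_chroma / (double)dst_chroma;
}
```

For width 1278 the chroma planes go from 639 to 320, and 639 / 320 = 1.996875.
For width 1276 they go from 638 to 319, which is exactly 2.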
Original comment by fbarch...@chromium.org
on 26 Mar 2015 at 10:55
r1390 does AVX2 and odd widths for ScaleDownBy4
set LIBYUV_WIDTH=1276
set LIBYUV_HEIGHT=720
set LIBYUV_REPEAT=3999
set LIBYUV_FLAGS=0
out\release\libyuv_unittest.exe --gtest_catch_exceptions=0 --gtest_filter=*.Scale*
odd set LIBYUV_WIDTH=1276
72 tests from 1 test case ran. (140435 ms total)
even set LIBYUV_WIDTH=1280
72 tests from 1 test case ran. (88639 ms total)
Original comment by fbarch...@google.com
on 30 Apr 2015 at 2:11
odd
72 tests from 1 test case ran. (340854 ms total)
even
72 tests from 1 test case ran. (293577 ms total)
Original comment by fbarch...@chromium.org
on 10 Jun 2015 at 9:25
scale down by 2 is running as expected now
odd
[ OK ] libyuvTest.ScaleDownBy2_None (924 ms)
[ OK ] libyuvTest.ScaleDownBy2_Linear (1146 ms)
[ OK ] libyuvTest.ScaleDownBy2_Bilinear (1430 ms)
[ OK ] libyuvTest.ScaleDownBy2_Box (1429 ms)
[----------] 4 tests from libyuvTest (4929 ms total)
even
[ OK ] libyuvTest.ScaleDownBy2_None (824 ms)
[ OK ] libyuvTest.ScaleDownBy2_Linear (1036 ms)
[ OK ] libyuvTest.ScaleDownBy2_Bilinear (1283 ms)
[ OK ] libyuvTest.ScaleDownBy2_Box (1102 ms)
[----------] 4 tests from libyuvTest (4246 ms total)
Original comment by fbarch...@chromium.org
on 10 Jun 2015 at 9:29
Original issue reported on code.google.com by
noah...@google.com
on 14 Feb 2014 at 7:54