Closed GoogleCodeExporter closed 8 years ago
Can any propose code or method that can implement in SIMD?
On both x86 and neon versions, I've isolated the code that converts 24 bit RGB
to 565.
This is from row_neon.cc:
#define ARGBTORGB565 \
"vshr.u8 d20, d20, #3 \n" /* B */ \
"vshr.u8 d21, d21, #2 \n" /* G */ \
"vshr.u8 d22, d22, #3 \n" /* R */ \
"vmovl.u8 q8, d20 \n" /* B */ \
"vmovl.u8 q9, d21 \n" /* G */ \
"vmovl.u8 q10, d22 \n" /* R */ \
"vshl.u16 q9, q9, #5 \n" /* G */ \
"vshl.u16 q10, q10, #11 \n" /* R */ \
"vorr q0, q8, q9 \n" /* BG */ \
"vorr q0, q0, q10 \n" /* BGR */
and the SSSE3 code
// Store 8 RGB565 values.
#define STORERGB565 __asm { \
/* Step 3: Weave into RRGB */ \
__asm punpcklbw xmm0, xmm1 /* BG */ \
__asm punpcklbw xmm2, xmm2 /* RR */ \
__asm movdqa xmm1, xmm0 \
__asm punpcklwd xmm0, xmm2 /* BGRR first 4 pixels */ \
__asm punpckhwd xmm1, xmm2 /* BGRR next 4 pixels */ \
/* Step 4: RRGB -> RGB565 */ \
__asm movdqa xmm3, xmm0 /* B first 4 pixels of argb */ \
__asm movdqa xmm2, xmm0 /* G */ \
__asm pslld xmm0, 8 /* R */ \
__asm psrld xmm3, 3 /* B */ \
__asm psrld xmm2, 5 /* G */ \
__asm psrad xmm0, 16 /* R */ \
__asm pand xmm3, xmm5 /* B */ \
__asm pand xmm2, xmm6 /* G */ \
__asm pand xmm0, xmm7 /* R */ \
__asm por xmm3, xmm2 /* BG */ \
__asm por xmm0, xmm3 /* BGR */ \
__asm movdqa xmm3, xmm1 /* B next 4 pixels of argb */ \
__asm movdqa xmm2, xmm1 /* G */ \
__asm pslld xmm1, 8 /* R */ \
__asm psrld xmm3, 3 /* B */ \
__asm psrld xmm2, 5 /* G */ \
__asm psrad xmm1, 16 /* R */ \
__asm pand xmm3, xmm5 /* B */ \
__asm pand xmm2, xmm6 /* G */ \
__asm pand xmm1, xmm7 /* R */ \
__asm por xmm3, xmm2 /* BG */ \
__asm por xmm1, xmm3 /* BGR */ \
__asm packssdw xmm0, xmm1 \
__asm movdqu [edx], xmm0 /* store 8 pixels of RGB565 */ \
__asm lea edx, [edx + 16] \
}
A first step might be a function that converts ARGBToRGB565 with dithering. If
it can be done a row at a time, the higher level could call I422ToARGBRow_NEON
then ARGBToRGB565Row_NEON with dithering and still be reasonably fast,
depending how the dithering is implemented.
Original comment by fbarch...@google.com
on 26 Feb 2015 at 6:57
FWIW I just optimized the AVX2 version of that:
Was
I420ToRGB565_Opt (1844 ms)
Now
I420ToRGB565_Opt (1203 ms)
(1.2 ms/frame for 720p)
Which is still slow compared to
I420ToARGB_Opt (735 ms)
Original comment by fbarch...@google.com
on 26 Feb 2015 at 8:46
r1302 adds ARGBToRGB565Dither() function.
As a first step you could call I420ToARGB then ARGBToRGB565Dither.
The dithering is an ordered 8x8 matrix with values 0 to 255.
If you pass NULL for the matrix, the following is used:
static const uint8 kDither8x8[64] = {
0, 128, 32, 160, 8, 136, 40, 168,
192, 64, 224, 96, 200, 72, 232, 104,
48, 176, 16, 144, 56, 184, 24, 152,
240, 112, 208, 80, 248, 120, 216, 88,
12, 140, 44, 172, 4, 132, 36, 164,
204, 76, 236, 108, 196, 68, 228, 100,
60, 188, 28, 156, 52, 180, 20, 148,
252, 124, 220, 92, 244, 116, 212, 84,
};
If you can, give it a try and see if it works for you visually.
I think this can be optimized into Neon, possibly for free.
Often when doing 16 bit fixed point, the final step is v = (v + 128) >> 8.
instead of 128, the dither matrix values could be put into a register, instead
of using a constant.
Or for YUV it may apply to the Y channel bias.
Current C version is 11.9x slower than undithered.
Original comment by fbarch...@google.com
on 27 Feb 2015 at 9:37
r1307 sse2 version ported for windows.
Original comment by fbarch...@google.com
on 4 Mar 2015 at 12:44
Can you confirm the code is performing as expected?
If so I can port it to YUV and arm.
Original comment by fbarch...@google.com
on 4 Mar 2015 at 3:07
Hi, sorry for reply late.
I used r1307 to test. I met some problems.
1. I passed the matrix kDither8x8 as above, the color of output image is wrong,
if I use NULL, it is ok.
2. The output picture became 4 the same picture(repeatedly, the picture is
just 1/4 the one picture(from left site)). What mistake did I make?
My test pattern is 960*540 yuv420,
The input parameters are, Y stride: 960, u stride:420, v stride:420
argb stride:540
rgb565 stride:270
However, I checked the 1/4 picture, I found the colored halo is improved. The
quality of the output picture is improved.
Original comment by michaely...@gmail.com
on 4 Mar 2015 at 9:39
Re stride
If your I420 is 960 x 540 - landscape, no rotation, then the ARGB should also
be 960 x 540 with ARGB stride of 960 * 4 and rgb565 stride of 960 * 2.
Re matrix/NULL
The matrix above has been modified to suit 565.
// Ordered 8x8 dither for 888 to 565. Values from 0 to 7.
static const uint8 kDither565_8x8[64] = {
0 >> 5, 128 >> 5, 32 >> 5, 160 >> 5, 8 >> 5, 136 >> 5, 40 >> 5, 168 >> 5,
192 >> 5, 64 >> 5, 224 >> 5, 96 >> 5, 200 >> 5, 72 >> 5, 232 >> 5, 104 >> 5,
48 >> 5, 176 >> 5, 16 >> 5, 144 >> 5, 56 >> 5, 184 >> 5, 24 >> 5, 152 >> 5,
240 >> 5, 112 >> 5, 208 >> 5, 80 >> 5, 248 >> 5, 120 >> 5, 216 >> 5, 88 >> 5,
12 >> 5, 140 >> 5, 44 >> 5, 172 >> 5, 4 >> 5, 132 >> 5, 36 >> 5, 164 >> 5,
204 >> 5, 76 >> 5, 236 >> 5, 108 >> 5, 196 >> 5, 68 >> 5, 228 >> 5, 100 >> 5,
60 >> 5, 188 >> 5, 28 >> 5, 156 >> 5, 52 >> 5, 180 >> 5, 20 >> 5, 148 >> 5,
252 >> 5, 124 >> 5, 220 >> 5, 92 >> 5, 244 >> 5, 116 >> 5, 212 >> 5, 84 >> 5,
};
A 16x16 would be a little nicer, since it has the full 256 unique values. But
most SIMD instruction sets use 16 byte registers and most functions, such as
I420ToARGB use 16 bit math, so they do 8 pixels at a time. So an 8x8 was done,
which can be implemented efficiently for Neon and SSE2.
Next steos are the AVX2, and then I420ToRGB565 as a 2 step function internally.
Original comment by fbarch...@google.com
on 4 Mar 2015 at 6:28
r1309 ports to AVX2 (on Windows)
AVX2
ARGBToRGB565_Opt (531 ms)
ARGBToRGB565Dither_Opt (625 ms)
SSE2
ARGBToRGB565_Opt (797 ms)
ARGBToRGB565Dither_Opt (813 ms)
A performance difference is noted, which is due to dithering doing to the image
row by row, but undithered does row coalescing. If row coalescing is turned
off, both are similar speed.
Next step is I420ToRGB565.
Original comment by fbarch...@google.com
on 4 Mar 2015 at 10:34
The table for 565 is:
0, 4, 1, 5, 0, 4, 1, 5,
6, 2, 7, 3, 6, 2, 7, 3,
1, 5, 0, 4, 1, 5, 0, 4,
7, 3, 6, 2, 7, 3, 6, 2,
0, 4, 1, 5, 0, 4, 1, 5,
6, 2, 7, 3, 6, 2, 7, 3,
1, 5, 0, 4, 1, 5, 0, 4,
7, 3, 6, 2, 7, 3, 6, 2,
since 4x4 repeats and is sufficient for 4 bits dithered to 8 bits, I plan to
reduce the pattern to 4x4, which will save a register and less calling overhead.
Original comment by fbarch...@google.com
on 4 Mar 2015 at 11:43
I have tested the updated matrix you provided, it works.
However, I still met a problem about stride
My test pattern is yuv420 (960*540)
I use the function(ConvertFromI420)to do YUV to RGB565 conversion. It's ok! (I
describe parameters as following)
ConvertFromI420(src_y_data, 960,src_u_data, 480,src_v_data, 480, target_rgb,
1920, 960, 540, libyuv::FOURCC_RGBP);
But I use I420ToARGB then ARGBToRGB565Dither(I describe parameters as
following), it crashed. What mistake did I make?
I420ToARGBConvertFromI420(src_y_data, 960,src_u_data, 480,src_v_data, 480,
target_rgb, 3840, 960, 540);
I420ToARGBConvertFromI420(src_rgb_data, 3840, dst_rgb_data,1920, NULL, 960,540);
If I chanage argb stride from 3840 to 1920, it works. but it will display 2 1/2
pictures.(I describe parameters as following)
I420ToARGBConvertFromI420(src_y_data, 960,src_u_data, 480,src_v_data, 480,
target_rgb, 1920, 960, 540);
I420ToARGBConvertFromI420(src_rgb_data, 1920, dst_rgb_data,1920, NULL, 960,540);
Original comment by michaely...@gmail.com
on 5 Mar 2015 at 2:30
switching to 4x4
AVX2
ARGBToRGB565_Opt (531 ms)
ARGBToRGB565Dither_Opt (610 ms)
SSE2
ARGBToRGB565_Opt (782 ms)
ARGBToRGB565Dither_Opt (860 ms)
Original comment by fbarch...@google.com
on 7 Mar 2015 at 12:38
I420ToRGB565(src_y_data, 960, src_u_data, 480, src_v_data, 480, dst_rgb_data,
1920, 960, 540);
is equivalent to
I420ToARGB(src_y_data, 960, src_u_data, 480, src_v_data, 480, target_rgb, 3840,
960, 540);
ARGBToRGB565(src_rgb_data, 3840, dst_rgb_data, 1920, 960, 540);
and then dither with
I420ToARGB(src_y_data, 960, src_u_data, 480, src_v_data, 480, target_rgb, 3840,
960, 540);
ARGBToRGB565Dither(src_rgb_data, 3840, dst_rgb_data, 1920, NULL, 960, 540);
Original comment by fbarch...@google.com
on 7 Mar 2015 at 12:44
target_rgb should be src_rgb_data above. Its a temporary 32 bit buffer.
Original comment by fbarch...@google.com
on 7 Mar 2015 at 12:46
a 2 step function is written, but untested. up for review
https://webrtc-codereview.appspot.com/48429004
Original comment by fbarch...@google.com
on 7 Mar 2015 at 4:00
I420ToRGB565Dither() should be functional now, and partially optimized.
AVX2:
ARGBToRGB565_Opt (531 ms)
ARGBToRGB565Dither_Opt (578 ms)
SSSE3:
ARGBToRGB565_Opt (781 ms)
ARGBToRGB565Dither_Opt (860 ms)
Original comment by fbarch...@google.com
on 10 Mar 2015 at 10:56
Original comment by fbarch...@chromium.org
on 16 Mar 2015 at 2:31
Original issue reported on code.google.com by
michaely...@gmail.com
on 26 Feb 2015 at 7:51