Recommend support dithering in the function "ConvertFromI420"

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Use ConvertFromI420 to do yuv420 to rgb565 conversion (Both C and NEON 
version)
2.
3.

What is the expected output? What do you see instead?
The function ConvertFromI420 for rgb565 will produce colored halo on the output 
picture. It can be eliminate the phenomenon.

What version of the product are you using? On what operating system?
1. The up-to-date version
2. Android 4.4.2

Please provide any additional information below.

Original issue reported on code.google.com by michaely...@gmail.com on 26 Feb 2015 at 7:51

GoogleCodeExporter commented 8 years ago

Can any propose code or method that can implement in SIMD?
On both x86 and neon versions, I've isolated the code that converts 24 bit RGB 
to 565.

This is from row_neon.cc:
#define ARGBTORGB565                                                           \
    "vshr.u8    d20, d20, #3                   \n"  /* B                    */ \
    "vshr.u8    d21, d21, #2                   \n"  /* G                    */ \
    "vshr.u8    d22, d22, #3                   \n"  /* R                    */ \
    "vmovl.u8   q8, d20                        \n"  /* B                    */ \
    "vmovl.u8   q9, d21                        \n"  /* G                    */ \
    "vmovl.u8   q10, d22                       \n"  /* R                    */ \
    "vshl.u16   q9, q9, #5                     \n"  /* G                    */ \
    "vshl.u16   q10, q10, #11                  \n"  /* R                    */ \
    "vorr       q0, q8, q9                     \n"  /* BG                   */ \
    "vorr       q0, q0, q10                    \n"  /* BGR                  */

and the SSSE3 code
// Store 8 RGB565 values.
#define STORERGB565 __asm {                                                    \
    /* Step 3: Weave into RRGB */                                              \
    __asm punpcklbw  xmm0, xmm1           /* BG */                             \
    __asm punpcklbw  xmm2, xmm2           /* RR */                             \
    __asm movdqa     xmm1, xmm0                                                \
    __asm punpcklwd  xmm0, xmm2           /* BGRR first 4 pixels */            \
    __asm punpckhwd  xmm1, xmm2           /* BGRR next 4 pixels */             \
    /* Step 4: RRGB -> RGB565 */                                               \
    __asm movdqa     xmm3, xmm0    /* B  first 4 pixels of argb */             \
    __asm movdqa     xmm2, xmm0    /* G */                                     \
    __asm pslld      xmm0, 8       /* R */                                     \
    __asm psrld      xmm3, 3       /* B */                                     \
    __asm psrld      xmm2, 5       /* G */                                     \
    __asm psrad      xmm0, 16      /* R */                                     \
    __asm pand       xmm3, xmm5    /* B */                                     \
    __asm pand       xmm2, xmm6    /* G */                                     \
    __asm pand       xmm0, xmm7    /* R */                                     \
    __asm por        xmm3, xmm2    /* BG */                                    \
    __asm por        xmm0, xmm3    /* BGR */                                   \
    __asm movdqa     xmm3, xmm1    /* B  next 4 pixels of argb */              \
    __asm movdqa     xmm2, xmm1    /* G */                                     \
    __asm pslld      xmm1, 8       /* R */                                     \
    __asm psrld      xmm3, 3       /* B */                                     \
    __asm psrld      xmm2, 5       /* G */                                     \
    __asm psrad      xmm1, 16      /* R */                                     \
    __asm pand       xmm3, xmm5    /* B */                                     \
    __asm pand       xmm2, xmm6    /* G */                                     \
    __asm pand       xmm1, xmm7    /* R */                                     \
    __asm por        xmm3, xmm2    /* BG */                                    \
    __asm por        xmm1, xmm3    /* BGR */                                   \
    __asm packssdw   xmm0, xmm1                                                \
    __asm movdqu     [edx], xmm0   /* store 8 pixels of RGB565 */              \
    __asm lea        edx, [edx + 16]                                           \
  }

A first step might be a function that converts ARGBToRGB565 with dithering.  If 
it can be done a row at a time, the higher level could call I422ToARGBRow_NEON 
then ARGBToRGB565Row_NEON with dithering and still be reasonably fast, 
depending how the dithering is implemented.

Original comment by fbarch...@google.com on 26 Feb 2015 at 6:57

GoogleCodeExporter commented 8 years ago

FWIW I just optimized the AVX2 version of that:

Was
I420ToRGB565_Opt (1844 ms)

Now
I420ToRGB565_Opt (1203 ms)

(1.2 ms/frame for 720p)

Which is still slow compared to
I420ToARGB_Opt (735 ms)

Original comment by fbarch...@google.com on 26 Feb 2015 at 8:46

GoogleCodeExporter commented 8 years ago

r1302 adds ARGBToRGB565Dither() function.
As a first step you could call I420ToARGB then ARGBToRGB565Dither.
The dithering is an ordered 8x8 matrix with values 0 to 255.

If you pass NULL for the matrix, the following is used:
static const uint8 kDither8x8[64] = {
  0, 128, 32, 160,  8, 136, 40, 168,
  192, 64, 224, 96, 200, 72, 232, 104,
  48, 176, 16, 144, 56, 184, 24, 152,
  240, 112, 208, 80, 248, 120, 216, 88,
  12, 140, 44, 172,  4, 132, 36, 164,
  204, 76, 236, 108, 196, 68, 228, 100,
  60, 188, 28, 156, 52, 180, 20, 148,
  252, 124, 220, 92, 244, 116, 212, 84,
};

If you can, give it a try and see if it works for you visually.
I think this can be optimized into Neon, possibly for free.
Often when doing 16 bit fixed point, the final step is v = (v + 128) >> 8.
instead of 128, the dither matrix values could be put into a register, instead 
of using a constant.
Or for YUV it may apply to the Y channel bias.
Current C version is 11.9x slower than undithered.

Original comment by fbarch...@google.com on 27 Feb 2015 at 9:37

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

r1307 sse2 version ported for windows.

Original comment by fbarch...@google.com on 4 Mar 2015 at 12:44

Changed state: Started

GoogleCodeExporter commented 8 years ago

Can you confirm the code is performing as expected?

If so I can port it to YUV and arm.

Original comment by fbarch...@google.com on 4 Mar 2015 at 3:07

GoogleCodeExporter commented 8 years ago

Hi, sorry for reply late.

I used r1307 to test. I met some problems.
1. I passed the matrix kDither8x8 as above, the color of output image is wrong, 
if I use NULL, it is ok.
2. The output picture became 4  the same picture(repeatedly, the picture is 
just 1/4 the one picture(from left site)). What mistake did I make?
My test pattern is 960*540 yuv420,  
The input parameters are, Y stride: 960, u stride:420, v stride:420
argb stride:540
rgb565 stride:270

However,  I checked the 1/4 picture, I found the colored halo is improved. The 
quality of the output picture is improved.

Original comment by michaely...@gmail.com on 4 Mar 2015 at 9:39

GoogleCodeExporter commented 8 years ago

Re stride
If your I420 is 960 x 540 - landscape, no rotation, then the ARGB should also 
be 960 x 540 with ARGB stride of 960 * 4 and rgb565 stride of 960 * 2.

Re matrix/NULL
The matrix above has been modified to suit 565.

// Ordered 8x8 dither for 888 to 565.  Values from 0 to 7.
static const uint8 kDither565_8x8[64] = {
  0 >> 5, 128 >> 5, 32 >> 5, 160 >> 5,  8 >> 5, 136 >> 5, 40 >> 5, 168 >> 5,
  192 >> 5, 64 >> 5, 224 >> 5, 96 >> 5, 200 >> 5, 72 >> 5, 232 >> 5, 104 >> 5,
  48 >> 5, 176 >> 5, 16 >> 5, 144 >> 5, 56 >> 5, 184 >> 5, 24 >> 5, 152 >> 5,
  240 >> 5, 112 >> 5, 208 >> 5, 80 >> 5, 248 >> 5, 120 >> 5, 216 >> 5, 88 >> 5,
  12 >> 5, 140 >> 5, 44 >> 5, 172 >> 5,  4 >> 5, 132 >> 5, 36 >> 5, 164 >> 5,
  204 >> 5, 76 >> 5, 236 >> 5, 108 >> 5, 196 >> 5, 68 >> 5, 228 >> 5, 100 >> 5,
  60 >> 5, 188 >> 5, 28 >> 5, 156 >> 5, 52 >> 5, 180 >> 5, 20 >> 5, 148 >> 5,
  252 >> 5, 124 >> 5, 220 >> 5, 92 >> 5, 244 >> 5, 116 >> 5, 212 >> 5, 84 >> 5,
};

A 16x16 would be a little nicer, since it has the full 256 unique values.  But 
most SIMD instruction sets use 16 byte registers and most functions, such as 
I420ToARGB use 16 bit math, so they do 8 pixels at a time.  So an 8x8 was done, 
which can be implemented efficiently for Neon and SSE2.

Next steos are the AVX2, and then I420ToRGB565 as a 2 step function internally.

Original comment by fbarch...@google.com on 4 Mar 2015 at 6:28

GoogleCodeExporter commented 8 years ago

r1309 ports to AVX2 (on Windows)

AVX2
ARGBToRGB565_Opt (531 ms)
ARGBToRGB565Dither_Opt (625 ms)

SSE2
ARGBToRGB565_Opt (797 ms)
ARGBToRGB565Dither_Opt (813 ms)

A performance difference is noted, which is due to dithering doing to the image 
row by row, but undithered does row coalescing.  If row coalescing is turned 
off, both are similar speed.

Next step is I420ToRGB565.

Original comment by fbarch...@google.com on 4 Mar 2015 at 10:34

GoogleCodeExporter commented 8 years ago

The table for 565 is:
0, 4, 1, 5, 0, 4, 1, 5,
6, 2, 7, 3, 6, 2, 7, 3,
1, 5, 0, 4, 1, 5, 0, 4,
7, 3, 6, 2, 7, 3, 6, 2,
0, 4, 1, 5, 0, 4, 1, 5,
6, 2, 7, 3, 6, 2, 7, 3,
1, 5, 0, 4, 1, 5, 0, 4,
7, 3, 6, 2, 7, 3, 6, 2,

since 4x4 repeats and is sufficient for 4 bits dithered to 8 bits, I plan to 
reduce the pattern to 4x4, which will save a register and less calling overhead.

Original comment by fbarch...@google.com on 4 Mar 2015 at 11:43

GoogleCodeExporter commented 8 years ago

I have tested the updated matrix you provided, it works.

However, I still met a problem about stride

My test pattern is yuv420 (960*540)

I use the function(ConvertFromI420)to do YUV to RGB565 conversion. It's ok! (I 
describe parameters as following)
ConvertFromI420(src_y_data, 960,src_u_data, 480,src_v_data, 480, target_rgb, 
1920, 960, 540, libyuv::FOURCC_RGBP);

But I use I420ToARGB then ARGBToRGB565Dither(I describe parameters as 
following), it crashed. What mistake did I make?

I420ToARGBConvertFromI420(src_y_data, 960,src_u_data, 480,src_v_data, 480, 
target_rgb, 3840, 960, 540);
I420ToARGBConvertFromI420(src_rgb_data, 3840, dst_rgb_data,1920, NULL, 960,540);

If I chanage argb stride from 3840 to 1920, it works. but it will display 2 1/2 
pictures.(I describe parameters as following)

I420ToARGBConvertFromI420(src_y_data, 960,src_u_data, 480,src_v_data, 480, 
target_rgb, 1920, 960, 540);
I420ToARGBConvertFromI420(src_rgb_data, 1920, dst_rgb_data,1920, NULL, 960,540);

Original comment by michaely...@gmail.com on 5 Mar 2015 at 2:30

GoogleCodeExporter commented 8 years ago

switching to 4x4

AVX2
ARGBToRGB565_Opt (531 ms)
ARGBToRGB565Dither_Opt (610 ms)

SSE2
ARGBToRGB565_Opt (782 ms)
ARGBToRGB565Dither_Opt (860 ms)

Original comment by fbarch...@google.com on 7 Mar 2015 at 12:38

GoogleCodeExporter commented 8 years ago

I420ToRGB565(src_y_data, 960, src_u_data, 480, src_v_data, 480,  dst_rgb_data, 
1920, 960, 540);

is equivalent to

I420ToARGB(src_y_data, 960, src_u_data, 480, src_v_data, 480, target_rgb, 3840, 
960, 540);
ARGBToRGB565(src_rgb_data, 3840, dst_rgb_data, 1920, 960, 540);

and then dither with

I420ToARGB(src_y_data, 960, src_u_data, 480, src_v_data, 480, target_rgb, 3840, 
960, 540);
ARGBToRGB565Dither(src_rgb_data, 3840, dst_rgb_data, 1920, NULL, 960, 540);

Original comment by fbarch...@google.com on 7 Mar 2015 at 12:44

GoogleCodeExporter commented 8 years ago

target_rgb should be src_rgb_data above. Its a temporary 32 bit buffer.

Original comment by fbarch...@google.com on 7 Mar 2015 at 12:46

GoogleCodeExporter commented 8 years ago

a 2 step function is written, but untested.  up for review
https://webrtc-codereview.appspot.com/48429004

Original comment by fbarch...@google.com on 7 Mar 2015 at 4:00

GoogleCodeExporter commented 8 years ago

I420ToRGB565Dither() should be functional now, and partially optimized.

AVX2:
ARGBToRGB565_Opt (531 ms)
ARGBToRGB565Dither_Opt (578 ms)

SSSE3:
ARGBToRGB565_Opt (781 ms)
ARGBToRGB565Dither_Opt (860 ms)

Original comment by fbarch...@google.com on 10 Mar 2015 at 10:56

GoogleCodeExporter commented 8 years ago

Original comment by fbarch...@chromium.org on 16 Mar 2015 at 2:31

Changed state: Fixed

jaqquery / libyuv

Recommend support dithering in the function "ConvertFromI420" #407