I420ToARGB_NEON is fast but still a bottleneck. Can it go faster?

GoogleCodeExporter commented 9 years ago

I420ToARGB_NEON shows up high in profiles.
Benchmarks show it's faster than the alternatives, but can it be optimized further?
Below is a profile of the I422ToARGBRow_NEON row function:

 Percent |      Source code & Disassembly of GoogleTalkPlugin
------------------------------------------------
        :
        :      /build/daisy/opt/google/talkplugin/GoogleTalkPlugin:     file format elf32-littlearm
        :
        :
        :      Disassembly of section .text:
        :
        :      000dde98 <I422ToARGBRow_NEON>:
        :      #ifdef HAS_I422TOARGBROW_NEON
        :      void I422ToARGBRow_NEON(const uint8* y_buf,
        :                              const uint8* u_buf,
        :                              const uint8* v_buf,
        :                              uint8* rgb_buf,
        :                              int width) {
   0.00 :         dde98:       push    {r4, r5, r6}
        :            "+r"(width)     // %4
        :          : "r"(&kUVToRB),  // %5
        :            "r"(&kUVToG)    // %6
        :          : "cc", "memory", "q0", "q1", "q2", "q3", "q8", "q9",
        :                            "q10", "q11", "q12", "q13", "q14", "q15"
        :        );
   0.00 :         dde9a:       ldr     r6, [pc, #152]  ; (ddf34 <I422ToARGBRow_NEON+0x9c>)
   0.00 :         dde9c:       ldr     r5, [pc, #152]  ; (ddf38 <I422ToARGBRow_NEON+0xa0>)
   0.00 :         dde9e:       ldr     r4, [sp, #12]
   0.00 :         ddea0:       add     r6, pc
   0.00 :         ddea2:       add     r5, pc
   0.00 :         ddea4:       vld1.8  {d24}, [r5]
   0.00 :         ddea8:       vld1.8  {d25}, [r6]
   0.00 :         ddeac:       vmov.i8 d26, #128       ; 0x80
   0.00 :         ddeb0:       vmov.i16        q14, #74        ; 0x004a
   0.00 :         ddeb4:       vmov.i16        q15, #16        ; 0x0010
  19.49 :         ddeb8:       vld1.8  {d0}, [r0]!
   5.06 :         ddebc:       vld1.32 {d2[0]}, [r1]!
   4.81 :         ddec0:       vld1.32 {d2[1]}, [r2]!
   0.00 :         ddec4:       veor    d2, d2, d26
   0.00 :         ddec8:       vmull.s8        q8, d2, d24
   0.00 :         ddecc:       vmull.s8        q9, d2, d25
   1.27 :         dded0:       vmov.i8 d1, #0  ; 0x00
   0.00 :         dded4:       vtrn.8  d0, d1
   0.00 :         dded8:       vsub.i16        q0, q0, q15
   0.00 :         ddedc:       vmul.i16        q0, q0, q14
   5.06 :         ddee0:       vadd.i16        d18, d18, d19
   0.00 :         ddee4:       vqadd.s16       d20, d0, d16
   0.00 :         ddee8:       vqadd.s16       d21, d1, d16
   0.00 :         ddeec:       vqadd.s16       d22, d0, d17
   5.06 :         ddef0:       vqadd.s16       d23, d1, d17
   0.00 :         ddef4:       vqadd.s16       d16, d0, d18
   0.00 :         ddef8:       vqadd.s16       d17, d1, d18
   0.00 :         ddefc:       vqrshrun.s16    d0, q10, #6
  27.09 :         ddf00:       vqrshrun.s16    d1, q11, #6
   0.00 :         ddf04:       vqrshrun.s16    d2, q8, #6
   0.00 :         ddf08:       vmovl.u8        q10, d0
   0.00 :         ddf0c:       vmovl.u8        q11, d1
   5.57 :         ddf10:       vmovl.u8        q8, d2
   0.00 :         ddf14:       vtrn.8  d20, d21
   0.00 :         ddf18:       vtrn.8  d22, d23
   0.00 :         ddf1c:       vtrn.8  d16, d17
   4.05 :         ddf20:       vorr    d21, d16, d16
   0.00 :         ddf24:       vmov.i8 d23, #255       ; 0xff
  20.76 :         ddf28:       vst4.8  {d20-d23}, [r3]!
   0.00 :         ddf2c:       subs    r4, #8
   0.00 :         ddf2e:       bgt.n   ddeb8 <I422ToARGBRow_NEON+0x20>
        :      }
   1.77 :         ddf30:       pop     {r4, r5, r6}
   0.00 :         ddf32:       bx      lr
   0.00 :         ddf34:       .word   0x0031259c
   0.00 :         ddf38:       .word   0x0031258a

Original issue reported on code.google.com by fbarch...@google.com on 9 Aug 2012 at 5:51

GoogleCodeExporter commented 9 years ago

Original comment by fbarch...@google.com on 11 Sep 2012 at 1:48

Changed title: I420ToARGB_NEON is fast but still a bottleneck. Can it go faster?

GoogleCodeExporter commented 9 years ago

The first small change that should help is aligned loads/stores.
This should be done for all NEON code, not just this function, so I'm opening a new bug for it.

The second change is to break the function into fetch, convert, and store stages, and
to add fetches for other YUV formats and stores for other RGB formats.  This would
avoid multistep conversions.  24-bit RGB should be easy: vst3.8  {d20-d22}, [r3]!
The complications are the calling code, the SSSE3 version, which is not trivial, and
rgb565/1555/5555, which are not trivial either.
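
As a rough illustration of the fetch/convert/store split, here is a minimal scalar C sketch. It is not the libyuv implementation. The names (Bgr, StoreFn, ConvertPixel, I422ToRgbRow_C, and so on) are hypothetical. Only the Y bias of 16, the Y gain of 74, and the shift by 6 come from the profile above; the U/V coefficients are assumed BT.601 values scaled by 64 (with the U-to-B term capped at 127 to fit int8), since the contents of kUVToRB and kUVToG are not shown.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: BT.601 coefficients scaled by 64. Only 16, 74 and the shift of
 * 6 appear in the profile; the U/V values stand in for kUVToRB and kUVToG. */
enum { kYBias = 16, kYGain = 74, kUB = 127, kVR = 102, kUG = -25, kVG = -52 };

typedef struct { uint8_t b, g, r; } Bgr;          /* hypothetical helper type */
typedef void (*StoreFn)(uint8_t* dst, Bgr p);     /* hypothetical store hook */

/* Saturating 16-bit add, like vqadd.s16. */
static int SatAdd16(int a, int b) {
  int s = a + b;
  return s > 32767 ? 32767 : (s < -32768 ? -32768 : s);
}

/* Rounding shift right by 6 with unsigned saturation, like vqrshrun.s16 #6. */
static uint8_t RoundShift6(int v) {
  if (v <= 0) return 0;
  v = (v + 32) >> 6;
  return (uint8_t)(v > 255 ? 255 : v);
}

/* Convert stage: independent of the source and destination layouts. */
static Bgr ConvertPixel(uint8_t y, uint8_t u, uint8_t v) {
  int y1 = (y - kYBias) * kYGain;
  int su = u - 128; /* same effect as veor with 0x80, read as signed */
  int sv = v - 128;
  Bgr p;
  p.b = RoundShift6(SatAdd16(y1, su * kUB));
  p.g = RoundShift6(SatAdd16(y1, su * kUG + sv * kVG));
  p.r = RoundShift6(SatAdd16(y1, sv * kVR));
  return p;
}

/* Store stages: memory order is B, G, R, with A appended for ARGB. */
static void StoreARGB(uint8_t* dst, Bgr p) {
  dst[0] = p.b; dst[1] = p.g; dst[2] = p.r; dst[3] = 255;
}
static void StoreRGB24(uint8_t* dst, Bgr p) {
  dst[0] = p.b; dst[1] = p.g; dst[2] = p.r;
}

/* Fetch stage for I422 (one U and one V per two Y), then convert and store. */
static void I422ToRgbRow_C(const uint8_t* y_buf, const uint8_t* u_buf,
                           const uint8_t* v_buf, uint8_t* dst, int width,
                           StoreFn store, int bytes_per_pixel) {
  assert(width >= 0 && bytes_per_pixel >= 3);
  for (int x = 0; x < width; ++x) {
    Bgr p = ConvertPixel(y_buf[x], u_buf[x / 2], v_buf[x / 2]);
    store(dst + x * bytes_per_pixel, p);
  }
}
```

Calling I422ToRgbRow_C(y, u, v, dst, width, StoreRGB24, 3) writes 24-bit output directly instead of converting to ARGB first. A NEON version would presumably select the store at compile time rather than through a function pointer, which is used here only to keep the sketch short.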

vtrn.8  d16, d17 should be changed to put its value in d21, since it's green and
is always between R and B in all RGB formats.  But it's not clear how to do
that without adding a vmov.
Overall register usage is poor, and too many instructions work on 'd' registers; they
should use 'q' registers.  Perhaps process 16 pixels per iteration.

Original comment by fbarch...@google.com on 4 Oct 2012 at 1:02

GoogleCodeExporter commented 9 years ago

Indications are that the current code has stalls.
I suggest doing the replication differently and avoiding the multiply stalls.

Original comment by fbarch...@google.com on 12 Jan 2013 at 9:25

Changed state: WontFix

watery01 / libyuv

I420ToARGB_NEON is fast but still a bottleneck. Can it go faster? #67