watery01 / libyuv

Automatically exported from code.google.com/p/libyuv

RGBColorTable is slow #266

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
C color table
TestRGBColorTable (1233 ms)

X86
TestRGBColorTable (1604 ms)

The C code (VS2012) uses 3 instructions per channel:
movzx byte fetch
movzx table
mov byte store
Every channel reuses the same register (ecx). A modern CPU renames registers, so
this won't stall, but the loop will saturate the load/store units.
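
For reference, a minimal C sketch of the per-channel lookup described above. It is inferred from the ARGBColorTableRow_C disassembly later in this thread, not copied from the libyuv source. The name ColorTableRowSketch is hypothetical, and the table layout (256 entries of 4 bytes, one byte per channel) is an assumption based on the `(%rsi,%rcx,4)` addressing in that listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; a sketch, not the libyuv implementation.
 * Assumes table_argb holds 256 entries of 4 bytes, one byte per channel,
 * and that the row is modified in place. */
void ColorTableRowSketch(uint8_t* dst_argb, const uint8_t* table_argb,
                         int width) {
  assert(dst_argb != NULL && table_argb != NULL);
  for (int x = 0; x < width; ++x) {
    /* Byte fetch: one zero-extending load per channel. */
    int c0 = dst_argb[0];
    int c1 = dst_argb[1];
    int c2 = dst_argb[2];
    int c3 = dst_argb[3];
    /* Table load plus byte store: two more memory operations per channel. */
    dst_argb[0] = table_argb[c0 * 4 + 0];
    dst_argb[1] = table_argb[c1 * 4 + 1];
    dst_argb[2] = table_argb[c2 * 4 + 2];
    dst_argb[3] = table_argb[c3 * 4 + 3];
    dst_argb += 4;
  }
}
```

Each channel costs two loads and one store and almost no arithmetic, which is consistent with the loop being limited by the load/store units rather than by register reuse.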

I tried a different order (2 pixels of red) and it was a win, but register
pressure reduced the gain.
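
One possible reading of the reordering, sketched in C: handle the same channel of two adjacent pixels together, so the two byte fetches are issued before either store. This is only an assumption about what "2 pixels of red" means. The name ColorTableRowPairSketch is hypothetical and the code is not the version that was benchmarked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; a sketch of one possible two-pixel ordering.
 * Assumes the same 256 x 4 byte table layout and in-place update. */
void ColorTableRowPairSketch(uint8_t* dst_argb, const uint8_t* table_argb,
                             int width) {
  assert(dst_argb != NULL && table_argb != NULL);
  int x = 0;
  for (; x + 1 < width; x += 2) {
    for (int c = 0; c < 4; ++c) {
      /* Fetch channel c of both pixels before storing either result. */
      int v0 = dst_argb[c];
      int v1 = dst_argb[4 + c];
      dst_argb[c] = table_argb[v0 * 4 + c];
      dst_argb[4 + c] = table_argb[v1 * 4 + c];
    }
    dst_argb += 8;
  }
  if (x < width) {
    /* Odd width: process the last pixel on its own. */
    for (int c = 0; c < 4; ++c) {
      dst_argb[c] = table_argb[dst_argb[c] * 4 + c];
    }
  }
}
```

Keeping two index values live per channel needs more registers than the one-pixel loop, which may be the register pressure mentioned above.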

Original issue reported on code.google.com by fbarch...@google.com on 10 Sep 2013 at 2:37

GoogleCodeExporter commented 9 years ago
r791 using x64 assembly
[       OK ] libyuvTest.TestARGBColorTable (1748 ms)
[       OK ] libyuvTest.TestRGBColorTable (1299 ms)

r780 using llvm x64 C
[       OK ] libyuvTest.TestARGBColorTable (1817 ms)
[       OK ] libyuvTest.TestRGBColorTable (1344 ms)

C
+0x00   pushq               %rbp
+0x01   movq                %rsp, %rbp
+0x04   testl               %edx, %edx
+0x06   jle                 ARGBColorTableRow_C+0x46
+0x08   movl                %edx, %eax
+0x0a   addq                $3, %rdi
+0x0e   nop                 
+0x10       movzbl              -3(%rdi), %ecx
+0x14       movb                (%rsi,%rcx,4), %cl
+0x17       movzbl              (%rdi), %edx
+0x1a       movzbl              -1(%rdi), %r8d
+0x1f       movzbl              -2(%rdi), %r9d
+0x24       movb                %cl, -3(%rdi)
+0x27       movb                1(%rsi,%r9,4), %cl
+0x2c       movb                %cl, -2(%rdi)
+0x2f       movb                2(%rsi,%r8,4), %cl
+0x34       movb                %cl, -1(%rdi)
+0x37       movb                3(%rsi,%rdx,4), %cl
+0x3b       movb                %cl, (%rdi)
+0x3d       addq                $4, %rdi
+0x41       decq                %rax
+0x44       jne                 ARGBColorTableRow_C+0x10
+0x46   popq                %rbp
+0x47   ret                 

asm
+0x00   pushq               %rbp
+0x01   movq                %rsp, %rbp
+0x04   movl                %edx, %eax
+0x06   xorl                %edx, %edx
+0x08   nopl                (%rax,%rax)
+0x10   movzbq              (%rdi), %rdx
+0x14   leaq                4(%rdi), %rdi
+0x18   movzbq              (%rsi,%rdx,4), %rdx
+0x1d   movb                %dl, -4(%rdi)
+0x20   movzbq              -3(%rdi), %rdx
+0x25   movzbq              1(%rsi,%rdx,4), %rdx
+0x2b   movb                %dl, -3(%rdi)
+0x2e   movzbq              -2(%rdi), %rdx
+0x33   movzbq              2(%rsi,%rdx,4), %rdx
+0x39   movb                %dl, -2(%rdi)
+0x3c   movzbq              -1(%rdi), %rdx
+0x41   movzbq              3(%rsi,%rdx,4), %rdx
+0x47   movb                %dl, -1(%rdi)
+0x4a   decl                %eax
+0x4c   jg                  ARGBColorTableRow_X86+0x10
+0x4e   popq                %rbp
+0x4f   ret     

Original comment by fbarch...@google.com on 17 Sep 2013 at 7:39

GoogleCodeExporter commented 9 years ago
Linux x64
TestARGBLumaColorTable (1988 ms)
TestARGBColorTable (1610 ms)
TestRGBColorTable (1162 ms)

Original comment by fbarch...@google.com on 23 Sep 2013 at 7:51