watery01 / libyuv

Automatically exported from code.google.com/p/libyuv

ARGBPolynomial #265

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Implement a polynomial for ARGB pixels.
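
For readers new to the idea, here is a minimal scalar sketch of what a per-channel polynomial could look like. It assumes a cubic, c0 + c1*x + c2*x^2 + c3*x^3, with 16 float coefficients laid out as four groups of four (one per channel), which matches what the AVX2 assembly later in this thread appears to do. The function names are hypothetical and are not the libyuv API.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a float to 0..255, truncating toward zero. */
static uint8_t ClampToByte(float v) {
  if (v <= 0.0f) return 0;
  if (v >= 255.0f) return 255;
  return (uint8_t)v;
}

/* Hypothetical helper, not the libyuv function.
   poly[k * 4 + ch] is the degree-k coefficient for channel ch (assumed layout). */
static void ARGBPolynomialRow_Sketch(const uint8_t* src, uint8_t* dst,
                                     const float* poly, int width) {
  assert(width >= 0);
  for (int i = 0; i < width * 4; ++i) {
    const int ch = i & 3;            /* channel index within the 4-byte pixel */
    const float v = (float)src[i];
    const float r = poly[ch] + poly[4 + ch] * v + poly[8 + ch] * v * v +
                    poly[12 + ch] * v * v * v;
    dst[i] = ClampToByte(r);
  }
}
```

With all coefficients zero except the linear ones set to 1, this is an identity transform; setting one channel's linear coefficient to 2 doubles that channel and saturates at 255.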

Original issue reported on code.google.com by fbarch...@google.com on 3 Sep 2013 at 7:30

GoogleCodeExporter commented 9 years ago
r782
AVX2 TestARGBPolynomial (1591 ms)
SSE2 TestARGBPolynomial (2018 ms)
C    TestARGBPolynomial (12815 ms)

Original comment by fbarch...@google.com on 10 Sep 2013 at 8:18

GoogleCodeExporter commented 9 years ago
Ported to gcc and NaCl.
Marking as fixed.  Will port to Neon if/when the function proves useful.

Original comment by fbarch...@google.com on 12 Sep 2013 at 1:15

GoogleCodeExporter commented 9 years ago

Original comment by fbarch...@google.com on 12 Sep 2013 at 1:19

GoogleCodeExporter commented 9 years ago
FMA3 TestARGBPolynomial (1294 ms)
C code TestARGBPolynomial (14477 ms)
11.18x faster.

Original comment by fbarch...@google.com on 16 Sep 2013 at 8:04

GoogleCodeExporter commented 9 years ago
Bottleneck is unpack and conversion, not the math.
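
To illustrate the point, here is a rough one-pixel sketch using SSE2 intrinsics, assuming an x86 target. It is not the libyuv row function; the name and the coefficient layout (four vectors of four floats: constant, linear, quadratic, cubic) are assumptions. Note how many of the steps are unpacking, converting and repacking compared with the handful of multiplies and adds.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; x86 only */
#include <stdint.h>
#include <string.h>

/* Hypothetical one-pixel kernel, for illustration only. */
static void PolynomialPixel_SSE2_Sketch(const uint8_t* src, uint8_t* dst,
                                        const float* poly) {
  assert(src && dst && poly);
  const __m128i zero = _mm_setzero_si128();
  int32_t packed;
  memcpy(&packed, src, 4);

  /* Unpack: 4 bytes -> 4 x 16 bit -> 4 x 32 bit. */
  __m128i px = _mm_cvtsi32_si128(packed);
  px = _mm_unpacklo_epi8(px, zero);
  px = _mm_unpacklo_epi16(px, zero);

  /* Convert: int32 -> float. */
  const __m128 x = _mm_cvtepi32_ps(px);

  /* Math: c0 + c1*x + c2*x^2 + c3*x^3, one lane per channel. */
  const __m128 x2 = _mm_mul_ps(x, x);
  const __m128 x3 = _mm_mul_ps(x2, x);
  __m128 r = _mm_add_ps(_mm_loadu_ps(poly),
                        _mm_mul_ps(_mm_loadu_ps(poly + 4), x));
  r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(poly + 8), x2));
  r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(poly + 12), x3));

  /* Convert back (truncating), then pack with saturation to 0..255. */
  __m128i out = _mm_cvttps_epi32(r);
  out = _mm_packs_epi32(out, out);   /* int32 -> int16, signed saturation */
  out = _mm_packus_epi16(out, out);  /* int16 -> uint8, unsigned saturation */
  packed = _mm_cvtsi128_si32(out);
  memcpy(dst, &packed, 4);
}
```

A real row function would process several pixels per iteration and keep the coefficients in registers, but the ratio of data movement to arithmetic stays similar.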

Original comment by fbarch...@google.com on 17 Sep 2013 at 8:13

GoogleCodeExporter commented 9 years ago
bash-3.2$ clang++ -c source/row_posix.cc -I include/
bash-3.2$ g++ -c source/row_posix.cc -I include/
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6609:no such instruction: `vmovdqu (%rax),%xmm4'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6610:no such instruction: `vmovdqu 0x10(%rax),%xmm5'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6611:no such instruction: `vmovdqu 0x20(%rax),%xmm6'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6612:no such instruction: `vmovdqu 0x30(%rax),%xmm7'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6613:no such instruction: `vpermq $0x44,%ymm4,%ymm4'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6614:no such instruction: `vpermq $0x44,%ymm5,%ymm5'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6615:no such instruction: `vpermq $0x44,%ymm6,%ymm6'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6616:no such instruction: `vpermq $0x44,%ymm7,%ymm7'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6619:no such instruction: `vpmovzxbd (%rcx),%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6621:no such instruction: `vcvtdq2ps %ymm0,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6622:no such instruction: `vmulps %ymm0,%ymm0,%ymm2'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6623:no such instruction: `vmulps %ymm7,%ymm0,%ymm3'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6624:no such instruction: `vfmadd132ps %ymm5,%ymm4,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6625:no such instruction: `vfmadd231ps %ymm6,%ymm2,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6626:no such instruction: `vfmadd231ps %ymm3,%ymm2,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6627:no such instruction: `vcvttps2dq %ymm0,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6628:no such instruction: `vpackusdw %ymm0,%ymm0,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6629:no such instruction: `vpermq $0xd8,%ymm0,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6630:no such instruction: `vpackuswb %xmm0,%xmm0,%xmm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6635:no such instruction: `vzeroupper'
bash-3.2$ 

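clang++ builds row_posix.cc, but the assembler used by g++ here rejects the AVX2/FMA3 instructions.
Change the guard in row.h to
  defined(__native_client__) || defined(__clang__))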
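
As a sketch of how such a guard might be wired up, assuming the aim is to build the AVX2 path only on x86 with a NaCl or clang front end. The macro and function names below are hypothetical; the actual feature macros in row.h may differ.

```c
/* Hypothetical feature macro, for illustration only. */
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__native_client__) || defined(__clang__))
#define SKETCH_HAS_AVX2_ROW 1
#else
#define SKETCH_HAS_AVX2_ROW 0
#endif

/* Returns 1 when the AVX2 row would be compiled in under this guard. */
static int Avx2RowCompiledIn(void) {
  return SKETCH_HAS_AVX2_ROW;
}
```

With a guard like this, a g++ build would simply skip the AVX2 code and fall back to whatever other paths are compiled in.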

Original comment by fbarch...@google.com on 20 Sep 2013 at 8:32

GoogleCodeExporter commented 9 years ago
r796 limits AVX2 to the clang front end.  Tests okay on Xcode 4.5.
As this function is complete on x86, marking as fixed.
Future work includes a Neon (vfpu) port and a fixed-point variation.

Original comment by fbarch...@google.com on 23 Sep 2013 at 7:49

GoogleCodeExporter commented 9 years ago
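Polynomial AVX2 is slow: 30936 ms by default, versus 9960 ms with LIBYUV_DISABLE_ASM=1 and 1861 ms with LIBYUV_DISABLE_AVX2=1.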

fbarchard-macbookair2:yuv fbarchard$ LIBYUV_DISABLE_ASM=1 ./runyuv10 Poly*
LIBYUV_WIDTH=640 LIBYUV_HEIGHT=360 LIBYUV_REPEAT=4000 out/Release/libyuv_unittest --gtest_filter=*Poly* | sed 's/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g' | sort -rn | grep ms
9960 - [       OK ] libyuvTest.TestARGBPolynomial (9960 ms)
[==========] 1 test from 1 test case ran. (9960 ms total)
[----------] 1 test from libyuvTest (9960 ms total)
fbarchard-macbookair2:yuv fbarchard$ LIBYUV_DISABLE_AVX2=1 ./runyuv10 Poly*
LIBYUV_WIDTH=640 LIBYUV_HEIGHT=360 LIBYUV_REPEAT=4000 out/Release/libyuv_unittest --gtest_filter=*Poly* | sed 's/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g' | sort -rn | grep ms
1861 - [       OK ] libyuvTest.TestARGBPolynomial (1861 ms)
[==========] 1 test from 1 test case ran. (1862 ms total)
[----------] 1 test from libyuvTest (1861 ms total)
fbarchard-macbookair2:yuv fbarchard$ ./runyuv10 Poly*
LIBYUV_WIDTH=640 LIBYUV_HEIGHT=360 LIBYUV_REPEAT=4000 out/Release/libyuv_unittest --gtest_filter=*Poly* | sed 's/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g' | sort -rn | grep ms
30936 - [       OK ] libyuvTest.TestARGBPolynomial (30936 ms)
[==========] 1 test from 1 test case ran. (30936 ms total)

Original comment by fbarch...@google.com on 20 Oct 2013 at 7:08

GoogleCodeExporter commented 9 years ago
Fixed.
TestARGBPolynomial now takes 1155 ms.

Original comment by fbarch...@google.com on 21 Oct 2013 at 9:01