Closed GoogleCodeExporter closed 9 years ago
r782
AVX2 TestARGBPolynomial (1591 ms)
SSE2 TestARGBPolynomial (2018 ms)
C TestARGBPolynomial (12815 ms)
Original comment by fbarch...@google.com
on 10 Sep 2013 at 8:18
Ported to gcc and NaCl.
Marking as fixed. Will port to Neon if/when the function proves useful.
Original comment by fbarch...@google.com
on 12 Sep 2013 at 1:15
Original comment by fbarch...@google.com
on 12 Sep 2013 at 1:19
FMA3 TestARGBPolynomial (1294 ms)
C code TestARGBPolynomial (14477 ms)
11.18x faster.
Original comment by fbarch...@google.com
on 16 Sep 2013 at 8:04
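The speedup quoted above is the ratio of the C time to the SIMD time. A minimal sketch of that arithmetic follows; the helper name `Speedup` is hypothetical and is not part of the test harness.

```cpp
// Hypothetical helper: ratio of baseline time to optimized time.
// Both arguments are wall-clock times in milliseconds.
constexpr double Speedup(double baseline_ms, double optimized_ms) {
  return baseline_ms / optimized_ms;
}

// With the numbers from the thread:
//   Speedup(14477, 1294) is about 11.188 (quoted as 11.18x)
//   Speedup(12815, 1591) is about 8.05   (r782, AVX2 vs C)
//   Speedup(12815, 2018) is about 6.35   (r782, SSE2 vs C)
```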
Bottleneck is unpack and conversion, not the math.
Original comment by fbarch...@google.com
on 17 Sep 2013 at 8:13
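To make the "unpack and conversion, not the math" remark concrete, here is a scalar sketch of a per-channel cubic polynomial over 4-byte pixels. It is an illustration only, not the library's code: the function name is hypothetical, and the coefficient layout (16 floats, four per term) is an assumption.

```cpp
#include <cstdint>

// Hypothetical name; a scalar illustration, not the library's implementation.
// Assumes 4 bytes per pixel and poly laid out as 16 floats:
// poly[0..3] = constant terms, poly[4..7] = linear, poly[8..11] = quadratic,
// poly[12..15] = cubic, with one coefficient per channel in each group.
void PolynomialRowSketch(const std::uint8_t* src, std::uint8_t* dst,
                         const float* poly, int width) {
  for (int i = 0; i < width * 4; ++i) {
    const int ch = i & 3;  // channel index within the pixel
    // Unpack and convert: widen the byte to a float.
    const float x = static_cast<float>(src[i]);
    // Math: evaluate c0 + c1*x + c2*x^2 + c3*x^3.
    const float x2 = x * x;
    const float x3 = x2 * x;
    float y = poly[ch] + poly[4 + ch] * x + poly[8 + ch] * x2 +
              poly[12 + ch] * x3;
    // Convert and pack: clamp to the byte range (NaN maps to 0),
    // then truncate toward zero.
    if (!(y > 0.0f)) y = 0.0f;
    if (y > 255.0f) y = 255.0f;
    dst[i] = static_cast<std::uint8_t>(y);
  }
}
```

Only the three multiply/add lines are "the math"; the rest of the loop body widens bytes to floats and narrows them back with saturation, which is the part the comment identifies as the bottleneck.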
bash-3.2$ clang++ -c source/row_posix.cc -I include/
bash-3.2$ g++ -c source/row_posix.cc -I include/
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6609:no such instruction: `vmovdqu (%rax),%xmm4'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6610:no such instruction: `vmovdqu 0x10(%rax),%xmm5'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6611:no such instruction: `vmovdqu 0x20(%rax),%xmm6'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6612:no such instruction: `vmovdqu 0x30(%rax),%xmm7'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6613:no such instruction: `vpermq $0x44,%ymm4,%ymm4'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6614:no such instruction: `vpermq $0x44,%ymm5,%ymm5'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6615:no such instruction: `vpermq $0x44,%ymm6,%ymm6'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6616:no such instruction: `vpermq $0x44,%ymm7,%ymm7'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6619:no such instruction: `vpmovzxbd (%rcx),%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6621:no such instruction: `vcvtdq2ps %ymm0,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6622:no such instruction: `vmulps %ymm0,%ymm0,%ymm2'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6623:no such instruction: `vmulps %ymm7,%ymm0,%ymm3'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6624:no such instruction: `vfmadd132ps %ymm5,%ymm4,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6625:no such instruction: `vfmadd231ps %ymm6,%ymm2,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6626:no such instruction: `vfmadd231ps %ymm3,%ymm2,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6627:no such instruction: `vcvttps2dq %ymm0,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6628:no such instruction: `vpackusdw %ymm0,%ymm0,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6629:no such instruction: `vpermq $0xd8,%ymm0,%ymm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6630:no such instruction: `vpackuswb %xmm0,%xmm0,%xmm0'
/var/folders/00/0pldr000h01000cxqpysvccm002tdq/T//ccBVaUdW.s:6635:no such instruction: `vzeroupper'
bash-3.2$
Fix: change the condition in row.h to
defined(__native_client__) || defined(__clang__))
Original comment by fbarch...@google.com
on 20 Sep 2013 at 8:32
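The comment above shows clang++ assembling the file while g++ rejects the AVX2/FMA3 instructions, and quotes part of a condition that checks for `__native_client__` or `__clang__`. A minimal sketch of a compile-time gate of that kind follows. The macro and function names are hypothetical, and the full condition used in row.h is not shown in the thread, so only the two quoted tests appear here.

```cpp
// Hypothetical macro name. Only the two tests quoted in the thread are used;
// any other terms in the real row.h condition are not reproduced here.
#if defined(__native_client__) || defined(__clang__)
#define SKETCH_AVX2_ASM_OK 1
#else
#define SKETCH_AVX2_ASM_OK 0
#endif

// True when this translation unit was built by a front end that the thread
// reports as able to assemble the AVX2/FMA3 inline assembly.
constexpr bool SketchAvx2AsmOk() { return SKETCH_AVX2_ASM_OK != 0; }
```

A gate like this only decides whether the code is compiled in. Whether the CPU actually supports AVX2 is a separate question that has to be answered at run time.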
r796 limits AVX2 to the clang front end. Tests okay on Xcode 4.5.
As this function is complete on x86, marking as fixed.
Future work includes a Neon (vfpu) port and a fixed-point variation.
Original comment by fbarch...@google.com
on 23 Sep 2013 at 7:49
Polynomial AVX2 is slow
fbarchard-macbookair2:yuv fbarchard$ LIBYUV_DISABLE_ASM=1 ./runyuv10 Poly*
LIBYUV_WIDTH=640 LIBYUV_HEIGHT=360 LIBYUV_REPEAT=4000
out/Release/libyuv_unittest --gtest_filter=*Poly* | sed 's/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g' | sort -rn | grep ms
9960 - [ OK ] libyuvTest.TestARGBPolynomial (9960 ms)
[==========] 1 test from 1 test case ran. (9960 ms total)
[----------] 1 test from libyuvTest (9960 ms total)
fbarchard-macbookair2:yuv fbarchard$ LIBYUV_DISABLE_AVX2=1 ./runyuv10 Poly*
LIBYUV_WIDTH=640 LIBYUV_HEIGHT=360 LIBYUV_REPEAT=4000
out/Release/libyuv_unittest --gtest_filter=*Poly* | sed 's/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g' | sort -rn | grep ms
1861 - [ OK ] libyuvTest.TestARGBPolynomial (1861 ms)
[==========] 1 test from 1 test case ran. (1862 ms total)
[----------] 1 test from libyuvTest (1861 ms total)
fbarchard-macbookair2:yuv fbarchard$ ./runyuv10 Poly*
LIBYUV_WIDTH=640 LIBYUV_HEIGHT=360 LIBYUV_REPEAT=4000
out/Release/libyuv_unittest --gtest_filter=*Poly* | sed 's/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g' | sort -rn | grep ms
30936 - [ OK ] libyuvTest.TestARGBPolynomial (30936 ms)
[==========] 1 test from 1 test case ran. (30936 ms total)
Original comment by fbarch...@google.com
on 20 Oct 2013 at 7:08
Fixed
Now TestARGBPolynomial (1155 ms)
Original comment by fbarch...@google.com
on 21 Oct 2013 at 9:01
Original issue reported on code.google.com by
fbarch...@google.com
on 3 Sep 2013 at 7:30