Possible issue with SSE2 implementation of IDCT functions

GoogleCodeExporter commented 9 years ago

Hi,

We have a coded VP8 test stream contains large AC coefficient. We find 
libvpx0.9.2 generate different YUV output between C implementation of 
“idct_add_y_block” function “vp8_dequant_idct_add_y_block_c” 
 and SSE2 implementation of  it  “vp8_dequant_idct_add_y_block_sse2”. 

The mismatch is in frame 5 (counting from 0 ) at MBA (1, 1), the problem block 
has a large coefficient -1116 (which is within sign 12-bit limit) that cause 
SSE2 IDCT overflow.  Please help confirm it.

The test stream is attached.

Thanks,

Olive

Original issue reported on code.google.com by olive...@yahoo.com on 21 Sep 2010 at 4:59

Attachments:

test.ivf

GoogleCodeExporter commented 9 years ago

Hi Olive,

I can confirm what you're seeing. I see the different behavior with MMX-only as 
well as SSE2, though I didn't try to isolate it to a function. If you isolated 
it to add_y_block_sse2, I assume add_y_block_mmx has the same bug.

Original comment by jkoles...@google.com on 24 Sep 2010 at 3:05

Changed state: Accepted
Added labels: Component-libvpx, Priority-High, Type-Defect

GoogleCodeExporter commented 9 years ago

Olive,

I wasn't able to locate the -1116 coeffieient you mention, but I instrumented 
vp8_dequant_idct_add_y_block_c as follows, and it seems that the dequantized 
coefficients exceed the range of valid inputs to the IDCT. In this case, I 
think the behavior is undefined, so the C not matching the assembly isn't a 
problem.

diff --git a/vp8/decoder/idct_blk.c b/vp8/decoder/idct_blk.c
index c6a4257..1208bd1 100644
--- a/vp8/decoder/idct_blk.c
+++ b/vp8/decoder/idct_blk.c
@@ -48,6 +48,12 @@ void vp8_dequant_idct_add_y_block_c
     {
         for (j = 0; j < 4; j++)
         {
+int k;
+for (k = 0; k < 16; k++)
+    if(abs(q[k]*dq[k]) >= 2047)
+        printf("%s found (%d*%d=%d) at %d,%d,%d.\n",
+               __PRETTY_FUNCTION__,q[k],dq[k],q[k]*dq[k],i,j,k);
+

$ VPX_SIMD_CAPS=0 ./ivfdec -o /dev/null --progress ~/Downloads/test.ivf 

WebM Project VP8 Decoder v0.9.2-65-g7288cdf
decoded frame 1.
decoded frame 2.
decoded frame 3.
decoded frame 4.
vp8_dequant_idct_add_y_block_c found (-87*24=-2088) at 1,0,1.
vp8_dequant_idct_add_y_block_c found (93*24=2232) at 0,1,1.
vp8_dequant_idct_add_y_block_c found (207*24=4968) at 2,0,2.
decoded frame 5.
vp8_dequant_idct_add_y_block_c found (-159*21=-3339) at 0,2,0.
vp8_dequant_idct_add_y_block_c found (108*24=2592) at 0,2,6.
vp8_dequant_idct_add_y_block_c found (-298*24=-7152) at 3,3,4.
vp8_dequant_idct_add_y_block_c found (-91*24=-2184) at 3,0,6.
vp8_dequant_idct_add_y_block_c found (-95*24=-2280) at 0,1,3.
vp8_dequant_idct_add_y_block_c found (-101*24=-2424) at 2,1,3.
vp8_dequant_idct_add_y_block_c found (88*24=2112) at 0,0,8.
vp8_dequant_idct_add_y_block_c found (-426*24=-10224) at 2,0,8.
vp8_dequant_idct_add_y_block_c found (142*24=3408) at 1,3,4.
vp8_dequant_idct_add_y_block_c found (-255*24=-6120) at 2,1,3.
decoded frame 6.
vp8_dequant_idct_add_y_block_c found (213*21=4473) at 0,3,0.
vp8_dequant_idct_add_y_block_c found (-101*21=-2121) at 0,2,0.
vp8_dequant_idct_add_y_block_c found (93*24=2232) at 1,2,4.
vp8_dequant_idct_add_y_block_c found (-104*24=-2496) at 1,0,4.
vp8_dequant_idct_add_y_block_c found (584*24=14016) at 3,3,4.
vp8_dequant_idct_add_y_block_c found (-147*21=-3087) at 0,3,0.
vp8_dequant_idct_add_y_block_c found (124*24=2976) at 3,1,4.
decoded frame 7.
decoded frame 8.
vp8_dequant_idct_add_y_block_c found (-128*24=-3072) at 2,0,9.
vp8_dequant_idct_add_y_block_c found (90*24=2160) at 1,1,1.
vp8_dequant_idct_add_y_block_c found (-137*24=-3288) at 3,0,3.
vp8_dequant_idct_add_y_block_c found (-214*24=-5136) at 3,2,5.
vp8_dequant_idct_add_y_block_c found (178*24=4272) at 3,1,11.
decoded frame 9.
vp8_dequant_idct_add_y_block_c found (411*24=9864) at 3,2,2.
vp8_dequant_idct_add_y_block_c found (263*24=6312) at 3,2,15.
decoded frame 10.

To confirm, you're generating this file as a test case and it's not the output 
of the  unmodified encoder, correct?

Original comment by jkoles...@google.com on 29 Sep 2010 at 1:48

Added labels: Priority-Medium
Removed labels: Priority-High

GoogleCodeExporter commented 9 years ago

From the spec. section 13.2 Coding of individual Coefficient Values,

"The tokens dct_cat1 … dct_cat6 specify ranges of unsigned values, the value 
within the range being formed by adding an unsigned offset (whose width is 1, 
2, 3, 4, 5, or 11 bits, respectively) to the base of the range, using the 
following algorithm and fixed probability tables"

It seems that the dq[k] is within range +-2047 the "ACQLookup" could be from 4 
to 284, the dequantized value could be > 2047. 

Is there a limitation on the input of IDCT?

Yes, we modified the encoder mainly to force some out of bound MV to qualify 
our MV pullback logic and did not touch the part for quantization or 
coefficient generation.

Regards,

Olive

Original comment by olive...@yahoo.com on 29 Sep 2010 at 6:41

GoogleCodeExporter commented 9 years ago

From section 14.2, it seems that IDCT input is within signed 16 bit. Is this 
correct.

Thanks,

Original comment by olive...@yahoo.com on 29 Sep 2010 at 6:55

GoogleCodeExporter commented 9 years ago

The IDCT input is signed 12 bits, ie [-2048,2047]. It's syntactically possible 
to code a coefficient that would exceed this range when dequantized, but this 
is semantically incorrect. An encoder that produces this condition is invalid 
or the data is otherwise corrupt, so the C not matching the assembly in this 
case isn't a problem.

Original comment by jkoles...@google.com on 17 Nov 2010 at 1:27

Changed state: Invalid

GoogleCodeExporter commented 9 years ago

Original comment by albe...@google.com on 8 Mar 2012 at 12:10

Added labels: Type-Bug
Removed labels: Type-Defect

shshankjain / webm

Possible issue with SSE2 implementation of IDCT functions #180