uclouvain / openjpeg

Official repository of the OpenJPEG project
Other
971 stars 456 forks source link

opj_compress fails to compress lossless on gcc/x86 (-m32) #571

Closed malaterre closed 9 years ago

malaterre commented 9 years ago

I am having an issue with OPJ 1.5 and the following file:

http://gdcm.sourceforge.net/thingies/lossless_issue.pgm

It looks as if I cannot do a round-trip lossless compression / decompression, within a 32bits schroot. Everything is fine within a 64bits schroot system (debian jessie if that matter).

malaterre commented 9 years ago

Steps from debian/jessie i386:

$ image_to_j2k -i lossless_issue.pgm -o lossless_issue.j2k $ j2k_to_image -i lossless_issue.j2k -o lossless_issue.opj.pgm $ vim -b lossless_issue.opj.pgm -> remove comment $ vbindiff lossless_issue.pgm lossless_issue.opj.pgm

malaterre commented 9 years ago

I can reproduce the exact same behavior using openjpeg 2.1.0 (as packaged in debian).

malaterre commented 9 years ago

It seems the compression step is not working as expected:

$ crc32 *
49d5c244    lossless_issue.pgm
c611132c    lossless_issue.sid32.j2k
a913d7c3    lossless_issue.sid32.kdu.pgm
a913d7c3    lossless_issue.sid32.pgm
9922c6b1    lossless_issue.sid.j2k
49d5c244    lossless_issue.sid.kdu.pgm
49d5c244    lossless_issue.sid.pgm
malaterre commented 9 years ago

The good news is that if I run:

$ valgrind  opj_compress -i lossless_issue.pgm -o lossless_issue.sid32.j2k

Everything works as expected. So valgrind is doing some kind of initialisation (rounding?) step that is missing in normal code executition.

malaterre commented 9 years ago

If I use now clang-3.5 from my sid 32bits chroot everything works nicely. So the issue is not within the libc and such, but rather in the code generated by gcc.

malaterre commented 9 years ago

gcc-4.8, gcc-4.9 and gcc-5.2 all have the same symptoms

malaterre commented 9 years ago

If I recompile openjpeg with -fsanitize=undefined I get:

/home/mathieu/tmp/opj-bug/openjpeg/src/lib/openjp2/t1.c:1517:28: runtime error: left shift of negative value -128

ref:

https://github.com/uclouvain/openjpeg/blob/master/src/lib/openjp2/t1.c#L1517

malaterre commented 9 years ago

EVen if I fix the code using:

tiledp[tileIndex] *= (1 << T1_NMSEDEC_FRACBITS);

it still fails

malaterre commented 9 years ago

Now if I compile the whole code using gcc and then do:

(cd /home/mathieu/tmp/opj-bug/openjpeg/bin/src/lib/openjp2 && /usr/bin/clang  -Dopenjp2_EXPORTS -O0   -g -fPIC -I/home/mathieu/tmp/opj-bug/openjpeg/bin/src/lib/openjp2    -o CMakeFiles/openjp2.dir/tcd.c.o   -c /home/mathieu/tmp/opj-bug/openjpeg/src/lib/openjp2/tcd.c)

the code works as expected

malaterre commented 9 years ago

It appears that the issue is within this function opj_tcd_makelayer:

https://github.com/uclouvain/openjpeg/blob/master/src/lib/openjp2/tcd.c#L218

malaterre commented 9 years ago

To fix the symptoms one could use excess precision using: -msse2 -mfpmath=sse

malaterre commented 9 years ago

It is still not clear why such minor change leads to a lossy compressed stream, this must only be the tip of the iceberg.

malaterre commented 9 years ago

So the patch is simply, replace:

if (dd / dr >= thresh)

with:

OPJ_FLOAT64 div;
div = dd / dr;
if (div >= thresh)

Then compile the code with -ffloat-store to remove excess precision.

malaterre commented 9 years ago

The gcc documentation states that -fexcess-precision=standard should be enough but not in this case. We are still required to use -ffloat-store which feels like a hack.

malaterre commented 9 years ago

Confirmed by upstream gcc dev here. We really need to understand why the code is so sensible to excess precision.

mayeut commented 9 years ago

@malaterre,

I had a look at the piece of code that's mentioned here. In this specific case, for one of the computed value (dd/dr "==" threshold), the comparison is true with -ffloat-store & false without it.

This can be shown easily with this piece of code (with a bit of context):

  if (!dr) {
    if (dd != 0)
      n = passno + 1;
    continue;
  }
  if ((dd / dr) < 0.45) {
    printf("%.32f - %.32f\n", (dd / dr), thresh);
  }

  if ((dd / dr) >= thresh) {
    n = passno + 1;
    if ((dd / dr) < 0.45) {
      printf("True compare\n");
    }
  }
}

layer->numpasses = n - cblk->numpassesinlayers;

We should not rely on "==" comparison for floats.

mayeut commented 9 years ago

I'll add this specific image to the test suite if needed

malaterre commented 9 years ago

Just in case my series of comments were not clear:

And finally as discussed with Antonin, I fail to understand this piece of code. So removing the excess precision using compiler specific option is just plain wrong. One would need to understand why this piece of code is so damn sensible to excess precision.

Thanks for adding the dataset to the test suite !

malaterre commented 9 years ago

Should I re-open this issue for opj 1.5 branch ?

mayeut commented 9 years ago

The issue can only be affected to one milestone... I think a new one shall be created or maybe just list this one it in #574 .

vinc17fr commented 8 years ago

Note that even if you remove the excess precision, your code might still be affected by double rounding (which occurs with a probability of 1/2048 on random data). So, you really need to understand the code and how it can be affected by floating-point rounding errors. The bug was closed by doing "thresh - (dd / dr) < DBL_EPSILON" but one question is whether DBL_EPSILON is sufficient, i.e. what is the maximum error on thresh - (dd / dr) that you can expect?

malaterre commented 8 years ago

Salut Vincent ! Indeed I had seen this at least once through a funky example. Do you believe that simply increasing the precision for DBL_EPSILON will make the issue go away ?

vinc17fr commented 8 years ago

Depending on the context, a bound like DBL_EPSILON or something larger may or may not be correct. First, the code should be commented so that the reader can know what it tries to do, for instance some mathematical specification. About the floating-point implementation: The variable dr is an integer, so that I suppose that it is exact. The variables dd and thresh are floating-point numbers. There are two questions about them: What is the range of thresh and dd / dr? What is the floating-point error bound on the variables dd and thresh? Note that the current code may be regarded as suspicious, because the error bound DBL_EPSILON is an absolute one, while a relative one may be needed. For instance, if thresh and dd / dr can be large (or small) compared to 1, then the bound DBL_EPSILON does not make much sense. Hence my question about the range.