Failing test_vcf_api/test_vcf_sweep tests on i386

satta commented 7 years ago

I'm forwarding Debian bug 865012 (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=865012), which mentions failing tests on i386. I can reproduce these on a plain Debian stretch i386 install with the most recent htslib release 1.5:

vagrant@debian-9rc1-i386:/vagrant/htslib$ make test
test/hts_endian
test/fieldarith test/fieldarith.sam
test/hfile
test/test_bgzf test/bgziptest.txt
cd test/tabix && ./test-tabix.sh tabix.tst
Testing tabix...
PASS : $tabix -f -p vcf vcf_file.tbi.tmp.vcf.gz
[...]
test_vcf_api:
    /vagrant/htslib/test/test-vcf-api /tmp/KuypJQksGq/test-vcf-api.bcf
bcf_get_format_float didn't produce the expected output.

.. failed ...

test_vcf_sweep:
    /vagrant/htslib/test/test-vcf-sweep /tmp/KuypJQksGq/test-vcf-api.bcf

    The outputs differ:
        /vagrant/htslib/test/test-vcf-sweep.out
        /vagrant/htslib/test/test-vcf-sweep.out.new
.. failed ...

It looks like the test_vcf_sweep failure is a consequence of the test_vcf_api test failing. That failure seems to be caused by test-vcf-api.c:360:

           if (ret != 4 ||
            count < ret ||
            !bcf_float_is_missing(values[0]) ||
            values[1] != 47.11f ||
            !bcf_float_is_vector_end(values[2]) ||
            !bcf_float_is_vector_end(values[3]))
        {
            fprintf(stderr, "bcf_get_format_float didn't produce the expected output.\n");
            exit(-1);
        }

in particular the bcf_float_is_vector_end() checks which are failing. Values read do not match the constants defined in vcf.c for bcf_float_vector_end. Apparently there is also some amount of uninitialized memory handled while writing the formatted file in the first place:

vagrant@debian-9rc1-i386:/vagrant/htslib$ valgrind ./test/test-vcf-api test-vcf-api.bcf
==25115== Memcheck, a memory error detector
==25115== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==25115== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==25115== Command: ./test/test-vcf-api test-vcf-api.bcf
==25115==
==25115== Use of uninitialised value of size 4
==25115==    at 0x4843683: crc32 (in /lib/i386-linux-gnu/libz.so.1.2.8)
==25115==    by 0x142274: bgzf_compress (bgzf.c:377)
==25115==    by 0x1423E2: deflate_block (bgzf.c:411)
==25115==    by 0x144A6F: bgzf_flush (bgzf.c:1351)
==25115==    by 0x144FF8: bgzf_close (bgzf.c:1435)
==25115==    by 0x11376A: hts_close (hts.c:924)
==25115==    by 0x10B046: write_format_values (test-vcf-api.c:339)
==25115==    by 0x10A16D: test_get_format_values (test-vcf-api.c:381)
==25115==    by 0x10A16D: main (test-vcf-api.c:390)
==25115==
==25115== Use of uninitialised value of size 4
==25115==    at 0x4843686: crc32 (in /lib/i386-linux-gnu/libz.so.1.2.8)
==25115==    by 0x142274: bgzf_compress (bgzf.c:377)
==25115==    by 0x1423E2: deflate_block (bgzf.c:411)
==25115==    by 0x144A6F: bgzf_flush (bgzf.c:1351)
==25115==    by 0x144FF8: bgzf_close (bgzf.c:1435)
==25115==    by 0x11376A: hts_close (hts.c:924)
==25115==    by 0x10B046: write_format_values (test-vcf-api.c:339)
==25115==    by 0x10A16D: test_get_format_values (test-vcf-api.c:381)
==25115==    by 0x10A16D: main (test-vcf-api.c:390)
==25115==
==25115== Use of uninitialised value of size 4
==25115==    at 0x484368D: crc32 (in /lib/i386-linux-gnu/libz.so.1.2.8)
==25115==    by 0x142274: bgzf_compress (bgzf.c:377)
==25115==    by 0x1423E2: deflate_block (bgzf.c:411)
==25115==    by 0x144A6F: bgzf_flush (bgzf.c:1351)
==25115==    by 0x144FF8: bgzf_close (bgzf.c:1435)
==25115==    by 0x11376A: hts_close (hts.c:924)
==25115==    by 0x10B046: write_format_values (test-vcf-api.c:339)
==25115==    by 0x10A16D: test_get_format_values (test-vcf-api.c:381)
==25115==    by 0x10A16D: main (test-vcf-api.c:390)
==25115==
==25115== Use of uninitialised value of size 4
==25115==    at 0x4843694: crc32 (in /lib/i386-linux-gnu/libz.so.1.2.8)
==25115==    by 0x142274: bgzf_compress (bgzf.c:377)
==25115==    by 0x1423E2: deflate_block (bgzf.c:411)
==25115==    by 0x144A6F: bgzf_flush (bgzf.c:1351)
==25115==    by 0x144FF8: bgzf_close (bgzf.c:1435)
==25115==    by 0x11376A: hts_close (hts.c:924)
==25115==    by 0x10B046: write_format_values (test-vcf-api.c:339)
==25115==    by 0x10A16D: test_get_format_values (test-vcf-api.c:381)
==25115==    by 0x10A16D: main (test-vcf-api.c:390)
==25115==
bcf_get_format_float didn't produce the expected output.
==25115==
==25115== HEAP SUMMARY:
==25115==     in use at exit: 136,576 bytes in 69 blocks
==25115==   total heap usage: 213 allocs, 144 frees, 857,261 bytes allocated
==25115==
==25115== LEAK SUMMARY:
==25115==    definitely lost: 0 bytes in 0 blocks
==25115==    indirectly lost: 0 bytes in 0 blocks
==25115==      possibly lost: 0 bytes in 0 blocks
==25115==    still reachable: 136,576 bytes in 69 blocks
==25115==         suppressed: 0 bytes in 0 blocks
==25115== Rerun with --leak-check=full to see details of leaked memory
==25115==
==25115== For counts of detected and suppressed errors, rerun with: -v
==25115== Use --track-origins=yes to see where uninitialised values come from
==25115== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 70 from 30)

This is currently breaking htslib builds on Debian and will affect dependent packages if unresolved.

FYI, the suggested approach of setting -ffloat-store in CFLAGS did not help.

satta commented 7 years ago

FYI, we were able to work around this by adding -fno-strict-aliasing to CFLAGS.

daviesrob commented 7 years ago

Which one does it fix, the make test, or valgrind?

The problem with make test is a known one, caused by BCF's rather unfortunate (ab)use of signalling NANs (see samtools/hts-specs#145). It affects i386 because the FPU converts signalling NANs to quiet ones unless much care is taken. I suspect we need API changes to completely fix it, although #485 helped a bit.

We've seen problems with zlib and valgrind before, but never in crc32. Do you know if it happens with the latest zlib (1.2.11)? I'll have a go to see if I can reproduce.

satta commented 7 years ago

Which one does it fix, the make test, or valgrind?

The failing test. I was assuming the valgrind issue had to do with it, but then again it might not.

The problem with make test is a known one, caused by BCF's rather unfortunate (ab)use of signalling NANs (see samtools/hts-specs#145). It affects i386 because the FPU converts signalling NANs to quiet ones unless much care is taken. I suspect we need API changes to completely fix it, although #485 helped a bit.

OK. Thanks for the insight.

We've seen problems with zlib and valgrind before, but never in crc32. Do you know if it happens with the latest zlib (1.2.11)?

I haven't tried 1.2.11, 1.2.8 is the latest in Debian unstable.

daviesrob commented 7 years ago

It looks like valgrindhas spotted a real problem. write_format_values() in test-vcf-api.c does this:

    float test[4];
    bcf_float_set_missing(test[0]);
    test[1] = 47.11f;
    bcf_float_set_vector_end(test[2]);
    bcf_update_format_float(hdr, rec, "TF", test, 4);
    bcf_write1(fp, hdr, rec);

So it allocates an array of four values, sets three of them and then updates the record with four. This means the last item in the record is undefined, triggering the warning when it's eventually written out.

When the data is checked (in check_format_values()), the undefined value is converted to vector_end as bcf_get_format_values() ignores anything after a vector_end marker and sets all remaining output values to vector_end.

The fix in this case is trivial (just set test[3] to any value). @pd3 may want to comment on whether bcf_update_format() should also set the appropriate vector_end values after it's seen one in it's input.

pd3 commented 7 years ago

@daviesrob bcf_update_format() uses your serialize_float_array() which calls float_to_le(), so no special handling of vector_end values should be required.

daviesrob commented 7 years ago

@pd3 The problem is the values that come after the vector_end. Should they all be written as vector_end too, which is how they end up when read back in? Or possibly as zeros, which might compress better if there's a long run of them?

As the reading side ignores all of the data after vector_end, it would seem sensible that the writing side should too?

pd3 commented 7 years ago

The specification requires that all values after vector_end are filled with vector_end values (if the sample has fewer values than the rest of the samples). The bcf_update_format() function assumes that the caller provides output that is valid in this sense.

anderskaplan commented 7 years ago

Ironically I added this test case, with the behavior I could find in the code comments, as part of PR #485 to make sure I didn't mess up when I got the code to work with MSVC, 32-bit. So the real problem (i.e. bad output produced on i386) might have been there for a while, undetected.

For the write_format_values test case, I didn't realize that the vector should be padded with vector_end values. That should be an easy fix though and not something that will make things break in production.

I believe the issues reported by valgrind related to CRC32 is a whole different problem.

anderskaplan commented 7 years ago

I've reproduced the problem on Debian stretch i386. The problem seems to be that bcf_float_set sometimes doesn't do what it's supposed to, if an earlier floating-point operation has put the FPU in an erroneous state.

This piece of code demonstrates the behavior.

float f;
bcf_float_set_missing(f);
hexdump(&f, sizeof(float));
bcf_float_set_vector_end(f);
hexdump(&f, sizeof(float));

Where hexdump is a small utility function that takes a void as input, casts it to a uint8_t, and prints it byte by byte. The correct output on a little-endian system is:

0100807f
0200807f

But depending on what state the FPU is in, i.e., what operations were executed before this code, the output can also be:

0100c07f
0200807f

where the signaling NaN has been converted to quiet.

mapreri commented 7 years ago

Did anybody work on this bug in the last month?

daviesrob commented 7 years ago

No, mainly because it's not easy to fix. This signalling NAN issue arises due to reliance on behaviour which its not well-defined, especially on i386. Even if it's possible to fix it in one place, there's no guarantee that it won't pop up again somewhere else.

A full solution is likely to need changes in the way floating-point values are stored internally. This may also require changes to the ways in which they are stored and accessed. It's going to need some thought to get the right solution.

ginggs commented 7 years ago

We were able to work around this again for GCC 7 with -fno-strict-aliasing -fno-code-hoisting.

mr-c commented 6 years ago

We were able to work around this again for GCC 7 with -fno-strict-aliasing -fno-code-hoisting.

In Debian with GCC 8, this was insufficient, so we're dropping the i386* builds

mr-c commented 5 years ago

Dropping i386* htslib is proving to be quite messy on Debian, so I did an experimental build using GCC 7 with -fno-strict-aliasing -fno-code-hoisting. only for i386. However we get the same failures as before:

    /<<PKGBUILDDIR>>/test/test-vcf-api /tmp/YQmwmwn57N/test-vcf-api.bcf
bcf_get_format_float didn't produce the expected output.

.. failed ...

test_vcf_sweep:
    /<<PKGBUILDDIR>>/test/test-vcf-sweep /tmp/YQmwmwn57N/test-vcf-api.bcf

    The outputs differ:
        /<<PKGBUILDDIR>>/test/test-vcf-sweep.out
        /<<PKGBUILDDIR>>/test/test-vcf-sweep.out.new
.. failed ...

https://buildd.debian.org/status/fetch.php?pkg=htslib&arch=i386&ver=1.9-7~0~i386~gcc7~1&stamp=1544497780&raw=0

A fix for this would be greatly appreciated!

Otherwise we'll have to disable i386 for the following packages: blasr trinityrnaseq stacks segemehl samtools sambamba rsem rna-star reapr qtltools iva ariba python3-htseq cnvkit circlator pbdagcon pbbamtools nanopolish libvcflib-tools bio-eagle freebayes fastqtl delly crac blasr bcftools augustus

daviesrob commented 5 years ago

I'm afraid the only way I've found to reliably fix this using compiler flags is to use:

./configure CFLAGS='-g -O2 -msse -mfpmath=sse'

Packages that depend on htslib and work with vcf/bcf files would probably need to be built with the same options.

Unfortunately this means the packages won't run on some really old CPUs that Debian may still support (Pentium II era; Pentium III seems to be OK). I can't see anyone seriously running any of these on a CPU that old, but I can also see that Debain maintainers might not be too happy with forcing use of sse when the distribution claims not to need it.

mr-c commented 5 years ago

Thank you @daviesrob , that is an acceptable solution for me and it appears to work: https://buildd.debian.org/status/package.php?p=htslib&suite=experimental

Can the equivalent of CFLAGS=-msse -mfpmath=sse be added to the htslib build system automatically for i386?

ginggs commented 5 years ago

In Debian with GCC 8, this was insufficient, so we're dropping the i386* builds

htslib 1.7-2 builds and passes its tests with GCC 8 and glibc 2.28, so is it possible there is some new test in 1.8 or 1.9 that is too sensitive to floating point precision?

samtools / htslib

Failing test_vcf_api/test_vcf_sweep tests on i386 #565