Open satta opened 7 years ago
FYI, we were able to work around this by adding -fno-strict-aliasing
to CFLAGS
.
Which one does it fix, the make test
, or valgrind
?
The problem with make test
is a known one, caused by BCF's rather unfortunate (ab)use of signalling NANs (see samtools/hts-specs#145). It affects i386 because the FPU converts signalling NANs to quiet ones unless much care is taken. I suspect we need API changes to completely fix it, although #485 helped a bit.
We've seen problems with zlib and valgrind before, but never in crc32
. Do you know if it happens with the latest zlib (1.2.11)? I'll have a go to see if I can reproduce.
Which one does it fix, the make test, or valgrind?
The failing test. I was assuming the valgrind issue had to do with it, but then again it might not.
The problem with make test is a known one, caused by BCF's rather unfortunate (ab)use of signalling NANs (see samtools/hts-specs#145). It affects i386 because the FPU converts signalling NANs to quiet ones unless much care is taken. I suspect we need API changes to completely fix it, although #485 helped a bit.
OK. Thanks for the insight.
We've seen problems with zlib and valgrind before, but never in crc32. Do you know if it happens with the latest zlib (1.2.11)?
I haven't tried 1.2.11, 1.2.8 is the latest in Debian unstable.
It looks like valgrind
has spotted a real problem. write_format_values()
in test-vcf-api.c
does this:
float test[4];
bcf_float_set_missing(test[0]);
test[1] = 47.11f;
bcf_float_set_vector_end(test[2]);
bcf_update_format_float(hdr, rec, "TF", test, 4);
bcf_write1(fp, hdr, rec);
So it allocates an array of four values, sets three of them and then updates the record with four. This means the last item in the record is undefined, triggering the warning when it's eventually written out.
When the data is checked (in check_format_values()
), the undefined value is converted to vector_end as bcf_get_format_values()
ignores anything after a vector_end marker and sets all remaining output values to vector_end.
The fix in this case is trivial (just set test[3] to any value). @pd3 may want to comment on whether bcf_update_format()
should also set the appropriate vector_end values after it's seen one in it's input.
@daviesrob bcf_update_format()
uses your serialize_float_array()
which calls float_to_le()
, so no special handling of vector_end
values should be required.
@pd3 The problem is the values that come after the vector_end. Should they all be written as vector_end
too, which is how they end up when read back in? Or possibly as zeros, which might compress better if there's a long run of them?
As the reading side ignores all of the data after vector_end
, it would seem sensible that the writing side should too?
The specification requires that all values after vector_end
are filled with vector_end
values (if the sample has fewer values than the rest of the samples). The bcf_update_format()
function assumes that the caller provides output that is valid in this sense.
Ironically I added this test case, with the behavior I could find in the code comments, as part of PR #485 to make sure I didn't mess up when I got the code to work with MSVC, 32-bit. So the real problem (i.e. bad output produced on i386) might have been there for a while, undetected.
For the write_format_values
test case, I didn't realize that the vector should be padded with vector_end
values. That should be an easy fix though and not something that will make things break in production.
I believe the issues reported by valgrind related to CRC32 is a whole different problem.
I've reproduced the problem on Debian stretch i386. The problem seems to be that bcf_float_set
sometimes doesn't do what it's supposed to, if an earlier floating-point operation has put the FPU in an erroneous state.
This piece of code demonstrates the behavior.
float f;
bcf_float_set_missing(f);
hexdump(&f, sizeof(float));
bcf_float_set_vector_end(f);
hexdump(&f, sizeof(float));
Where hexdump is a small utility function that takes a void as input, casts it to a uint8_t, and prints it byte by byte. The correct output on a little-endian system is:
0100807f
0200807f
But depending on what state the FPU is in, i.e., what operations were executed before this code, the output can also be:
0100c07f
0200807f
where the signaling NaN has been converted to quiet.
Did anybody work on this bug in the last month?
No, mainly because it's not easy to fix. This signalling NAN issue arises due to reliance on behaviour which its not well-defined, especially on i386. Even if it's possible to fix it in one place, there's no guarantee that it won't pop up again somewhere else.
A full solution is likely to need changes in the way floating-point values are stored internally. This may also require changes to the ways in which they are stored and accessed. It's going to need some thought to get the right solution.
We were able to work around this again for GCC 7 with -fno-strict-aliasing -fno-code-hoisting
.
We were able to work around this again for GCC 7 with -fno-strict-aliasing -fno-code-hoisting.
In Debian with GCC 8, this was insufficient, so we're dropping the i386*
builds
Dropping i386*
htslib is proving to be quite messy on Debian, so I did an experimental build using GCC 7 with -fno-strict-aliasing -fno-code-hoisting. only for i386. However we get the same failures as before:
/<<PKGBUILDDIR>>/test/test-vcf-api /tmp/YQmwmwn57N/test-vcf-api.bcf
bcf_get_format_float didn't produce the expected output.
.. failed ...
test_vcf_sweep:
/<<PKGBUILDDIR>>/test/test-vcf-sweep /tmp/YQmwmwn57N/test-vcf-api.bcf
The outputs differ:
/<<PKGBUILDDIR>>/test/test-vcf-sweep.out
/<<PKGBUILDDIR>>/test/test-vcf-sweep.out.new
.. failed ...
A fix for this would be greatly appreciated!
Otherwise we'll have to disable i386
for the following packages:
blasr
trinityrnaseq
stacks
segemehl
samtools
sambamba
rsem
rna-star
reapr
qtltools
iva
ariba
python3-htseq
cnvkit
circlator
pbdagcon
pbbamtools
nanopolish
libvcflib-tools
bio-eagle
freebayes
fastqtl
delly
crac
blasr
bcftools
augustus
I'm afraid the only way I've found to reliably fix this using compiler flags is to use:
./configure CFLAGS='-g -O2 -msse -mfpmath=sse'
Packages that depend on htslib and work with vcf/bcf files would probably need to be built with the same options.
Unfortunately this means the packages won't run on some really old CPUs that Debian may still support (Pentium II era; Pentium III seems to be OK). I can't see anyone seriously running any of these on a CPU that old, but I can also see that Debain maintainers might not be too happy with forcing use of sse when the distribution claims not to need it.
Thank you @daviesrob , that is an acceptable solution for me and it appears to work: https://buildd.debian.org/status/package.php?p=htslib&suite=experimental
Can the equivalent of CFLAGS=-msse -mfpmath=sse
be added to the htslib build system automatically for i386?
In Debian with GCC 8, this was insufficient, so we're dropping the
i386*
builds
htslib 1.7-2 builds and passes its tests with GCC 8 and glibc 2.28, so is it possible there is some new test in 1.8 or 1.9 that is too sensitive to floating point precision?
I'm forwarding Debian bug 865012 (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=865012), which mentions failing tests on i386. I can reproduce these on a plain Debian stretch i386 install with the most recent htslib release 1.5:
It looks like the
test_vcf_sweep
failure is a consequence of thetest_vcf_api
test failing. That failure seems to be caused bytest-vcf-api.c:360
:in particular the
bcf_float_is_vector_end()
checks which are failing. Values read do not match the constants defined invcf.c
forbcf_float_vector_end
. Apparently there is also some amount of uninitialized memory handled while writing the formatted file in the first place:This is currently breaking htslib builds on Debian and will affect dependent packages if unresolved.
FYI, the suggested approach of setting
-ffloat-store
in CFLAGS did not help.