samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
649 stars 240 forks source link

test_vcf_merge failures on arm 32-bit with FPU: Bus error #2036

Closed emollier closed 9 months ago

emollier commented 10 months ago

Greetings,

Following recent bump of bcftools to version 1.18 in Debian sid, the build server for armhf reported 44 test failures which from quick inspection are all Bus errors. Full build log of bcftools for armhf is available on buildd.debian.org.

The relevant portion of the buildd log looks typically like:

test_vcf_merge:
    /<<PKGBUILDDIR>>/bcftools merge --no-version --force-samples /tmp/YVqRgiAYOP/merge.LPL.a.vcf.gz /tmp/YVqRgiAYOP/merge.LPL.b.vcf.gz /tmp/YVqRgiAYOP/merge.LPL.c.vcf.gz
bash: line 1: 3629434 Bus error               ( /<<PKGBUILDDIR>>/bcftools merge --no-version --force-samples /tmp/YVqRgiAYOP/merge.LPL.a.vcf.gz /tmp/YVqRgiAYOP/merge.LPL.b.vcf.gz /tmp/YVqRgiAYOP/merge.LPL.c.vcf.gz ) 2> /tmp/YVqRgiAYOP/tmp.e > /tmp/YVqRgiAYOP/tmp.o

    Non-zero status 135
.. failed ...

The issue does not appear on arm 64-bit (armv8+). The issue does not appear on arm 32-bit without FPU (armv6). It looks specific to arm 32-bit with FPU (armv7 "hard float"). I could reproduce the same test failures on Qemu. From quick glimpse in the build log, only test_vcf_merge is affected.

For information, Étienne.

emollier commented 9 months ago

Hi, if that helps, I did some bisection this evening which pointed to f2d2fdf8faa8098bc49072571596b9ab822ef7a6.

daviesrob commented 9 months ago

That might help, thanks. I'm having trouble reproducing on QEMU (qemu-system-arm -M virt) though. Could you give me some hints as to your set-up, please?

At the moment, lscpu on my emulated box says:

Architecture:           armv7l
  Byte Order:           Little Endian
CPU(s):                 4
  On-line CPU(s) list:  0-3
Vendor ID:              ARM
  Model name:           Cortex-A15
    Model:              1
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r2p1
    BogoMIPS:           125.00
    Flags:              half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv
                        4 idiva idivt vfpd32 lpae evtstrm
emollier commented 9 months ago

According to Debian Architectures Memo, your CPU selection should cover the armhf requirements.

I'm running qemu-system-arm through Linux binfmt support, so effectively change root to a Debian armhf system and work from here on an x86_64 box without spawning an entire virtual machine. I suspect the buildd is running a variant of this, but on top of aarch64 instead of x86_64, to avoid qemu in the equation.

daviesrob commented 9 months ago

I'm still not getting any bus errors, but I'm using qemu-system and not qemu-user.

Would it be possible to upload your bcftools and libhts.so files here so I could inspect them and try running them myself? Also, a core file would be very useful if you've got one.

emollier commented 9 months ago

You will find my artifacts attached below, they include the bcftools and libhts.so files: artifacts.zip

I didn't capture a core file, and I (re)discovered qemu-user didn't play well with gdb, so couldn't trace the issue more accurately.

daviesrob commented 9 months ago

Thanks for the binaries, although I'm afraid I'm still not getting bus errors when running them in my armv7 VM. I may have to try getting qemu user mode to work, to see if it makes a difference.

One odd thing I noticed was that your bcftools binary didn't report a version number, and gave test results that looked more like those from develop than 1.18. Which version did you check out and build?

emollier commented 9 months ago

Oops, I was giving a try to the latest version in develop branch just in case, and I accidently sent that build. I actually checked out both 1.18 and develop, and had the bus errors on both, sorry for my confusing upload. Just in case it changes something, I uploaded the proper bcftools build from 1.18 checkout (which still raised bus errors in my environment); I'm not expecting it changes much the outcome on your end at this point though.

bcftools.zip

I suspect the main difference might stem from our respective testbeds. I'm using the distribution Debian for architecture armhf, but the very similar Debian armel is not affected, neither was the Debian arm64 variant. But you're probably already running a Debian or derivative armhf distribution. The buildd is different and I believe it runs an armhf distribution on top of aarch64.

I wonder, what would be the outcome on your end of attempting to run the armhf distribution on top of the aarch64 emulator?

daviesrob commented 9 months ago

I've tried armhf binaries on an aarch64 QEMU virtual machine, and it's still working for me. I may see if I can find a real cpu to try it on, in case that makes a difference.

emollier commented 9 months ago

Hi,

I have tried some printf based debugging in the meantime, and I think I could pinpoint something in the BRANCH macro of the _merge_formatfield function. I must apologize in advance, for it is hard to describe for me and I may be confusing.

In the test cases I isolated, the bus error occurred when _merge_formatfield hits the case BCF_BT_FLOAT. In such situation, the src is declared as a float pointer, type casted from _fmtori->p. However, _fmtori is of type _bcf_fmtt, and from htslib/vcf.h, _fmtori->p thus points to a _uint8t which is shorter than a float:

typedef struct bcf_fmt_t {
    int id;             // id: numeric tag id, the corresponding string is bcf_hdr_t::id[BCF_DT_ID][$id].key
    int n, size, type;  // n: number of values per-sample; size: number of bytes per-sample; type: one of BCF_BT_* types
    uint8_t *p;         // same as vptr and vptr_* in bcf_info_t below
    uint32_t p_len;
    uint32_t p_off:31, p_free:1;
} bcf_fmt_t;

Under 32-bit aligned adresses, this is not a problem and the program goes on:

<<DEBUG>> sizeof(*(fmt_ori->p)) = 1
<<DEBUG>> fmt_ori->p = 0x510ca0, float = 0.000000
<<DEBUG>> src float is 3.26083e-307

But with unaligned adresses, the dereference of the type cast src fails:

<<DEBUG>> sizeof(*(fmt_ori->p)) = 1
<<DEBUG>> fmt_ori->p = 0x50ffd6, float = 0.000000
qemu: uncaught target signal 7 (Bus error) - core dumped

Indeed, 0x50ffd6 is off by 2 bytes. As a side note I'm a bit bugged the plain _*(fmtori->p) derefence to print the floating point value doesn't trip SIGBUS; perhaps this is normal if printf read each byte one by one.

From quick attempt, modifying the htslib to promote p to type _uint32t doesn't look trivial (nor does it look to be the right change to preserve ABI compatibility), so I couldn't test my hypothesis that it would fix the issue. The nearest hunks affecting this BRANCH macro happen later than the segments affected by the bus error, so this hasn't helped pointing a finger to the wrong statement yet. But we're getting closer.

Thank you for your time, I feel like I demanded you a lot this week.

In hope this helps, Étienne.

daviesrob commented 9 months ago

Thanks for the analysis. I've happily also managed to reproduce this on my Raspberry Pi, and found it stopped at the point you've indicated. Evidently the real hardware is a bit more strict than the qemu emulated machine. Hopefully it shouldn't be too difficult to fix, once I've worked out what's going on in the huge BRANCH macro.

emollier commented 9 months ago

Impressive, thank you! I could confirm that d45cced did fix the test failures in my context too. The test suite went through when your patch is applied on top of version 1.18. :)

daviesrob commented 9 months ago

Excellent, I've turned it (with a minor adjustment) into pull request #2042.