rust-bio / rust-htslib

This library provides HTSlib bindings and a high level Rust API for reading and writing BAM files.
MIT License
301 stars 77 forks source link

Certain BAM files cause silent failures or segfaults #156

Closed wjv closed 3 years ago

wjv commented 5 years ago

Background

According to the format specification, a BAM file can store reference sequence information in two places in its headers:

This is clearly bad. The refseqs stored in these two places should not disagree, but what if they do? The spec doesn't cover that, and it is left up to individual tools and libraries.

When rust-htslib fails

Consider a trival program that reads in a BAM file, iterates over its records, and writes it to an output BAM file:

use rust_htslib::bam;
use rust_htslib::bam::Read;

fn main() {
    let mut infile = bam::Reader::from_path("input.bam").unwrap();
    let header = bam::Header::from_template(infile.header());
    let mut outfile = bam::Writer::from_path("output.bam", &header, bam::Format::BAM).unwrap();

    let mut record = bam::Record::new();
    while let Ok(true) = infile.read(&mut record) {
        outfile.write(&record).unwrap();
    };
}

If you give this program an input BAM file that was written by a tool based on Biohazard (which has no @SQ fields in the SAM header) it will fail silently. Its output will be a truncated BAM file containing only the headers, yet it will exit cleanly with a 0 exit code. 😳

If you change the Rust program to use the bam::Format::SAM output format instead and give it the same BAM as input, it will segfault.

Note that the input BAM file created by Biohazard is technically spec-compliant, and may be processed by samtools. A (ridiculous) workaround is to first roundtrip the input BAM file via SAM using samtools:

samtools view -h broken.bam | samtools view -b -o fixed.bam

How should the issue be addressed?

I can see how this could be subject to differing opinions. However, at the very least, rust-htslib should probably never fail silently(!)

The rest of the issue isn't so much a bug in rust-htslib as it is the fact that the BAM format isn't clearly specified. My suggestions/questions:

Finally, this issue is probably also the root cause of #134 posted by @slazicoicr.

dlaehnemann commented 5 years ago

Wow, nice diagnosis. Failing silently and with a 0 exit status is indeed a bad idea, and should be fixed whenever possible. If you would like to suggest a fix, I think that would be welcomed by everybody. If you don't have the time or don't feel comfortable, we could see if somebody else can find the time.

As for general header stuff, there is a huge change to the htslib header API coming up in the next major release -- they have recently merged lots of big PRs regarding that: https://github.com/samtools/htslib/pulls?utf8=%E2%9C%93&q=is%3Apr+%22%5Bheader+API%5D%22

So in the mid-term, rust-htslib will have to be adapted to that new API, which will hopefully be more consistent inherently, not requiring so much of a dance around different possible cases. But a short-term fix of the silent failure is still warranted, I'd say.

dlaehnemann commented 5 years ago

Ah, and my two cents regarding a fix:

wjv commented 5 years ago

Thanks for the response @dlaehnemann!

I did not immediately suggest a fix, firstly because it was only very late last night that I finally grokked what was going on here (having spent the day staring at a debugger) and secondly because any fix would need to cross the unsafe Rust/C boundary, which is indeed something I'm not very comfortable with. (And of course, I did want some input as to how others think this unusual issue should be resolved.)

That all said, I now have a much better grip on the rust-htslib codebase than I did 48 hours ago, and I have a fair idea as to where the changes need to be applied. But somebody more familiar with the code would definitely have to double-check my work.

I didn't know about the upcoming changes to htslib itself. That might indeed make this point moot in the long term.

dlaehnemann commented 5 years ago

Then I would say, feel free to suggest a fix if you can find the time -- no matter what comes around with the new header API, making header handling work sanely for rust-htslib as soon as possible seems important!

I'll be offline until next Monday, but will then gladly review any suggested changes or dig in further if I can be of any help. And wherever it leaves my comfort zone, I'll just bubble it up further to more knowledgeable people! ;)

johanneskoester commented 5 years ago

Thanks for the nice analysis, I wasn't aware of that issue at all. Looking forward to your PR and happy to review.

brainstorm commented 4 years ago

@wjv Would you mind running your test bam file against master? As you might have noticed we bumped up to upstream's htslib 1.10.2, so extra testing (including your excellent dissection on this issue) would be very welcome!

slazicoicr commented 4 years ago

I checked the header of the bam file that did not work in #134 with 0.31.0 and it is still empty, meaning that this issue persists.

Trying to grok this better, I think the root issue of the missing raw binary encoding is the bam::Header, which does not correctly represent sam_hdr_t (https://github.com/rust-bio/rust-htslib/blob/master/hts-sys/linux_prebuilt_bindings.rs#L8128-L8139). This is because HeaderView.as_bytes (https://github.com/rust-bio/rust-htslib/blob/master/src/bam/mod.rs#L969) is badly named, as only the sam_hdr_t.text bytes are returned (there is a comment on the function recognizing that fact). This means that bam::Header lacks the n_targets, l_text, and target_name necessary display and correctly reconstruct the header.

i think this should be approached in two phases:

jch-13 commented 3 years ago

The new header API indeed seems to help! The text field has been deprecated in favor of the binary representation (https://github.com/samtools/htslib/blob/develop/NEWS#L741-L748). By using sam_hdr_str() instead of accessing the text directly, the text will be re-created based on the information in the binary representation. However, this is only done when an internal dirty flag is set and that's only set when a header-modifying function is called. So the problem persists even if we switch to sam_hdr_str().

A simple solution would be to call a htslib function which invalidates the text field, for example sam_hdr_line_index() (https://github.com/samtools/htslib/blob/85b0a7248dc0bc049df93edb88611d0735b649ed/htslib/sam.h#L670-L678). If this function is called when a BAM file is loaded by bam::Reader, subsequent calls to sam_hdr_str() yield text representations that are in sync with the binary one: https://github.com/rust-bio/rust-htslib/compare/master...jch-13:fix/bam_silent_header_fail This has been tested with the BAM file from #134 as well as a file created by software from the biohazard-tools suite.

I'll quickly file a PR if this is considered a valid fix. :)

Edit: We don't even need to modify the header, it's sufficient to query an entry.