rust-bio / rust-htslib

This library provides HTSlib bindings and a high level Rust API for reading and writing BAM files.

Passing `bam::record::Record` between threads causes a segfault #293

Open · DonFreed opened this issue 3 years ago

DonFreed commented 3 years ago

Thank you for the very nice library!

I'm working on a tool that passes reads between threads. Unfortunately, the tool is producing non-deterministic segmentation faults and other memory errors. I've traced some of these issues back to rust-htslib, which crashes somewhat randomly when bam::record::Records are passed between threads. I am not too familiar with this library, so I am wondering: is this the expected behavior?

Here is a simplified example that can reproduce the crash:

use std::error::Error;
use std::str;
use std::sync::mpsc::{self, Receiver};
use std::thread;

use rust_htslib::bam::{Read, Reader};
use rust_htslib::bam::record::Record;

fn sum_mapqs(rx: Receiver<Option<Record>>) -> Result<(), Box<dyn Error>> {
    let mut total_mapq = 0u64;
    loop {
        match rx.recv() {
            Ok(x) => {
                match x {
                    Some(read) => {
                        let mapq = read.mapq();
                        total_mapq = total_mapq.saturating_add(mapq as u64);
                    },
                    None => {  // No more data
                        println!("Total MapQ: {}", total_mapq);
                        return Ok(());
                    },
                }
            },
            Err(e) => {
                eprintln!("Error reciving data: {}", e);
            }
        }
    }
}

fn main() -> Result<(), Box<dyn Error>> {
    let mut bam = match Reader::from_stdin() {
        Ok(bam) => { bam },
        Err(e) => { return Err(Box::new(e)); },
    };

    // Initialize the writer thread
    let (tx, rx) = mpsc::channel();
    let writer = thread::spawn(move || {
        if let Err(e) = sum_mapqs(rx) {
            eprintln!("Error writing output - {}", e);
        }
    });

    for read in bam.records() {
        match read {
            Ok(read) => {
                eprintln!("Parsed read: {}", str::from_utf8(read.qname()).unwrap());
                if let Err(e) = tx.send(Some(read)) {
                    eprintln!("Error sending data to writer thread");
                    return Err(Box::new(e));
                }
            },
            Err(e) => {
                return Err(Box::new(e));
            },
        }
    }

    // Close the spawned thread
    let _ = tx.send(None);
    let _ = writer.join().unwrap();
    Ok(())
}

Compiling with `RUSTFLAGS="-g" cargo build` and then running with a SAM file passed through stdin produces the following:

$ cat test.sam | target/debug/rust-htslib-crash
...
Parsed read: H203:185:D2990ACXX:4:1101:11561:5493
23559 Broken pipe             cat test.sam
23560 Segmentation fault      (core dumped) | target/debug/rust-htslib-crash

Re-running the same command will produce the crash in different parts of the program. I've attached a backtrace from one of the crashes: crash backtrace.txt

sitag commented 3 years ago

@DonFreed Do you still get the error if you manually allocate a read and fill it in while iterating over the BAM, then copy it before passing it to the channel?

DonFreed commented 3 years ago

@sitag Thanks for the response. Manually allocating multiple records and moving them to a second thread produces the same type of error. Manually allocating a single record, passing this record back and forth between threads and eventually dropping the record in the original thread seems to work, which is probably an acceptable workaround for my use-case.

Here's a gist with modifications to the earlier example code that tests both cases: https://gist.github.com/DonFreed/c42725df1e3f81cfbd7e527016d34d6d.
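
In outline, that single-record ping-pong workaround might look like the sketch below (a minimal sketch, not the gist verbatim; it assumes a rust-htslib version where Read::read() returns Option<Result<()>>):

use std::sync::mpsc;
use std::thread;

use rust_htslib::bam::{self, Read};
use rust_htslib::bam::record::Record;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // two channels: one to lend the record out, one to take it back
    let (to_worker, from_main) = mpsc::channel::<Record>();
    let (to_main, from_worker) = mpsc::channel::<Record>();

    let worker = thread::spawn(move || {
        let mut total_mapq = 0u64;
        while let Ok(read) = from_main.recv() {
            total_mapq = total_mapq.saturating_add(read.mapq() as u64);
            // return the record instead of dropping it on this thread
            if to_main.send(read).is_err() {
                break;
            }
        }
        println!("Total MapQ: {}", total_mapq);
    });

    let mut bam = bam::Reader::from_stdin()?;
    let mut record = Record::new();
    while let Some(result) = bam.read(&mut record) {
        result?;
        to_worker.send(record).expect("worker hung up");
        record = from_worker.recv().expect("worker hung up");
    }
    drop(to_worker); // closing the channel ends the worker loop
    worker.join().unwrap();
    Ok(()) // the single Record is dropped here, on the spawning thread
}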

sitag commented 3 years ago

@DonFreed I suspect that it has to do with drop; there is a call to free in there. Good to know something is working for you, though passing the same record back and forth may mean you are blocking on that record being available, so you are not getting the full parallelism you want. You could just use rayon, turn the iterator into a parallel iterator, and not worry about channels.
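
For reference, the rayon route could look roughly like the following sketch (hypothetical; note that par_bridge() requires the iterator and its items to be Send, so it leans on exactly the unsafe impls in question):

use rayon::prelude::*; // provides par_bridge() and the parallel iterator adaptors
use rust_htslib::bam::{self, Read};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut bam = bam::Reader::from_stdin()?;
    let total_mapq: u64 = bam
        .records()
        .par_bridge() // turn the sequential iterator into a parallel one
        .map(|r| r.map(|read| read.mapq() as u64).unwrap_or(0))
        .sum();
    println!("Total MapQ: {}", total_mapq);
    Ok(())
}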

jch-13 commented 3 years ago

bam::record::Record holds an Rc (a non-thread-safe smart pointer) to the corresponding BAM header. When I replace Rc with its thread-safe alternative Arc in rust-htslib's src/bam/{record,mod}.rs, the segfault disappears.

sitag commented 3 years ago

@jch-13 bam::record::Record should not be marked Sync then.

jch-13 commented 3 years ago

> @jch-13 bam::record::Record should not be marked Sync then.

Send and Sync are implemented manually for Records here. Simply replacing Rc with Arc would likely come with a performance penalty. So I'm not sure how to move on from here... Do we need a smart pointer there or would even a simple reference do?
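
Abridged, the problematic pattern looks roughly like this (field names simplified for illustration; see src/bam/record.rs and src/bam/mod.rs for the real definitions):

use std::rc::Rc;

pub struct HeaderView { /* ... */ }

pub struct Record {
    // field name simplified; the real struct also wraps htslib's bam1_t
    header: Option<Rc<HeaderView>>, // non-atomic reference count
}

// asserted by hand: with an Rc inside, two threads can clone/drop the header
// handle concurrently, the refcount updates race, and the HeaderView is freed
// early or twice -- hence the nondeterministic segfaults
unsafe impl Send for Record {}
unsafe impl Sync for Record {}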

veldsla commented 3 years ago

There's always the option to remove the header reference. I've always found it a bit weird, given the structure of a SAM/BAM/CRAM, that every record should contain a reference to the header.

On the other hand, I usually do my parallelization over chromosomes/contigs using the IndexedReader, which avoids the burden of sending records across channels. But this might not be viable for every genome.
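
That per-chromosome pattern might look roughly like the following sketch (assumptions: an indexed BAM on disk, and a rust-htslib version whose fetch() accepts a bare tid; older versions take fetch(tid, beg, end)):

use std::thread;

use rust_htslib::bam::{self, Read};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = "test.bam"; // assumed to have a .bai index
    let n_targets = bam::IndexedReader::from_path(path)?.header().target_count();

    // one IndexedReader per thread, one target (chromosome/contig) per thread;
    // no Record ever crosses a thread boundary
    let handles: Vec<_> = (0..n_targets)
        .map(|tid| {
            thread::spawn(move || {
                let mut reader = bam::IndexedReader::from_path(path).unwrap();
                reader.fetch(tid as i32).unwrap();
                reader.records().map(|r| r.unwrap().mapq() as u64).sum::<u64>()
            })
        })
        .collect();

    let total_mapq: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("Total MapQ: {}", total_mapq);
    Ok(())
}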

sitag commented 3 years ago

@jch-13 Unless I am misunderstanding something: since Record uses non-thread-safe internals, the Send and Sync traits should not be implemented. That correctly conveys that users of the API are responsible for ensuring thread safety, and we avoid this situation.

jch-13 commented 3 years ago

Yes, you're absolutely right. However, being able to pass Records around between threads is quite a useful feature and it was there before the Rc was added in 6c537a0934b211ae9ef08a8483091720df6ad676 / 3fa020da8b04de5d8323f1bc0e6332f3bf154408.

veldsla commented 3 years ago

I think it's a bit painful to see a segfaulting issue go stale, so let me try to revive this. It seems clear that the current situation is invalid. I see three possible solutions, also mentioned above:

  1. Remove the Send/Sync implementations and make bam::Record not shareable.
  2. Replace the Rc<HeaderView> with Arc<HeaderView> and always pay the performance penalty.
  3. Remove Rc<HeaderView> from Record. Breaking change, but introduced fairly recently.

Anyone see another way out of this? Do we take a vote? I'm on 1 or 3, but mostly 3.

sitag commented 3 years ago

@veldsla I agree. As it stands, the Send/Sync impls are not valid. My primary vote is for 3 as well, then 1. There may be performance benefits from having a smaller bam::Record struct. HeaderView may be needed to get the reference sequence name from a tid, so we will need to document how to do that (cf. https://github.com/rust-bio/rust-htslib/issues/288). 2 is a no-go.
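
For reference, such a lookup goes through the HeaderView kept on the reader rather than on the Record, along these lines (a minimal sketch):

use rust_htslib::bam::{self, Read};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bam = bam::Reader::from_path("test.bam")?;
    let tid = 0u32;
    // HeaderView::tid2name returns the raw reference-name bytes for a target id
    let name = std::str::from_utf8(bam.header().tid2name(tid))?;
    println!("tid {} is {}", tid, name);
    Ok(())
}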

veldsla commented 3 years ago

I looked at the implications of removal. Mostly straightforward, except that it breaks the bio_types::genome::AbstractInterval implementation that is present for bam::Record.

The implementation is not really that well suited anyway. contig() panics when there is no header or on invalid UTF-8, and it performs UTF-8 validation on every call, which makes it unsuitable for use in a loop. range() panics when .cigar_cached() wasn't called on the record, which is just bad form. (I would also like to remove this cigar caching, but let's postpone that discussion.)

I have a branch with the changes, so a PR is trivial.

@johanneskoester, @pmarks, any thoughts?

pmarks commented 3 years ago

@veldsla I wonder if we could collect more data on 2? My naive expectation is that it would be a negligible performance difference that would be difficult to observe. Might be interesting to do a benchmark before we rule out that option on the grounds of performance, but I do agree this would go against the "slow features should be opt-in" principle.

For example, if decrementing the Arc is very fast compared to the rest of the Record::drop() impl, then just using an Arc makes the Record somewhat more usable.

I'm also OK with 3 if that's what people prefer. 1 is a definite no-go.

veldsla commented 3 years ago

@pmarks I did some tests using a simple program that reads every record in a single-threaded loop from a 1 GB BAM file containing 25M reads. Indeed, the performance difference in that scenario is not too big. Using 4 read threads I saw very little change:

Benchmark #1: ./htslibrctest
  Time (mean ± σ):      4.436 s ±  0.035 s    [User: 20.640 s, System: 0.630 s]
  Range (min … max):    4.392 s …  4.513 s    10 runs

Benchmark #2: ./htslibarctest
  Time (mean ± σ):      4.435 s ±  0.030 s    [User: 20.614 s, System: 0.651 s]
  Range (min … max):    4.379 s …  4.477 s    10 runs

Summary
  './htslibarctest' ran
    1.00 ± 0.01 times faster than './htslibrctest'

On a larger machine, upping the number of read threads to 20, a minor penalty becomes apparent:

Benchmark #1: ./htslibrctest
  Time (mean ± σ):      2.664 s ±  0.051 s    [User: 24.248 s, System: 1.336 s]
  Range (min … max):    2.578 s …  2.721 s    10 runs

Benchmark #2: ./htslibarctest
  Time (mean ± σ):      2.829 s ±  0.078 s    [User: 24.122 s, System: 1.295 s]
  Range (min … max):    2.737 s …  2.981 s    10 runs

Summary
  './htslibrctest' ran
    1.06 ± 0.04 times faster than './htslibarctest'

Of course this is only the case when the file is cached. In practice these operations will probably be mostly disk-bound. That doesn't mean the Arc has no overhead, but we're just limited by the reader.

veldsla commented 3 years ago

Having seen these results, I still don't change my preference. I think the header has no useful place in the Record, and sacrificing the AbstractInterval impl would be a good thing. Also, when doing my naive s/Rc/Arc/ I saw 17 occurrences of unsafe impl (Send|Sync) for structs that often contain an Rc, so a decent unsafe audit is in order.

sitag commented 3 years ago

@pmarks @veldsla I don't think 25M reads represents the kind of workflow we deal with. For methylation data we regularly work with 1G reads, and we expect data to get bigger, not smaller. A couple of years back we looked at both a young rust-htslib and a more mature bioD/sambamba, and went with bioD/sambamba because it was slightly more performant and that mattered, even though D came with tooling headaches. I think we should not sacrifice performance for ergonomics and abstraction.

veldsla commented 3 years ago

I deliberately used a small file to make sure IO was less of an issue, but it is already clear that we are doing IO, and that basically boils down to waiting. So unless you create your bam::Records from some other source, the limiting factor will be the reading/block decompression. We could still do a synthetic benchmark on a struct with an Arc vs. a struct with an Rc (or without one), and I'm sure the difference would be a lot more obvious, but in practice you probably won't observe it.
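
Such a synthetic benchmark could be as simple as timing clone-and-drop in a tight loop (illustrative only; as noted, real workloads are dominated by I/O):

use std::rc::Rc;
use std::sync::Arc;
use std::time::Instant;

fn main() {
    const N: u64 = 100_000_000;

    // each iteration clones and immediately drops the handle,
    // so this measures refcount increment + decrement
    let rc = Rc::new(0u64);
    let t = Instant::now();
    for _ in 0..N {
        std::hint::black_box(Rc::clone(&rc));
    }
    println!("Rc : {:?}", t.elapsed());

    let arc = Arc::new(0u64);
    let t = Instant::now();
    for _ in 0..N {
        std::hint::black_box(Arc::clone(&arc));
    }
    println!("Arc: {:?}", t.elapsed());
}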

But yes, I think so too:

> I think we should not sacrifice performance for ergonomics and abstraction

DonFreed commented 3 years ago

Reviving this issue again.

Personally, my vote is for 3. I would also think that 1 is a no-go as being unable to pass bam::Records between threads would limit the usefulness of the library. I'm not sure what the reasoning is for implementing bio_types::genome::AbstractInterval for bam::Records, so it's hard for me to weigh the merits of 2 vs. 3.

jch-13 commented 3 years ago

Trying to revive this again. What about adding a wrapper struct that contains a {bam,bcf}::Record and an Arc pointing to its header, and implementing AbstractInterval only for that? You'd then pay for (atomic) reference counting only when you actually need the header reference. So basically option 3, with a workaround for those who need AbstractInterval impls on records. It's not exactly elegant, though...
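
In outline, the proposed wrapper might look like this (a sketch; names invented for illustration):

use std::sync::Arc;

use rust_htslib::bam::{record::Record, HeaderView};

// hypothetical wrapper: the Record itself stays header-free,
// and only the pair pays for atomic reference counting
pub struct RecordWithHeader {
    pub record: Record,
    pub header: Arc<HeaderView>,
}

impl RecordWithHeader {
    // the lookup an AbstractInterval::contig() impl would need
    pub fn contig(&self) -> &str {
        std::str::from_utf8(self.header.tid2name(self.record.tid() as u32))
            .expect("reference name is not valid UTF-8")
    }
}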

tedil commented 3 years ago

Another option would be to have the Arc/Send+Sync variant (or AbstractInterval?) behind a feature gate which isn't part of the default features.
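
In outline, such a gate could select the pointer type at compile time (a sketch; "arc-records" is a made-up feature name):

use std::rc::Rc;
use std::sync::Arc;

pub struct HeaderView { /* ... */ }

#[cfg(feature = "arc-records")]
type HeaderPtr = Arc<HeaderView>;
#[cfg(not(feature = "arc-records"))]
type HeaderPtr = Rc<HeaderView>;

pub struct Record {
    header: Option<HeaderPtr>,
}

// thread-safety is only asserted when the atomic variant is compiled in
#[cfg(feature = "arc-records")]
unsafe impl Send for Record {}
#[cfg(feature = "arc-records")]
unsafe impl Sync for Record {}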

jch-13 commented 3 years ago

Good idea, but wouldn't that mean you can't use both AbstractInterval and Sync simultaneously?

ebioman commented 2 years ago

Hi, reviving this again: is this still a problem nowadays? I tried to pass records around with rayon and I think I am running into a similar issue:

// assuming `unbounded` comes from the crossbeam-channel crate
let (snd, rxv) = unbounded();
rayon::scope(|child| {
    for record in bam.records() {
        let snd_local = snd.clone();
        child.spawn(move |_| {
            let read = record.expect("ERROR: could not extract read information from record!");
            let tmp_result = str::from_utf8(read.qname()).unwrap().to_string();
            snd_local.send(tmp_result).expect("ERROR: thread could not communicate result!");
        });
    }
});
drop(snd);
let thread_results: Vec<String> = rxv.iter().collect();
dbg!(thread_results);

Executing this multiple times over a BAM file generally succeeds, but occasionally produces a core dump.

ebioman commented 2 years ago

Okay, this is pretty strange, as it works flawlessly if I pass a copy of the header into each task:

rayon::scope(|child| {
    for record in bam.records() {
        let snd_local = snd.clone();
        let _header_local = header.clone(); // per-record header clone; this made the crashes disappear
        child.spawn(move |_| {
            let read = record.expect("ERROR: could not extract read information from record!");
            let tmp_result: String = str::from_utf8(read.qname()).unwrap().to_string();
            snd_local.send(tmp_result).expect("ERROR: thread could not communicate result!");
        });
    }
});
drop(snd);
let thread_results: Vec<String> = rxv.iter().collect();
dbg!(thread_results);

veldsla commented 2 years ago

The Rc issues were never fixed; somebody needs to pick one of the solutions discussed above. Cloning the header probably avoids the illegal access.

You could also consider switching to noodles. It provides (async!) readers/writers for BAM and CRAM (but no unified record type). It's still actively changing, but it releases on crates.io and has a very responsive maintainer.

ebioman commented 2 years ago

Thanks, I see. At first I thought I wouldn't mind just passing the header along every time, as it worked well on small subsets. It turns out, though, that this makes my RAM explode very quickly on large BAM files :/ And obviously, if I just reference the header view instead of cloning it, it never makes it through the entire file but eventually hits that bug.

Update: I tried passing through a chunk iterator instead of the records, but the problem remains. I need tid and mtid, and therefore unfortunately the header. It seems that whenever I try this, I either hit the occasional segmentation fault or, if I clone the header, I kill my RAM :(
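
One possible middle ground (an untested sketch): snapshot the tid-to-name table once into an Arc and share that across tasks, instead of cloning the whole header per record:

use std::sync::Arc;

use rust_htslib::bam::{self, Read};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bam = bam::Reader::from_path("test.bam")?;

    // snapshot tid -> name once; sharing the Arc is cheap compared to
    // cloning the whole header for every task
    let tid_names: Arc<Vec<String>> = Arc::new(
        (0..bam.header().target_count())
            .map(|tid| String::from_utf8_lossy(bam.header().tid2name(tid)).into_owned())
            .collect(),
    );

    // inside each spawned task: clone the Arc and index it with
    // read.tid() / read.mtid() to recover the reference names
    let names = Arc::clone(&tid_names);
    println!("first target: {}", names[0]);
    Ok(())
}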

Ebedthan commented 2 years ago

Reviving this issue again.

Personally, I will go for 3. Can we have a solution for this?

jch-13 commented 2 years ago

I made a draft PR (#345) trying to work around this particular soundness issue. Please let me know what you think.

brentp commented 1 year ago

What's the consensus here? It seems that Arc (option 2) requires the least code change, and @veldsla has shown that the performance hit is tolerable. Is there some concurrency setup where this is not an issue? @sitag recommends using a rayon parallel iterator. Is there an example of that here?