DonFreed opened this issue 3 years ago (status: Open)
@DonFreed Do you still get the error if you manually allocate a read and fill it in while iterating over the BAM, then copy it before passing it to the channel?
@sitag Thanks for the response. Manually allocating multiple records and moving them to a second thread produces the same type of error. Manually allocating a single record, passing this record back and forth between threads and eventually dropping the record in the original thread seems to work, which is probably an acceptable workaround for my use-case.
Here's a gist with modifications to the earlier example code that tests both cases, https://gist.github.com/DonFreed/c42725df1e3f81cfbd7e527016d34d6d.
@DonFreed I suspect that it has to do with drop: there is a call to `free` in there. Good to know something is working for you, though passing the same record back and forth may mean you are blocking on that record being available, so you are not getting the full parallelism you want. You could just use rayon and turn the iterator into a parallel iterator, and not worry about channels.
`bam::record::Record` holds an `Rc` (a non-thread-safe smart pointer) to the corresponding BAM header. When I replace the `Rc` with its thread-safe alternative `Arc` in rust-htslib's `src/bam/{record,mod}.rs`, the segfault disappears.
@jch-13 `bam::record::Record` should not be marked `Sync` then.
> @jch-13 `bam::record::Record` should not be marked `Sync` then.
`Send` and `Sync` are implemented manually for `Record`s here. Simply replacing `Rc` with `Arc` would likely come with a performance penalty. So I'm not sure how to move on from here... Do we need a smart pointer there, or would even a simple reference do?
There's always the option to remove the header reference. I've always found it a bit weird, given the structure of a SAM/BAM/CRAM, that every record should contain a reference to the header.
On the other hand, I usually do my parallelization over chromosomes/contigs using the `IndexedReader`. This avoids the burden of sending records across channels, but it might not be viable for every genome.
@jch-13 Unless I am misunderstanding something: since `Record` uses non-thread-safe internals, the `Send` and `Sync` traits should not be implemented. That correctly conveys that users of the API are responsible for ensuring thread safety, and we avoid this situation.
Yes, you're absolutely right. However, being able to pass `Record`s around between threads is quite a useful feature, and it was there before the `Rc` was added in 6c537a0934b211ae9ef08a8483091720df6ad676 / 3fa020da8b04de5d8323f1bc0e6332f3bf154408.
I think it's a bit painful to see a segfaulting issue go stale, so let me try to revive this. It seems clear that the current situation is invalid. I see three possible solutions, also mentioned before:

1. Make `bam::Record` not shareable.
2. Replace `Rc<HeaderView>` with `Arc<HeaderView>` and always pay the performance penalty.
3. Remove the `Rc<HeaderView>` from `Record`. Breaking change, but it was introduced fairly recently.

Anyone see another way out of this? Do we take a vote? I'm on 1 or 3, but mostly 3.
@veldsla I agree. As it stands, the `Send`/`Sync` impls are not valid. My primary vote is for 3 as well, then 1. There may be performance benefits from having a smaller `bam::Record` struct. `HeaderView` may be needed to get the reference sequence name from a `tid`, so we will need to document how to do that (c.f. https://github.com/rust-bio/rust-htslib/issues/288).
2 is a no-go.
I looked at the implications of removal. Mostly straightforward, except that it breaks the `bio_types::genome::AbstractInterval` implementation that is present for `bam::Record`.
The implementation is not really that well suited anyway. `contig()` panics on a missing header or invalid UTF-8, and it performs UTF-8 validation on every call, which makes it unsuitable for calling in a loop. `range()` panics when `.cigar_cached()` wasn't called on the record, which is just bad form. (I would also still like to remove the cigar caching, but let's postpone that discussion.)
I have a branch with the changes, so a PR is trivial.
@johanneskoester, @pmarks, any thoughts?
@veldsla I wonder if we could collect more data on 2? My naive expectation is that it would be a negligible performance difference that would be difficult to observe. Might be interesting to do a benchmark before we rule out that option on the grounds of performance, but I do agree this would go against the "slow features should be opt-in" principle.
For example, if decrementing the `Arc` is very fast compared to the rest of the `Record::drop()` impl, then just using an `Arc` makes the `Record` somewhat more usable.
I'm also OK with 3 if that's what people prefer. 1 is a definite no-go.
@pmarks I did some tests using a simple program that reads every record, in a single-threaded loop, from a 1 GB BAM file containing 25M reads. Indeed, the performance difference in that scenario is not too big. With 4 read threads I saw very little change:
```
Benchmark #1: ./htslibrctest
  Time (mean ± σ):      4.436 s ±  0.035 s    [User: 20.640 s, System: 0.630 s]
  Range (min … max):    4.392 s …  4.513 s    10 runs

Benchmark #2: ./htslibarctest
  Time (mean ± σ):      4.435 s ±  0.030 s    [User: 20.614 s, System: 0.651 s]
  Range (min … max):    4.379 s …  4.477 s    10 runs

Summary
  './htslibarctest' ran
    1.00 ± 0.01 times faster than './htslibrctest'
```
On a larger machine, upping the number of read threads to 20, a minor penalty becomes apparent:
```
Benchmark #1: ./htslibrctest
  Time (mean ± σ):      2.664 s ±  0.051 s    [User: 24.248 s, System: 1.336 s]
  Range (min … max):    2.578 s …  2.721 s    10 runs

Benchmark #2: ./htslibarctest
  Time (mean ± σ):      2.829 s ±  0.078 s    [User: 24.122 s, System: 1.295 s]
  Range (min … max):    2.737 s …  2.981 s    10 runs

Summary
  './htslibrctest' ran
    1.06 ± 0.04 times faster than './htslibarctest'
```
Of course this is only the case when the file is cached. In practice these operations will probably be mostly disk-bound. That doesn't mean the Arc has no overhead, but we're just limited by the reader.
Having seen these results, I still don't change my preference. I think the header has no useful place in the `Record`, and sacrificing the `AbstractInterval` trait would be a good thing. Also, when doing my naive `s/Rc/Arc/` I saw 17 occurrences of `unsafe impl (Send|Sync)` for structs that often contain an `Rc`. So a decent unsafe audit is in order.
@pmarks @veldsla I don't think 25M reads represents the kind of workflow we deal with. For methylation data we regularly work with 1G reads; we expect data to get bigger, not smaller. A couple of years back we looked at both a young rust-htslib and a more mature bioD/sambamba, and went with bioD/sambamba because it was slightly more performant and that mattered, even though D came with tooling headaches. I think we should not sacrifice performance for ergonomics and abstraction.
I deliberately used a small file to make sure IO was less of an issue, but it is already clear that we are doing IO, and that basically boils down to waiting. So unless you construct your `bam::Record`s from some other source, the limiting factor will be the reading/block decompression. We could still do a synthetic benchmark on a struct with an `Arc` vs. a struct with (or without) an `Rc`, and I'm sure the difference would be a lot more obvious, but in practice you probably won't observe it.
But yes, I think so too:

> I think we should not sacrifice performance for ergonomics and abstraction
Reviving this issue again.
Personally, my vote is for 3. I would also think that 1 is a no-go, as being unable to pass `bam::Record`s between threads would limit the usefulness of the library. I'm not sure what the reasoning is for implementing `bio_types::genome::AbstractInterval` for `bam::Record`s, so it's hard for me to weigh the merits of 2 vs. 3.
Trying to revive again. What about adding a wrapper struct that contains a `{bam,bcf}::Record` and an `Arc` pointing to its header, and only implementing `AbstractInterval` for that? Then you'd pay for (atomic) reference counting only when you actually need the header reference. So basically option 3, with a workaround for those who need the `AbstractInterval` impls on records. It's not exactly elegant, though...
Another option would be to put the `Arc`/`Send`+`Sync` variant (or `AbstractInterval`?) behind a feature gate that isn't part of the default features.
Good idea, but wouldn't that mean you can't use both `AbstractInterval` and `Sync` simultaneously?
Hi, reviving this as well. Is this still a problem nowadays? I tried to pass records around with rayon and I think I am running into a similar issue:

```rust
let thread_results: Vec<String>;
let (snd, rxv) = unbounded();
rayon::scope(|child| {
    for record in bam.records() {
        let snd_local = snd.clone();
        child.spawn(move |_| {
            let read = record.expect("ERROR: could not extract read information from record!");
            let tmp_result = str::from_utf8(read.qname()).unwrap().to_string();
            snd_local.send(tmp_result).expect("ERROR: thread could not communicate result!");
        });
    }
});
drop(snd);
thread_results = rxv.iter().collect();
dbg!(thread_results);
```
Executing it multiple times over a BAM file generally succeeds, but occasionally generates a core dump.
Okay, this is pretty strange, as it works flawlessly if I pass a copy of the header inside via:

```rust
rayon::scope(|child| {
    for record in bam.records() {
        let snd_local = snd.clone();
        let _header_local = header.clone();
        child.spawn(move |_| {
            let read = record.expect("ERROR: could not extract read information from record!");
            let tmp_result: String = str::from_utf8(read.qname()).unwrap().to_string();
            snd_local.send(tmp_result).expect("ERROR: thread could not communicate result!");
        });
    }
});
drop(snd);
thread_results = rxv.iter().collect();
dbg!(thread_results);
```
The `Rc` issues were never fixed. Somebody needs to pick one of the solutions discussed above. Cloning the header probably avoids the illegal access.
You could also consider switching to noodles. It provides (async!) reader/writers for BAM and CRAM (but no unified record). It's still actively changing, but releases on crates.io and has a very responsive maintainer.
Thanks, I see. At first I thought I wouldn't mind just passing the header along every time, as it worked well on small subsets. It turns out, though, that this makes my RAM explode very quickly on large BAM files :/ And obviously, if I just reference the header view instead of cloning it, it never gets through the entire file and eventually hits this bug.
Update: I tried passing things through a chunk iterator instead of individual records, but the problem remains. I need `tid` and `mtid`, and consequently the header. Whenever I try that, I hit either the occasional segmentation fault or, if I clone the header, I kill my RAM :(
Reviving this issue again.
Personally, my vote is for 3. Can we get a solution for this?
I made a draft PR (#345) trying to work around this particular soundness issue. Please let me know what you think.
What's the consensus here? It seems that `Arc` (option 2) requires the least code change, and @veldsla has shown that the performance hit is tolerable.
Is there some concurrency setup where this is not an issue? @sitag recommends using a rayon parallel iterator. Is there an example of this here?
Thank you for the very nice library!
I'm working on a tool that passes reads between threads. Unfortunately, the tool is producing non-deterministic segmentation faults and other memory errors. I've traced some of these issues back to rust-htslib, which will crash somewhat randomly when `bam::record::Record`s are passed between threads. I am not too familiar with this library, so I am wondering: is this the expected behavior?
Here is a simplified example that can reproduce the crash:
Compiling with `RUSTFLAGS="-g" cargo build` and then running with a SAM passed through stdin produces the following:
Re-running the same command will produce the crash in different parts of the program. I've attached a backtrace from one of the crashes: crash backtrace.txt