klmr closed this issue 2 years ago
Thanks for the bug report, and ugh, I hate threading errors! Out of curiosity, does samtools view with threading have the same issue? Samtools' CRAM code was copied from io_lib, and I also later migrated io_lib's thread pool. However, a number of bugs were fixed which may not have been backported to io_lib. The first step is probably to make the same thread pool fixes.
If you can reproduce this with a public CRAM file it would be very helpful. I'll try this myself too when I've finished catching up with a couple of weeks of email.
As far as I can tell, `samtools view` works fine on this file, even with multiple threads.
The file uses public data, so I could provide it. Do you have somewhere where I could upload the file?
I don't have any Sanger-provided storage for anonymous uploads any more, so I can't host things like this myself. Sorry.
However, for now I'll experiment with some public NA12878 data (aligned against GRCh38). I'm assuming the final step in producing the CRAM was BQSR, in which case it may differ from my own samtools-created CRAMs in what data type goes where, but we'll see.
I'm running some experiments now to try to reproduce it. Note you can also do a full recompile using `make CC="gcc -fsanitize=thread"`, which should enable some basic thread error checking. It'll spot things such as attempting to access the same memory address from multiple threads where it's been written by one and read by another without an intervening thread lock. Sometimes it'll turn an intermittent (time-based) error into a 100% failure, which aids reducing the problem down to a manageable size and also gives more confidence that the fix is genuine. (It's considerably slower though.)
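To spell that out, here's a minimal sketch of the ThreadSanitizer rebuild. It assumes a configured io_lib source tree whose Makefile honours CC; the `make clean` step is just my addition to force everything to be recompiled with the instrumentation:

```sh
# Rebuild everything with ThreadSanitizer instrumentation.
# Assumes a configured io_lib checkout whose Makefile honours $(CC).
make clean
make CC="gcc -fsanitize=thread"

# Then rerun the failing scramble command as normal; any detected data races
# are reported on stderr at runtime rather than at compile time.
```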
I should also add that I now believe modern samtools/htslib to be as performant as scramble, if not more performant in some cases. This definitely wasn't the case when Scramble was first written, but over the years I've managed to port over the main speed improvements. So for production work my recommendation is now to just use samtools.
I'm keeping scramble as a test bed for new changes, as it's under my control and I can release things sooner and in my own time frame, but it should probably now be considered the more experimental tool rather than the production one. (That said, I do aim to fix bugs.)
I can reproduce this locally.
For reference, the way to get it to fail quickly is as follows:
The error seen is:

```
Failed to populate reference for id 108
```

Locating reference id 108 in the index and extracting a small container range around it:

```
$ zcat NA12878.final.cram.crai | egrep -n '^108'
75775:108 1 3599 15101747918 565 318869
75776:108 3450 1904 15102067381 645 391560

$ cram_filter -n 75700-75800 NA12878.final.cram _2.cram
```
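The last step is to decode the extracted _2.cram with several threads, which fails intermittently with the same error. Roughly along these lines, where the thread count, the -t/-r options and the GRCh38 FASTA path are my assumptions rather than the exact command used:

```sh
# Decode the extracted slice range with multiple threads.  The failure is
# intermittent, so loop until it trips.  The thread count, -t/-r options and
# the reference path are assumptions, not the exact command used.
while scramble -t 8 -r GRCh38.fa _2.cram out.bam; do :; done
```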
Anyway, the thread sanitizer sadly doesn't report anything, so this is some algorithmic issue rather than incorrect thread locking. Still, reproducing the bug was (hopefully!) the hardest part, and that's now done. Thanks for reporting this.
By bisecting the htslib code (which worked fine on this file) I managed to figure out the specific commit that fixed this issue, and have applied it here.
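For anyone wanting to repeat that kind of hunt, it's the usual git bisect workflow run in reverse, looking for the commit that fixed things rather than broke them. A rough sketch, with placeholder revisions rather than the ones actually used:

```sh
# Find the first htslib commit at which the reproduction above stops failing.
# Revision names are placeholders, not the actual ones used.
cd htslib
git bisect start --term-old=broken --term-new=fixed
git bisect fixed master      # current htslib decodes the file correctly
git bisect broken <old-rev>  # an older revision that still shows the bug
# git then checks out midpoints: rebuild, rerun the reproduction, and mark
# each one "git bisect fixed" or "git bisect broken" until the first fixed
# commit is identified.
```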
Thanks for reporting the bug. At some point I'll need to do a new release, but I don't have a timeline for that currently.
When running Scramble with multiple threads on specific CRAM input, decoding fails non-deterministically with error messages of the form "Failed to populate reference for id <N>", one per affected slice.
(To clarify, the CRAM file is not missing any references, and Scramble is being called with the correct reference file.)
The reference IDs differ for each run, as does the number of missing references (i.e. it’s not always the same for all slices, as in the output above).
The following steps reproduce this issue: run `scramble` with multiple threads on a CRAM input file, supplying an (indexed) reference FASTA file (environment: Ubuntu 18.04.2, x86_64, tried with several versions of GCC ≤ 7.5.0).
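An illustrative invocation, where the file names, thread count and option letters are placeholders/assumptions rather than the exact command used:

```sh
# Illustrative only: paths, thread count and options are placeholders.
scramble -t 4 -r reference.fa input.cram output.bam
# Repeated runs fail non-deterministically with lines such as:
#   Failed to populate reference for id <N>
```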
Debugging reveals that a reference which is still needed has already been evicted through calls to `cram_ref_decr`, and the reference's `length` has been set to 0. Disabling `cram_ref_decr_locked` (by replacing the function with an empty stub) seems to fix the issue.

Some additional observations: