jkbonfield / io_lib

Staden Package "io_lib" (sometimes referred to as libstaden-read by distributions), containing code for reading and writing a variety of bioinformatics / DNA sequence formats.

Scramble GCs references too eagerly and subsequently fails #46

klmr closed this issue 2 years ago

klmr commented 2 years ago

When running Scramble with multiple threads on specific CRAM input, decoding fails non-deterministically with an error message similar to:

Failed to populate reference for id 96
Unable to fetch reference #96 112359..112505
Failed to populate reference for id 96
Unable to fetch reference #96 112334..112505
Failed to populate reference for id 96
Unable to fetch reference #96 112316..112494
Failed to populate reference for id 96
Unable to fetch reference #96 112299..112483
Slice decode failure
Failed to decode sequence

(To clarify, the CRAM file is not missing any references, and Scramble is being called with the correct reference file.)

The reference IDs differ between runs, as does the number of failed lookups (i.e. it isn't always the same reference for all slices, as it happens to be in the output above).

The following steps reproduce this issue:

  1. Download and build io_lib 1.14.14:
    curl -JLO 'https://github.com/jkbonfield/io_lib/releases/download/io_lib-1-14-14/io_lib-1.14.14.tar.gz'
    tar xfz io_lib-1.14.14.tar.gz && rm io_lib-1.14.14.tar.gz
    cd io_lib-1.14.14/
    ./configure
    make -j
    cd ..
  2. Run scramble with multiple threads on a CRAM input file with an (indexed) reference FASTA file:
    io_lib-1.14.14/progs/scramble \
        -t 10 -I CRAM -O BAM -m \
        -r GCA_000001405.15_GRCh38_full_analysis_set.fna \
        NovaSeq-S2-NA12878-GCA_000001405.15-full-recaled-quant.cram \
        output.bam

(Ubuntu 18.04.2, x86_64, tried several versions of GCC ≤7.5.0)
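Because the failure is non-deterministic, a single run may pass. A minimal sketch of a retry loop that makes bad runs easier to catch (same file names as in step 2; it assumes scramble exits non-zero on a decode failure):

    # Repeat the conversion and report which runs fail.
    for i in $(seq 1 20); do
        io_lib-1.14.14/progs/scramble \
            -t 10 -I CRAM -O BAM -m \
            -r GCA_000001405.15_GRCh38_full_analysis_set.fna \
            NovaSeq-S2-NA12878-GCA_000001405.15-full-recaled-quant.cram \
            /dev/null 2>> errors.log \
        || echo "run $i failed"
    done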

Debugging reveals that a reference which is still needed has already been evicted through calls to cram_ref_decr, and the reference’s length has been set to 0. Disabling cram_ref_decr_locked (by replacing the function with an empty stub) seems to fix the issue.


jkbonfield commented 2 years ago

Thanks for the bug report, and ugh, I hate threading errors! Out of curiosity, does samtools view with threading have the same issue? Samtools' CRAM code was copied from io_lib, and I also later migrated io_lib's thread pool across. However, a number of bugs were fixed there which may not have been back-ported to io_lib. The first step is probably to apply the same thread-pool fixes here.
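For comparison, the roughly equivalent samtools command would be something like the following (a sketch; -@ sets the number of extra threads and -T the reference FASTA, using the same placeholder file names as the report above):

    # Multi-threaded CRAM -> BAM conversion via samtools.
    samtools view -@ 10 -b \
        -T GCA_000001405.15_GRCh38_full_analysis_set.fna \
        -o output.bam \
        NovaSeq-S2-NA12878-GCA_000001405.15-full-recaled-quant.cram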

If you can reproduce this with a public CRAM file it would be very helpful. I'll try this myself too when I've finished catching up with a couple weeks of email.

klmr commented 2 years ago

As far as I can tell samtools view works fine on this file, even with multiple threads.

The file uses public data, so I could provide it. Do you have somewhere where I could upload the file?

jkbonfield commented 2 years ago

I no longer have any Sanger-provided storage for anonymous uploads, so I can't host things like this myself. Sorry.

However, for now I'll experiment with some public NA12878 data (aligned against GRCh38). I'm assuming the final step in producing the CRAM was BQSR, in which case it may differ from my own samtools-created CRAMs in which data type goes where, but we'll see.

I'm running some experiments now to try to reproduce it. Note you can also do a full recompile using make CC="gcc -fsanitize=thread", which enables some basic thread-error checking. It will spot things such as the same memory address being written by one thread and read by another without an intervening lock. Sometimes it turns an intermittent (timing-based) error into a 100% failure, which helps reduce the problem to a manageable size and also gives more confidence that a fix is genuine. (It's considerably slower, though.)
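Concretely, that recompile might look like this (adding -g is an extra assumption here, purely so the sanitizer's stack traces are readable):

    # Rebuild everything with ThreadSanitizer instrumentation.
    cd io_lib-1.14.14/
    make clean
    make CC="gcc -fsanitize=thread -g"
    cd ..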

jkbonfield commented 2 years ago

I should also add that I now believe modern samtools/htslib to be as performant as scramble, and in some cases more so. This definitely wasn't the case when Scramble was first written, but over the years I've managed to port over the main speed improvements. So for production work my recommendation is now simply to use samtools.

I'm keeping scramble as a test bed for new changes, since it's under my control and I can release things sooner and on my own time frame, but it should probably now be considered the more experimental tool rather than the production one. (That said, I do aim to fix bugs.)

jkbonfield commented 2 years ago

I can reproduce this locally.

For reference, the way to get it to fail quickly is as follows:

  1. Do one multi-threaded pass and note the error it reports, e.g.:
Failed to populate reference for id 108
  2. Use zcat file.cram.crai (the index is just gzipped text) to find the line numbers corresponding to that reference id:
$ zcat NA12878.final.cram.crai |egrep -n '^108'
75775:108   1   3599    15101747918 565 318869
75776:108   3450    1904    15102067381 645 391560
  3. Use cram_filter to pull out a subset of slices, in this case a few either side of the one reporting an error so we have enough data for concurrent decoding (a scripted version of steps 1-3 is sketched after this list):
$ cram_filter -n 75700-75800 NA12878.final.cram _2.cram
  4. Then we can rapidly and repeatedly test the small _2.cram file. As expected, in this case it's the region of GRCh38 which consists of hundreds of small alt-contigs, with very rapidly changing references. Note I also had to unset the REF_PATH environment variable so that references cannot be found via other means, implying the bug is in the FASTA handling code. It's not something I test so often locally, as we normally access our references via their md5sum instead.
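
For what it's worth, steps 1-3 can be strung together into a small script. This is only a sketch, assuming the failing reference id has been noted from a run and that cram_filter's -n option takes index line-number ranges as shown above:

    # Given a failing reference id, find its .crai lines and cut out a
    # window of slices around them for quick re-testing.
    # (.crai columns, per the CRAM spec: ref id, start, span, container
    # offset, slice offset within container, slice size.)
    id=108
    unset REF_PATH   # force FASTA-based reference lookup, as in step 4
    lines=$(zcat NA12878.final.cram.crai | grep -nE "^$id[[:space:]]" | cut -d: -f1)
    first=$(echo "$lines" | head -n1)
    last=$(echo "$lines" | tail -n1)
    # keep ~75 slices either side so several slices decode concurrently
    cram_filter -n "$((first-75))-$((last+75))" NA12878.final.cram _2.cram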

The thread sanitizer, sadly, doesn't report anything, so this is some algorithmic issue rather than incorrect thread locking. Still, reproducing the bug is (hopefully!) the hardest part, and that's now solved. Thanks for reporting this.

jkbonfield commented 2 years ago

By bisecting the htslib code, which worked on this file, I managed to identify the specific commit that fixed this issue, and I have applied it here.
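Bisecting for a commit that fixed something simply inverts the usual good/bad labels. A sketch of the approach, with <broken-rev> and <fixed-rev> as placeholders for revisions known to fail and to work on this file:

    # In an htslib checkout: "broken" still shows the bug, "fixed" does not.
    git bisect start --term-old=broken --term-new=fixed
    git bisect broken <broken-rev>
    git bisect fixed <fixed-rev>
    # At each step git checks out a midpoint commit: rebuild, re-run the
    # failing conversion, then mark it "git bisect broken" or
    # "git bisect fixed" until the fixing commit is identified.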

Thanks for reporting the bug. At some point I'll need to do a new release, but I don't have a timeline for that currently.