samtools / htslib

C library for high-throughput sequencing data formats
Other
785 stars 447 forks source link

Inconsistency in samtools stats output - cram index issue? #1639

Closed jts closed 1 year ago

jts commented 1 year ago

Hi,

I recently ran into a strange problem where the output of samtools stats differed between two invocations that I expected to be equivalent. I was trying to generate the stats for a single chromosome and initially ran:

samtools stats example.cram chrM

The results reported far fewer reads than expected, so I re-ran the command providing the full range for this chromosome and got the expected number of reads:

samtools stats example.cram chrM:1-19154

By definition these two range specifiers should be the same so I spent a bit of time looking into this and I think I tracked it down to a difference in what cram_index_last and cram_index_last_query return. I've attached a minimal example (sim1.cram) to reproduce the problem. This is simply 500 full-length copies of a mtDNA sequence, with 1% errors introduced.

With this .cram file samtools stats sim1.cram chrM reports 257 total reads for this chromosome but samtools stats sim1.cram chrM:1-19154 reports 486.

Here's a snippet of code to dump the cram_index* returned by the two methods:

#include "htslib/cram/cram.h"
#include "htslib/cram/cram_index.h"
#include "htslib/hts.h"

int main(int argc, char** argv)
{
    if(argc < 2) {
        return 1;
    }

    cram_fd* cram = cram_open(argv[1], "r");
    cram_index_load(cram, argv[1], NULL);

    cram_index* last = cram_index_last(cram, 5 /* tid of chrM */, NULL);
    cram_index* last_query = cram_index_query_last(cram, 5, 19155);

    printf("index from cram_index_last       span: [%d %d] slice: %d len: %d\n", last->start, last->end, last->slice, last->len);
    printf("index from cram_index_query_last span: [%d %d] slice: %d len: %d\n", last_query->start, last_query->end, last_query->slice, last_query->len);

}

On sim1.cram this code prints:

index from cram_index_last       span: [1 19524] slice: 396 len: 5291
index from cram_index_query_last span: [1 19524] slice: 396 len: 5176

which correspond to the first and second (last) entry in the .crai file for this chromosome (tid 5):

5       1       19524   16677   396     5291
5       1       19524   22388   396     5176
-1      0       1       27985   226     245

I don't know the cram format well enough to debug this any further but I hope this points in the right direction.

Jared sim1.zip

jkbonfield commented 1 year ago

Thank you for the very clear bug report. I can reproduce it. That's disappointing as I thought we'd recently bug fixed that function! I wonder if the bug fix itself broke something else.

Yes, this will most definitely point us in the right direction.

jkbonfield commented 1 year ago

FWIW the bug I fixed before wasn't this, but a related issue elsewhere. The cause and fix are both trivial.

Many thanks for the clear bug report and data to reproduce it.

jts commented 1 year ago

Thanks for the quick fix (and shoutout on twitter!), I tested the fix on the large dataset where I first noticed the problem and can confirm I get the expected result now.