samtools / htslib

C library for high-throughput sequencing data formats
Other
789 stars 447 forks source link

sam_itr_regarray not reporting reads #1569

Closed Mesh89 closed 1 year ago

Mesh89 commented 1 year ago

In the file s3://1000genomes/1000G_2504_high_coverage/additional_698_related/data/ERR3988780/HG00512.final.cram the region chr7:106092666-106093543 contains 286 reads.

The following code, correctly outputs all of the starting positions of those reads when argv[1] is the path of HG00512.final.cram:

samFile* fp = sam_open(argv[1], "r");
bam_hdr_t* hdr = sam_hdr_read(fp);
hts_idx_t* idx = sam_index_load(fp, argv[1]);

bam1_t* read = bam_init1();
hts_itr_t* iter = sam_itr_querys(idx, hdr, "chr7:106092666-106093543");
while (sam_itr_next(fp, iter, read) >= 0) {
    std::cout << read->core.pos << std::endl;
}

However, the following code, which I expect to be equivalent, only outputs one read for me

samFile* fp = sam_open(argv[1], "r");
bam_hdr_t* hdr = sam_hdr_read(fp);
hts_idx_t* idx = sam_index_load(fp, argv[1]);

bam1_t* read = bam_init1();
char* region[] = { "chr7:106092666-106093543" };
hts_itr_t* iter = sam_itr_regarray(idx, hdr, region, 1);
while (sam_itr_next(fp, iter, read) >= 0) {
    std::cout << read->core.pos << std::endl;
}
daviesrob commented 1 year ago

It looks like this is a bug in the index look-up code. In particular, cram_index_query_last() returns a location before the start of the region you want, which makes the iterator finish too early. It may be related to the fact that this file has cram containers which maps to a genomic region that's completely contained by an earlier one.

The exact cause isn't known yet, but it should be tracked down soon.