samtools / htslib

C library for high-throughput sequencing data formats
Other
785 stars 447 forks source link

`bcf_sr_seek` behaves differently for VCF.gz and BCF #1623

Closed pd3 closed 1 year ago

pd3 commented 1 year ago

When bcf_sr_seek() is called on non-existent region of VCF vs BCF different return status is returned and the internal state seems to change in a way which makes a subsequent call of bcf_sr_next_line fail. Consider the following example of the VCF/BCF file:

##fileformat=VCFv4.2
##reference=file://some/path/human_g1k_v37.fasta
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  a
19  2   .   C   A   .   .   .   GT  0/0
19  3   .   G   C   .   .   .   GT  0/0
19  4   .   T   C   .   .   .   GT  0/1
19  5   .   A   C   .   .   .   GT  1/1

bcftools view -o test.bcf test.vcf && bcftools index test.bcf
bcftools view -o test.vcf.gz test.vcf && bcftools index test.vcf.gz

and the code

#include <stdio.h> 
#include <stdarg.h> 
#include "htslib/synced_bcf_reader.h"

void error(const char *format, ...) 
{
    va_list ap;
    va_start(ap, format);
    vfprintf(stderr, format, ap);
    va_end(ap);
    exit(-1);
}

int main(int argc, char *argv[])
{   
    bcf_srs_t *sr = bcf_sr_init();
    sr->require_index = 1;
    if ( bcf_sr_add_reader(sr,argv[1])!=1 ) error("Unable to read %s\n",argv[1]);
    fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"17",0));
    fprintf(stderr,"      %d\n",bcf_sr_next_line(sr));
    fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"18",0));
    fprintf(stderr,"      %d\n",bcf_sr_next_line(sr));
    fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"19",0));
    fprintf(stderr,"      %d\n",bcf_sr_next_line(sr));
    bcf_sr_destroy(sr);
    return 0;
} 

gcc -Wall -I. test.c -o test libhts.a -lz -llzma -lbz2 -lz -lm   -lcurl -lcrypto -lpthread

Running on VCF vs BCF we get

./test test.bcf
seek: 0
      1
seek: 0
      1
seek: 0
      1

./test test.vcf.gz
seek: -1
       1
seek: -1
       0
seek:  0
       0

This difference is the inherent cause of https://github.com/samtools/bcftools/issues/1918

pd3 commented 1 year ago

I traced the problem down to https://github.com/samtools/htslib/blob/a59bc5256ee4f539ae47edd575dc1340356d922a/tbx.c#L211-L217

The tbx_readrec() function returns the length of the string on success whereas the corresponding bcf_readrec() returns 0. The iterator then returns this value via hts_itr_next() https://github.com/samtools/htslib/blob/a59bc5256ee4f539ae47edd575dc1340356d922a/hts.c#L3909-L3917

Arguably the iterator should behave consistently across all data types.

pd3 commented 1 year ago

Continuing with this, it seems that bcf_sr_seek is a wrong solution for the task: bcftools consensus wants to query a region similarly to bcf_sr_set_regions, but on the fly. Instead the function attempts to seek to a contig and if not present, a streaming-like mode is entered based on the order of contigs in the header (BCF) or tabix index. In this mode repeated seeks to a non-existent locations result in different outcome from subsequent bcf_sr_next_line calls.

A desired fix should (1) make repeated seek+next_line calls consistent and (2) provide on-the-fly alternative to bcf_sr_set_regions