samtools / htslib

C library for high-throughput sequencing data formats
Other
799 stars 445 forks source link

bcf_hdr_idinfo_exists(hdr,type,int_id) on binary VCFs with gapped BCF_DT_ID tables #1538

Closed freeseek closed 1 year ago

freeseek commented 1 year ago

Let's suppose I generate a binary VCF with a gapped BCF_DT_ID table:

$ echo -e "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" | \
  bcftools view --no-version -Ou -c0 | \
  bcftools annotate --no-version -Ou -x INFO/AC
BCF�##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes",IDX=2>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO

Where now the IDX=1 entry is gone.

Now I generate a simple executable:

$ echo '#include <htslib/vcf.h>
int main() {
  htsFile *in = hts_open("-", "r");
  bcf_hdr_t *hdr = bcf_hdr_read(in);
  fprintf(stderr, "hdr->n[BCF_DT_ID]=%d\n", hdr->n[BCF_DT_ID]);
  for (int int_id = 0; int_id < hdr->n[BCF_DT_ID]; int_id++) {
    fprintf(stderr, "hdr->id[BCF_DT_ID][%d].val=%p\n", int_id, hdr->id[BCF_DT_ID][int_id].val);
    fprintf(stderr, "bcf_hdr_idinfo_exists(hdr, BCF_HL_INFO, %d)=%d\n", int_id, bcf_hdr_idinfo_exists(hdr, BCF_HL_INFO, int_id));
  }
  bcf_hdr_destroy(hdr);
  hts_close(in);
  return 0;
}' > /tmp/main.c
$ gcc /tmp/main.c -o /tmp/main -lhts

If I pipe the previous binary VCF into this executable:

$ echo -e "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" | \
  bcftools view --no-version -Ou -c0 | \
  bcftools annotate --no-version -Ou -x INFO/AC | \
  /tmp/main
hdr->n[BCF_DT_ID]=3
hdr->id[BCF_DT_ID][0].val=0x5617da06f918
bcf_hdr_idinfo_exists(hdr, BCF_HL_INFO, 0)=0
hdr->id[BCF_DT_ID][1].val=(nil)
Segmentation fault (core dumped)

I get that bcf_hdr_idinfo_exists(hdr,type,int_id) causes segmentation fault. This macro is defined in htslib/vcf.h as follows:

    #define bcf_hdr_id2coltype(hdr,type,int_id) (uint32_t)((hdr)->id[BCF_DT_ID][int_id].val->info[type] & 0xf)
    #define bcf_hdr_idinfo_exists(hdr,type,int_id)  ((int_id)>=0 && bcf_hdr_id2coltype((hdr),(type),(int_id))!=0xf)

And the problem is that it tries to access (hdr)->id[BCF_DT_ID][int_id].val without first verifying that it exists. And there is no attempt at verifying that int_id might refer to a gap in the BCF_DT_ID table.

Is this the intended behavior or should the bcf_hdr_idinfo_exists(hdr,type,int_id) macro be amended to first check that (hdr)->id[BCF_DT_ID][int_id].val indeed exists? Or what should be the proper way to loop through all the INFO fields in the header?