samtools / htslib

C library for high-throughput sequencing data formats
Other
813 stars 447 forks source link

bcf_hdr_parse_line() and bcf_hrec_add_key() in vcf.c do not check for INFO and FORMAT keys matching VCF specifications #1065

Open freeseek opened 4 years ago

freeseek commented 4 years ago

If I attempt to add a header string to a VCF header using bcf_hdr_append(), the string gets passed to the function bcf_hdr_parse_line() which parses the key and passes it to bcf_hrec_add_key(). However, in none of these steps the key is checked for consistency with the VCF specification which says that INFO keys must match the regular expression^([A-Za-z][0-9A-Za-z.]*|1000G)$, please note that “1000G” is allowed as a special legacy value.

Ideally, bcf_hdr_parse_line() is the place where this check should take place. There is some code that seems like it should do that in bcf_hdr_parse_line():

        // ^[A-Za-z_][0-9A-Za-z_.]*$
        if (p==q && *q && (isalpha_c(*q) || *q=='_'))
        {
            q++;
            while ( *q && (isalnum_c(*q) || *q=='_' || *q=='.') ) q++;
        }

However, this code gets applied to what is left of the key (i.e., the ID part). So if I was to invoke:

        bcf_hdr_append(hdr, "##INFO=<ID=XXX:1");

The above code would be applied to the ID part of the string which would be valid and indeed the code would add this line to the header and the INFO field in the VCF dictionary structure. This does not seem like the intended behavior.

freeseek commented 2 years ago

Here is code to recreate the issue. Create a minimal main.c file:

$ echo '#include <htslib/vcf.h>
#include <htslib/hts.h>

int main(int argc, char *argv[]) {
  htsFile *fh = hts_open("-", "wv");
  if ( !fh ) {
    fprintf(stderr, "Unable to open output VCF file\n");
    exit(-1);
  }
  bcf_hdr_t *hdr = bcf_hdr_init("w");
  bcf_hdr_append(hdr, "##INFO=<ID=XXX:1,Number=1,Type=Integer");
  if (bcf_hdr_write(fh, hdr) < 0) {
    fprintf(stderr, "Unable to write to output VCF file\n");
    exit(-1);
  }
  bcf_hdr_destroy(hdr);
  hts_close(fh);
  return 0;
}' > main.c

Compile using the current version of HTSlib:

$ HTSDIR="..."
$ gcc -I$HTSDIR main.c $HTSDIR/libhts.a -lz -lm -lbz2 -llzma -lcurl

Run the binary:

$ ./a.out
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=XXX:1,Number=1,Type=Integer>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO

A malformed VCF is easily generated this way as XXX:1 does not match the regular expression ^([A-Za-z][0-9A-Za-z.]*|1000G)$