Open freeseek opened 4 years ago
Here is code to recreate the issue. Create a minimal main.c
file:
$ echo '#include <htslib/vcf.h>
#include <htslib/hts.h>
int main(int argc, char *argv[]) {
htsFile *fh = hts_open("-", "wv");
if ( !fh ) {
fprintf(stderr, "Unable to open output VCF file\n");
exit(-1);
}
bcf_hdr_t *hdr = bcf_hdr_init("w");
bcf_hdr_append(hdr, "##INFO=<ID=XXX:1,Number=1,Type=Integer");
if (bcf_hdr_write(fh, hdr) < 0) {
fprintf(stderr, "Unable to write to output VCF file\n");
exit(-1);
}
bcf_hdr_destroy(hdr);
hts_close(fh);
return 0;
}' > main.c
Compile using the current version of HTSlib:
$ HTSDIR="..."
$ gcc -I$HTSDIR main.c $HTSDIR/libhts.a -lz -lm -lbz2 -llzma -lcurl
Run the binary:
$ ./a.out
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=XXX:1,Number=1,Type=Integer>
#CHROM POS ID REF ALT QUAL FILTER INFO
A malformed VCF is easily generated this way as XXX:1
does not match the regular expression ^([A-Za-z][0-9A-Za-z.]*|1000G)$
If I attempt to add a header string to a VCF header using
bcf_hdr_append()
, the string gets passed to the functionbcf_hdr_parse_line()
which parses the key and passes it tobcf_hrec_add_key()
. However, in none of these steps the key is checked for consistency with the VCF specification which says thatINFO keys must match the regular expression^([A-Za-z][0-9A-Za-z.]*|1000G)$, please note that “1000G” is allowed as a special legacy value.
Ideally,
bcf_hdr_parse_line()
is the place where this check should take place. There is some code that seems like it should do that inbcf_hdr_parse_line()
:However, this code gets applied to what is left of the key (i.e., the
ID
part). So if I was to invoke:The above code would be applied to the
ID
part of the string which would be valid and indeed the code would add this line to the header and the INFO field in the VCF dictionary structure. This does not seem like the intended behavior.