samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

vcf: Handling structured header records with missing IDs in VCF 4.1/4.2 #760

Closed zaeleus closed 1 month ago

zaeleus commented 3 months ago

In VCF 4.3, the concept of structured and unstructured records was introduced, which requires all structured records to include an ID field (§ 1.4 "Meta-information lines" (2022-11-27)). Was this a clarification or a change from previous versions?

VCF 4.1 and 4.2 don't define what header lines are. How are they expected to be parsed? Specifically, I'm interested in how structured header records with missing IDs are supposed to be handled/interpreted.

For example,

$ curl --silent https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/archive_2.0/2017/clinvar_20170516.vcf.gz | gzip -d | head | grep --extended-regexp "^##(fileformat|ID)"
##fileformat=VCFv4.1
##ID=<Description="ClinVar Variation ID">

d-cameron commented 1 month ago

Clarified as requiring ID in VCFv4.3.

Can't really lock down the underspecified 4.1/4.2 specs at this point. How to handle these files will be implementation-dependent.
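As one illustration of what "implementation-dependent" could look like in practice, here is a minimal Python sketch of a lenient header-line parser. It is not taken from the spec or from any existing library: the function name `parse_meta_line`, the `strict` flag, and the `structured-no-id` label are all hypothetical, and the choice to keep ID-less records rather than reject them is only one possible policy.

```python
import re

# "##key=value" meta-information line.
_META_RE = re.compile(r"^##([^=]+)=(.*)$")
# key=value pairs inside <...>; values may be double-quoted with backslash escapes.
_FIELD_RE = re.compile(r'([^,=<>\s]+)=("(?:[^"\\]|\\.)*"|[^,>]*)')


def parse_meta_line(line, strict=False):
    """Parse one VCF header line (hypothetical helper, not a library API).

    strict=True mimics the VCF 4.3 rule that structured records need an ID.
    strict=False keeps ID-less structured records, tagged as such.
    """
    m = _META_RE.match(line.rstrip("\r\n"))
    if m is None:
        raise ValueError("not a meta-information line: %r" % line)
    key, value = m.group(1), m.group(2)

    # Anything not wrapped in <...> is treated as an unstructured record.
    if not (value.startswith("<") and value.endswith(">")):
        return {"key": key, "kind": "unstructured", "value": value}

    fields = {}
    for name, raw in _FIELD_RE.findall(value[1:-1]):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            # Drop the surrounding quotes and undo backslash escapes.
            raw = re.sub(r"\\(.)", r"\1", raw[1:-1])
        fields[name] = raw

    if "ID" not in fields:
        if strict:
            raise ValueError("structured record %r has no ID" % key)
        # Assumed lenient policy: keep the record, but flag the missing ID.
        return {"key": key, "kind": "structured-no-id", "id": None, "fields": fields}

    return {"key": key, "kind": "structured", "id": fields["ID"], "fields": fields}
```

With the ClinVar line from the example above, `parse_meta_line('##ID=<Description="ClinVar Variation ID">')` returns a record whose `kind` is `structured-no-id`, while the same call with `strict=True` raises `ValueError`. A reader could equally choose to treat such lines as unstructured and keep the raw string; the 4.1/4.2 text does not settle which is right.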