samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

VCF: "Genotype fields" vs "FORMAT" and per-sample #738

Closed jkbonfield closed 5 months ago

jkbonfield commented 1 year ago

A minor technicality.

The VCF specification describes 8 fixed fields (CHROM to INFO) followed by "Genotype" data.

This feels off to me as it's describing what the FORMAT/sample data has traditionally encoded rather than describing the actual format of the VCF file. The file format is to have a bunch of keys ("FORMAT") and a set of values per sample. They don't have to encode genotype data at all, and generally most don't.
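To make the keys-plus-values structure concrete, here is a minimal sketch in Python. The function name `parse_sample_columns` and the example record are hypothetical, and padding short sample entries with "." is a choice made in this sketch rather than a statement about any particular parser. Note that the FORMAT keys in the example (DP and AD) carry no genotype at all.

```python
def parse_sample_columns(line):
    """Split one tab-separated VCF data line into fixed fields and per-sample maps."""
    columns = line.rstrip("\n").split("\t")
    fixed, rest = columns[:8], columns[8:]
    if not rest:
        return fixed, []               # no FORMAT column, hence no samples
    keys = rest[0].split(":")          # FORMAT column: colon-separated keys
    samples = []
    for field in rest[1:]:
        values = field.split(":")
        # Assumption of this sketch: values missing at the end become ".".
        values += ["."] * (len(keys) - len(values))
        samples.append(dict(zip(keys, values)))
    return fixed, samples


# Hypothetical record with two samples and no GT key.
record = "chr1\t100\t.\tA\tG\t50\tPASS\tDP=30\tDP:AD\t12:7,5\t18"
fixed, samples = parse_sample_columns(record)
print(samples)  # [{'DP': '12', 'AD': '7,5'}, {'DP': '18', 'AD': '.'}]
```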

Also related to the description of columns, it would be helpful if the specification documented whether the 8 fixed fields are mandatory or not. My initial assumption was that they obviously are, but HTSlib's VCF parser (currently) handles files where a record has e.g. only 4 values. The others get treated as the "missing" value, so it feels like a deliberate mechanism. Picard rejects such data, which is more logical to me. However, the specification doesn't explicitly state that the fixed columns must all be present, even if that feels like the most obvious interpretation.
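The two behaviours described above (fill short records with the missing value, or reject them) can be sketched as one function with a `strict` switch. This is a hypothetical illustration only: it does not reproduce HTSlib's or Picard's actual code, and `read_fixed_fields` and `FIXED_COLUMNS` are names invented for this sketch.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def read_fixed_fields(line, strict=True):
    """Return the 8 fixed fields of a VCF data line as a dict.

    strict=True  rejects records with fewer than 8 columns.
    strict=False pads the absent columns with the missing value ".".
    """
    values = line.rstrip("\n").split("\t")[:8]
    if len(values) < 8:
        if strict:
            raise ValueError(
                f"expected 8 fixed columns, found {len(values)}"
            )
        values += ["."] * (8 - len(values))
    return dict(zip(FIXED_COLUMNS, values))


# A hypothetical record carrying only CHROM, POS, ID and REF.
short_record = "chr1\t100\t.\tA"
lenient = read_fixed_fields(short_record, strict=False)
print(lenient["ALT"], lenient["INFO"])  # . .
```

Calling `read_fixed_fields(short_record)` with the default `strict=True` raises `ValueError` instead.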

jkbonfield commented 1 year ago

> Also related to the description of columns, it would be helpful if the fixed 8 fields documented whether they are mandatory or not.

Doh! Ignore me. While the "Data lines" section just describes fixed columns without being explicit about them being mandatory, the previous "Header line syntax" section does in fact state "8 fixed, mandatory columns". So I just didn't spot it. Apologies for that part of this issue.

Being ultra nit-picky, a related question is whether the whole header line itself is in fact mandatory! Given that data doesn't have to have FORMAT and samples, the header line could be argued to be superfluous in that scenario. Coupled with the fact that we are quite clear in stating that the fileformat line is required (and first), this may imply that when the specification does not state something is required, it is not a hard requirement. Frankly, though, you'd be heroic to assume this, and both htslib and htsjdk sensibly have it as a hard rule.

The only saving grace is that the structure of the file is listed as meta-information lines, a header line, and data lines. However, a file without data lines is valid, so that listing in and of itself doesn't dictate that each of these parts is mandatory. Very minor though!
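For illustration, the three-part structure can be checked with a short Python sketch. It assumes the strict reading discussed above, i.e. that the fileformat line comes first and that the header line is required even when there are no data lines. `check_structure` is a hypothetical name, and the checks are deliberately minimal rather than a full validator.

```python
def check_structure(lines):
    """Report structural problems in a list of VCF lines (no trailing newlines)."""
    problems = []
    if not lines or not lines[0].startswith("##fileformat="):
        problems.append("first line must be ##fileformat")
    header_seen = False
    for number, line in enumerate(lines, start=1):
        if line.startswith("##"):
            # Meta-information lines belong before the header line.
            if header_seen:
                problems.append(f"line {number}: meta-information after header line")
        elif line.startswith("#CHROM"):
            if header_seen:
                problems.append(f"line {number}: second header line")
            header_seen = True
        elif not header_seen:
            problems.append(f"line {number}: data line before header line")
    # Assumption of this sketch: the header line is a hard requirement.
    if not header_seen:
        problems.append("missing #CHROM header line")
    return problems


# A file with no data lines passes as long as the header line is present.
header_only = [
    "##fileformat=VCFv4.3",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
print(check_structure(header_only))               # []
print(check_structure(["##fileformat=VCFv4.3"]))  # ['missing #CHROM header line']
```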