Open jeromekelleher opened 7 months ago
I agree.
Originally we added the vcf_header
attribute to make it possible to losslessly round trip VCF -> Zarr -> VCF.
With the VCF output work we added the ability to generate a VCF header from INFO, FORMAT, and CONTIG fields stored in Zarr - and also use other fields from the vcf_header
attribute, if present. (See https://github.com/pystatgen/sgkit/blob/d32b8714e026b5e0ab49812a87174edbd829b26a/sgkit/io/vcf/vcf_writer.py#L412-L559)
In fact, sgkit can now handle the case where there's no vcf_header
attribute. So perhaps we can just mark it as optional in the spec (or remove entirely)?
I think it's simplest to remove entirely, and plot out some potential ways we can incorporate the missing information more systematically. I think the main thing we're losing is the provenance and reference information, which would be straightforward to add as optional attrs later on.
Otherwise we have to define what the header is for, and what should take precedence in terms of fields that are present in the dataset vs the header.
I think #28, #29 and #30 would be sufficient to remove the need for storing the full header.
More specialised handling of, e.g., ALT header information could follow later on.
It's not clear what the
vcf_header
attribute is for, and how complete it is expected to be, and what people are supposed to do with it.Some fields in the header are clearly redundant (the INFO and FORMAT field definitions, as well as CONTIG) and can/should be auto generated by conversion programs producing VCF from vcf-zarr (an important task)
So, we're actually making it harder for downstream programs to output valid VCF headers of subsets of the data by requiring that the entire thing is stored in the input.
I think it would be better to try and capture the non-redundant stuff in the header that is defined in the spec like
source
,reference
etc as separate attributes