Clarify purpose of the ``vcf_header`` attribute and/or refine?

sgkit-dev / vcf-zarr-spec

VCF Zarr Specification

Apache License 2.0

11 stars 2 forks source link

Clarify purpose of the ``vcf_header`` attribute and/or refine? #15

Open jeromekelleher opened 7 months ago

jeromekelleher commented 7 months ago

It's not clear what the vcf_header attribute is for, and how complete it is expected to be, and what people are supposed to do with it.

Some fields in the header are clearly redundant (the INFO and FORMAT field definitions, as well as CONTIG) and can/should be auto generated by conversion programs producing VCF from vcf-zarr (an important task)

So, we're actually making it harder for downstream programs to output valid VCF headers of subsets of the data by requiring that the entire thing is stored in the input.

I think it would be better to try and capture the non-redundant stuff in the header that is defined in the spec like source, reference etc as separate attributes

tomwhite commented 7 months ago

I agree.

Originally we added the vcf_header attribute to make it possible to losslessly round trip VCF -> Zarr -> VCF.

With the VCF output work we added the ability to generate a VCF header from INFO, FORMAT, and CONTIG fields stored in Zarr - and also use other fields from the vcf_header attribute, if present. (See https://github.com/pystatgen/sgkit/blob/d32b8714e026b5e0ab49812a87174edbd829b26a/sgkit/io/vcf/vcf_writer.py#L412-L559)

In fact, sgkit can now handle the case where there's no vcf_header attribute. So perhaps we can just mark it as optional in the spec (or remove entirely)?

jeromekelleher commented 7 months ago

I think it's simplest to remove entirely, and plot out some potential ways we can incorporate the missing information more systematically. I think the main thing we're losing is the provenance and reference information, which would be straightforward to add as optional attrs later on.

Otherwise we have to define what the header is for, and what should take precedence in terms of fields that are present in the dataset vs the header.

jeromekelleher commented 1 month ago

I think #28, #29 and #30 would be sufficient to remove the need for storing the full header.

More specialised handling of, e.g., ALT header information could follow later on.