sgkit-dev / vcf-zarr-spec

VCF Zarr Specification
Apache License 2.0

VCF-Zarr spec does not support partial phasing following the VCF 4.4 spec #24

Open timothymillar opened 6 days ago

timothymillar commented 6 days ago

The VCF 4.4 spec allows for an initial symbol indicating the phasing of the first allele. For example, /0/1 is a valid genotype. This allows for partially phased diploid genotypes such as |0/1. The current VCF-Zarr spec encodes phasing using a single bool which implicitly assumes either no phasing or complete phasing.
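To make the distinction concrete, here is a minimal sketch of splitting a GT string into allele indices plus one phasing flag per allele. The function name `parse_gt` is hypothetical. The rule for an omitted leading symbol (treated as `/` if any separator in the genotype is `/`, otherwise `|`) is an assumption based on one reading of the 4.4 text and should be checked against the spec.

```python
import re

def parse_gt(gt):
    """Return (alleles, phased) for a GT string such as '|0/1'.

    phased[k] is True when the symbol in front of allele k is '|'.
    """
    # Optional leading phasing symbol (VCF 4.4 style).
    prefix = gt[0] if gt[0] in "/|" else None
    body = gt[1:] if prefix else gt

    # re.split with a capturing group keeps the separators:
    # '0/1' -> ['0', '/', '1']
    parts = re.split(r"([/|])", body)
    alleles = [None if a == "." else int(a) for a in parts[0::2]]
    seps = parts[1::2]

    # Assumption: an omitted prefix is '/' if any separator is '/', else '|'.
    if prefix is None:
        prefix = "/" if "/" in seps else "|"

    phased = [s == "|" for s in [prefix] + seps]
    return alleles, phased

print(parse_gt("|0/1"))     # partially phased diploid
print(parse_gt("0/1|1/2"))  # partially phased tetraploid
```

A single boolean per call cannot represent the `[True, False]` flags that `|0/1` produces here.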

This may also have been an issue in earlier versions of the VCF spec, where a partially phased polyploid could have been encoded (e.g., 0/1|1/2). However, this isn't explicitly allowed in the 4.3 spec AFAICT.

tomwhite commented 6 days ago

We could change call_genotype_phased to have shape (variants, samples, ploidy) to support partial phasing. We could also support a shape of (variants, samples) for backwards compatibility.
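As a rough illustration of how the two shapes could coexist, the NumPy sketch below converts between a per-allele array of shape (variants, samples, ploidy) and the existing per-call array of shape (variants, samples). The helper names and the collapsing rule (a call counts as phased only if every allele is flagged) are assumptions for illustration, not part of the spec.

```python
import numpy as np

# Hypothetical per-allele flags, shape (variants, samples, ploidy).
# Each flag records the phasing symbol in front of that allele.
call_genotype_phased = np.array([
    [[True, True], [False, False]],   # 0|1   and 0/1
    [[True, False], [True, True]],    # |0/1  and 1|1
])

def to_per_call(per_allele):
    """Collapse to (variants, samples): phased only if all alleles are."""
    return per_allele.all(axis=-1)

def to_per_allele(per_call, ploidy):
    """Expand a legacy (variants, samples) array by repeating each flag."""
    return np.repeat(per_call[..., np.newaxis], ploidy, axis=-1)

legacy = to_per_call(call_genotype_phased)
print(legacy.shape)                           # (2, 2)
print(to_per_allele(legacy, ploidy=2).shape)  # (2, 2, 2)
```

Collapsing is lossy: the partially phased call `|0/1` becomes unphased in the legacy view, so a reader could tell the two layouts apart only by the array's number of dimensions.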

tomwhite commented 6 days ago

BTW I just created a vcf-4.4 label for this - there may be other changes we should track.

jeromekelleher commented 6 days ago

I think we should consider adding a call_genotype_phase field of type integer which explicitly assigns a phase (0, ..., ploidy - 1) to each call. This would allow us to add estimated phase to datasets after the fact, rather than requiring a whole new dataset to be created when we run phasing algorithms. Ultimately this is where we want to get to with large biobanks (you could imagine having both call_genotype_phase_beagle and call_genotype_phase_shapeit stored).
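One possible reading of this proposal, sketched below with NumPy, is an integer array with the same shape as call_genotype in which each entry names the haplotype (0, ..., ploidy - 1) that the stored allele is assigned to. That interpretation, the array contents, and the `apply_phase` helper are all assumptions for illustration; the comment above does not pin down the layout.

```python
import numpy as np

# Stored genotypes, shape (variants, samples, ploidy); allele order is arbitrary.
call_genotype = np.array([
    [[0, 1], [1, 0]],
    [[0, 1], [1, 1]],
], dtype=np.int8)

# Hypothetical output of one phasing tool: phase[v, s, k] is the haplotype
# index that allele k of call (v, s) is assigned to.
call_genotype_phase_beagle = np.array([
    [[0, 1], [1, 0]],
    [[1, 0], [0, 1]],
], dtype=np.int8)

def apply_phase(genotype, phase):
    """Reorder alleles so that position h holds the allele on haplotype h."""
    haplotypes = np.empty_like(genotype)
    # Scatter each allele into the slot named by its phase assignment.
    np.put_along_axis(haplotypes, phase.astype(np.intp), genotype, axis=-1)
    return haplotypes

print(apply_phase(call_genotype, call_genotype_phase_beagle))
```

Under this layout the genotype array is never rewritten, so a second array such as call_genotype_phase_shapeit could sit alongside the first and be applied in the same way.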

There's some complexity here with how to interact with the PS and PSL fields I haven't got my head around, though.