Open timothymillar opened 6 days ago
We could change `call_genotype_phased` to have shape (variants, samples, ploidy) to support partial phasing. We could also support a shape of (variants, samples) for backwards compatibility.
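To make the shape change concrete, here is a rough NumPy sketch. The encoding is only an assumption for illustration (entry `[v, s, k]` is the phasing flag of the separator that precedes allele `k`, with `k == 0` being the optional leading symbol), and the helper names `to_legacy_phased` and `from_legacy_phased` are hypothetical, not part of any existing API.

```python
import numpy as np

# Assumed encoding: call_genotype_phased[v, s, k] is True when the separator
# preceding allele k is "|" (k == 0 refers to the optional leading symbol).
call_genotype_phased = np.array(
    [
        [[False, True], [False, False]],  # /0|1, /0/1
        [[True, False], [True, True]],    # |0/1, |0|1
    ],
    dtype=bool,
)


def to_legacy_phased(per_allele):
    """Collapse (variants, samples, ploidy) to (variants, samples).

    A call counts as phased only if every separator between alleles is phased.
    """
    return per_allele[..., 1:].all(axis=-1)


def from_legacy_phased(legacy, ploidy):
    """Expand a (variants, samples) bool array to (variants, samples, ploidy)."""
    return np.broadcast_to(legacy[..., None], legacy.shape + (ploidy,)).copy()


legacy = to_legacy_phased(call_genotype_phased)
expanded = from_legacy_phased(legacy, ploidy=2)
```

Going from the per-allele array to the legacy one loses information (the partially phased `|0/1` call collapses to unphased), so readers that only understand the 2D shape would see a conservative view of the data.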
BTW I just created a vcf-4.4 label for this - there may be other changes we should track.
I think we should consider adding a `call_genotype_phase` field of type integer which explicitly assigns a phase (0, ..., ploidy - 1) to each call. This would allow us to add estimated phase to datasets after the fact, rather than requiring a whole new dataset to be created when we run phasing algorithms. Ultimately this is where we want to get to with large biobanks (you could imagine having both `call_genotype_phase_beagle` and `call_genotype_phase_shapeit` stored).
There's some complexity in how this would interact with the PS and PSL fields that I haven't got my head around, though.
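As a rough illustration of how an integer phase field could sit alongside unchanged genotypes, here is a NumPy sketch. It assumes the field has shape (variants, samples, ploidy) and holds a complete permutation of 0, ..., ploidy - 1 for every call, which is one possible reading of the proposal; it ignores PS and PSL entirely. The function name `apply_phase` and the example values are made up.

```python
import numpy as np

# Unphased genotypes, shape (variants, samples, ploidy).
call_genotype = np.array(
    [
        [[0, 1], [1, 1]],
        [[0, 1], [0, 2]],
    ]
)

# Hypothetical phase assignment from one phasing tool: entry [v, s, k] is the
# haplotype index (0 .. ploidy - 1) that allele k is assigned to.
call_genotype_phase_beagle = np.array(
    [
        [[0, 1], [0, 1]],
        [[1, 0], [0, 1]],
    ]
)


def apply_phase(genotype, phase):
    """Reorder alleles so that position j holds the allele on haplotype j.

    Assumes every call's phase vector is a full permutation of the allele slots.
    """
    out = np.empty_like(genotype)
    np.put_along_axis(out, phase, genotype, axis=-1)
    return out


haplotypes_beagle = apply_phase(call_genotype, call_genotype_phase_beagle)
```

Because `call_genotype` is never modified, a second array such as `call_genotype_phase_shapeit` could be stored next to the first and applied in the same way.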
The VCF 4.4 spec allows for an initial symbol indicating the phasing of the first allele. For example, `/0/1` is a valid genotype. This allows for partially phased diploid genotypes such as `|0/1`. The current VCF-Zarr spec encodes phasing using a single bool, which implicitly assumes either no phasing or complete phasing. This may also have been an issue in earlier versions of the VCF spec, where a partially phased polyploid could have been encoded (e.g., `0/1|1/2`). However, this isn't explicitly allowed in the 4.3 spec AFAICT.
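For illustration, here is a small Python sketch of splitting such GT strings into allele indices plus one phasing flag per allele. The function `parse_gt` is hypothetical, and it makes a simplifying assumption: when the leading symbol is omitted, the first allele's flag is set to False, which may not match how the spec defines the implicit prefix.

```python
import re


def parse_gt(gt):
    """Return (alleles, phased) for a GT string such as "|0/1" or "0/1|1/2".

    phased[k] is True when the symbol directly before allele k is "|".
    Missing alleles (".") are returned as -1.
    """
    # Tokens are either a single separator or a run of non-separator characters.
    tokens = re.findall(r"[/|]|[^/|]+", gt)
    alleles, phased = [], []
    pending = None  # separator seen immediately before the next allele, if any
    for tok in tokens:
        if tok in ("/", "|"):
            pending = tok
        else:
            alleles.append(-1 if tok == "." else int(tok))
            phased.append(pending == "|")
            pending = None
    return alleles, phased


examples = {gt: parse_gt(gt) for gt in ["/0/1", "|0/1", "0|1", "0/1|1/2"]}
```

A single bool per call cannot distinguish `|0/1` from `/0/1`, or describe `0/1|1/2`, whereas the per-allele flags returned here can.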