Closed: aoblebea closed this issue 2 years ago
@henrydavidge what are your thoughts on this representation of phased vs. unphased BGEN data? I believe we have dealt mostly with unphased data in the past.
Thank you @williambrandler. @henrydavidge, have you had a chance to look at this?
@aoblebea we chose to emit the phased probabilities in this format so that the phasing information is not lost. I agree that this does not follow the spec. We may want to revisit this, or at least ensure that the header number type is `.` rather than `G`.
Makes sense, thank you.
Hello,
When converting various BGEN files to VCF with Glow, I noticed that phased probability data does not seem to be in canonical order in the resulting VCF. For example, a diploid, tetra-allelic, phased entry in a BGEN file results in a GP field with 8 (ploidy * alleles) terms in the resulting VCF. That is consistent with how BGEN stores phased probability data, but it is not the VCF canonical order, for which I would expect 10 = (ploidy + alleles - 1) choose (alleles - 1) terms. For a diploid, tetra-allelic, unphased entry, however, the resulting VCF has the expected 10 terms, which is consistent with how BGEN stores unphased probability data. Have I somehow misinterpreted the VCF spec, or is this behavior intentional?
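For reference, the two term counts mentioned above can be checked with a small sketch. The function names `phased_gp_count` and `unphased_gp_count` are hypothetical helpers written for this illustration; they are not part of Glow or any VCF library.

```python
from math import comb


def phased_gp_count(ploidy: int, n_alleles: int) -> int:
    """Number of per-haplotype allele probabilities: ploidy * alleles."""
    return ploidy * n_alleles


def unphased_gp_count(ploidy: int, n_alleles: int) -> int:
    """Number of unordered genotypes: (ploidy + alleles - 1) choose (alleles - 1)."""
    return comb(ploidy + n_alleles - 1, n_alleles - 1)


# Diploid, tetra-allelic case from the question.
print(phased_gp_count(2, 4))    # 8
print(unphased_gp_count(2, 4))  # 10
```

The two counts only coincide in special cases, so a header that declares one value per genotype cannot describe the 8-term phased output for this variant.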
Unphased BGEN as a glow dataframe:
GP field of VCF conversion (HG00141, variant 1:17385): 0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
Phased BGEN as a glow dataframe:
GP field of VCF conversion (HG00141, variant 1:17385): 1.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00
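To show how the two layouts relate, here is a minimal sketch that collapses a diploid phased vector into unphased genotype probabilities in VCF genotype order. It rests on assumptions not confirmed in this thread: that the phased values are laid out haplotype by haplotype (all alleles for the first haplotype, then all alleles for the second), and that the two haplotypes can be treated as independent. The function `phased_to_unphased_gp` is a hypothetical helper, not Glow's implementation, and the conversion discards the phasing information.

```python
from typing import List


def phased_to_unphased_gp(phased: List[float], n_alleles: int) -> List[float]:
    """Collapse diploid per-haplotype allele probabilities into genotype
    probabilities, ordered so that genotype j/k (j <= k) sits at index
    k * (k + 1) / 2 + j, which is the diploid VCF genotype ordering.

    Assumes a haplotype-major layout and independent haplotypes.
    """
    if len(phased) != 2 * n_alleles:
        raise ValueError("expected ploidy * alleles = 2 * n_alleles values")

    hap1 = phased[:n_alleles]
    hap2 = phased[n_alleles:]

    n_genotypes = n_alleles * (n_alleles + 1) // 2
    gp = [0.0] * n_genotypes
    for a in range(n_alleles):
        for b in range(n_alleles):
            # a|b and b|a map to the same unordered genotype.
            j, k = min(a, b), max(a, b)
            gp[k * (k + 1) // 2 + j] += hap1[a] * hap2[b]
    return gp


# Phased GP values reported above for HG00141 at 1:17385.
phased = [1.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00]
print(phased_to_unphased_gp(phased, 4))
# [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Under these assumptions, the result for this sample matches the 10-term unphased GP field shown earlier, with all probability on the 0/1 genotype.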