projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0
263 stars 110 forks source link

How GLOW exports BGEN phased probability data to VCF #441

Closed aoblebea closed 2 years ago

aoblebea commented 2 years ago

Hello,

When converting various BGEN's to VCF with GLOW I noticed that phased probability data does not seem to be in canonical order in the resulting VCF. I.e. a diploid, tetra-allelic, phased entry in a BGEN will result in a GP field with 8 (ploidy * alleles) terms in the resulting VCF, which is consistent with how BGEN stores phased probability data, but not in VCF canonical order, for which I would expect 10 = (ploidy + alleles - 1) choose (alleles - 1) terms. For a diploid, tetra-allelic, unphased entry, however, the resulting VCF has the expected 10 terms, which is consistent with how BGEN stores unphased probability data. Have I somehow misinterpreted the VCF spec, or is this behavior intentional?

Unphased BGEN as a glow dataframe:

+----------+-----+-----+------------+---------------+-----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|contigName|start|end  |names       |referenceAllele|alternateAlleles |genotypes                                                                                                                                              |
+----------+-----+-----+------------+---------------+-----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|1         |12140|12141|[, 1:12141]|C              |[<NON_REF>]      |[{HG00141, false, [0, 0], 2, [1.0, 0.0, 0.0]}, {HG01958, false, [-1, -1], 2, []}, {HG01530, false, [-1, -1], 2, []}]                                   |
|1         |12144|12145|[, 1:12145]|C              |[<NON_REF>]      |[{HG00141, false, [0, 0], 2, [1.0, 0.0, 0.0]}, {HG01958, false, [-1, -1], 2, []}, {HG01530, false, [-1, -1], 2, []}]                                   |
|1         |12277|12278|[, 1:12278]|C              |[<NON_REF>]      |[{HG00141, false, [0, 0], 2, [1.0, 0.0, 0.0]}, {HG01958, false, [-1, -1], 2, []}, {HG01530, false, [-1, -1], 2, []}]                                   |
|1         |17384|17385|[, 1:17385]|G              |[A, T, <NON_REF>]|[{HG00141, false, [0, 1], 2, [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]}, {HG01958, false, [-1, -1], 2, []}, {HG01530, false, [-1, -1], 2, []}]|
+----------+-----+-----+------------+---------------+-----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+

GP field of VCF conversion (HG00141, variant 1:17385): 0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00

Phased BGEN as a glow dataframe:

+----------+-----+-----+------------+---------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|contigName|start|end  |names       |referenceAllele|alternateAlleles |genotypes                                                                                                                                                                                                         |
+----------+-----+-----+------------+---------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1         |12140|12141|[, 1:12141]|C              |[<NON_REF>]      |[{HG00141, true, [0, 0], 2, [1.0, 0.0, 1.0, 0.0]}, {HG01958, true, [-1, -1], 2, []}, {HG01530, true, [-1, -1], 2, []}]                                                                                            |
|1         |12144|12145|[, 1:12145]|C              |[<NON_REF>]      |[{HG00141, true, [0, 0], 2, [1.0, 0.0, 1.0, 0.0]}, {HG01958, true, [0, 0], 2, [1.0, 0.0, 1.0, 0.0]}, {HG01530, true, [-1, -1], 2, []}]                                                                            |
|1         |12277|12278|[, 1:12278]|C              |[<NON_REF>]      |[{HG00141, true, [0, 0], 2, [1.0, 0.0, 1.0, 0.0]}, {HG01958, true, [-1, -1], 2, []}, {HG01530, true, [-1, -1], 2, []}]                                                                                            |
|1         |17384|17385|[, 1:17385]|G              |[A, T, <NON_REF>]|[{HG00141, true, [0, 1], 2, [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]}, {HG01958, true, [2, 2], 2, [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]}, {HG01530, true, [0, 1], 2, [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]}]|
+----------+-----+-----+------------+---------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

GP field of VCF conversion (HG00141, variant 1:17385): 1.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00

williambrandler commented 2 years ago

@henrydavidge what are your thoughts on this representation of phased vs unphased bgen data? I believe we have dealt mostly with unphased data in the past

aoblebea commented 2 years ago

Thank you @williambrandler. @henrydavidge, have you had a chance to look at this?

henrydavidge commented 2 years ago

@aoblebea we chose to emit the phased probabilities in this format so that the phasing information is not lost. I agree that this is not according to spec. We may want to revisit or at least ensure that the header number type is . rather than G.

aoblebea commented 2 years ago

Makes sense, thank you.