Closed Will-Tyler closed 2 weeks ago
Thanks for the feedback! Let me try to answer your questions here.
Why do we just count the 1-based ALT alleles? It's the same thing to count all alleles, zero based isn't it?
I assume we want the result of localizing genotype fields in vcf2zarr to be the same as if the genotype fields were already localized in the input VCF file according to the VCF specification. Therefore, the localized genotype fields produced by vcf2zarr should also follow the VCF specification.
The VCF specification dictates that if any local-allele genotype field is present, then the LAA field must be present. The LAA field is by definition a list of the one-based indices into the ALT variant field that indicates which alleles are observed in the sample at the variant.
I think staying consistent with the VCF specification helps keep the VCF Zarr representation simple. For example, if VCF Zarr data consumers encounter a call_LPL field, then they know to use the call_LAA field to determine which genotypes the values of the call_LPL field correspond to.
What's the reasoning for looking at AD as well?
The AD field might have positive values for alleles that are not called in the GT field. This is demonstrated in the LAA example in the v4.5 specification (page 14) reproduced here:
POS | REF | ALT | FORMAT | sample |
---|---|---|---|---|
1 | G | A,C,T,<*> | GT:AD:PL | 2/2:20,.,30,.,10:90,.,.,80,.,0,.,.,.,.,100,.,110,.,120 |
I think inferring which alleles are observed with the PL field as well as you suggested is a good idea. I will look into this.
Would you mind adding an example file based on the SVCR paper example?
I am happy to add more tests—thanks for the suggestion!
I think the application to PLs is the important thing here. Otherwise, we don't really care about creating LAA at all, so I think we should try adding in LPL pretty soon so we're sure we're going in the right direction.
Good to know—I wanted to create a pull request after implementing the LAA field to get feedback, make sure I was on the right track, and keep the pull request size manageable. I will modify these changes so that vcf2zarr does not introduce local-allele fields unless the PL field is present in the input VCF. Once the LAA field changes look good, I can add the LPL field either as part of this pull request or a new pull request.
As I was thinking about this, I realized that vcf2zarr should not compute and add the LAA field if the LAA field is already present in the input VCF. I need to make some changes for that.
I made some changes, but I'm not quite ready for a new review. I want to try simplifying the compute_LAA_field
method. I will use the request a new review GitHub feature when I am ready.
Were you planning on adding LPL in for this PR, or doing in a follow up? I'm happy either way.
In that case, I would like to add LPL in a follow-up PR. I think this one is large enough as is.
Great, let's merge. Can you rebase please?
Should be rebased now
Thank you for reviewing!
Le 10 juil. 2024 à 20:09, Jerome Kelleher @.***> a écrit :
Merged #266 https://github.com/sgkit-dev/bio2zarr/pull/266 into main.
— Reply to this email directly, view it on GitHub https://github.com/sgkit-dev/bio2zarr/pull/266#event-13461517797, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFTQUPJ4UGZBQ7MRGKUVBD3ZLWIHLAVCNFSM6AAAAABKKA6WJ2VHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJTGQ3DCNJRG43TSNY. You are receiving this because you were mentioned.
Thanks for contributing!
Description
In this pull request, vcf2zarr computes a genotype-level, local alleles field,
LAA
, during the explode step unless specifically told not to. TheLAA
field is a list of one-based indices into the variant-levelALT
field that indicates which alleles are relevant (local) for the current sample. The source of truth for theLAA
field is the GT, AD, and PL genotype fields.vcf2zarr does not try to compute the LAA field if the LAA field is already present in a genotype entry.
With these changes, vcf2zarr computes the
LAA
field by default. To prevent vcf2zarr from introducing the LAA field, the user can specify--local-alleles false
when running the vcf2zarr explode CLI.This pull request makes progress on #185.
Testing
I updated and added several unit tests. I am happy to add more tests.
I was not able to use the bcftools tag2tag plugin as discussed in #185 because generating local-allele fields is not actually implemented in the plugin yet. (See here.)
Discussion
Format of the
call_LAA
fieldTo simplify the format of the
call_LAA
field, vcf2zarr always assigns three-dimensional values to theLAA
field, filling space with the fill value,-2
, where necessary. If none of the entries use alternate alleles, then each entry'sLAA
is[-2]
. When an entry does use alternate alleles, the alternate allele indices come first in sorted order followed by any necessary fill values. For example,[2, 5, 6, -2, -2]
is a validLAA
value.Deciding when to localize
The issue (#185) seems to focus on the PL field. vcf2zarr only introduces the local-allele fields if the
PL
field is present in the VCF file, and the user has not specified to not introduce local-allele fields.Inferring which alleles are used in the PL field
When the ploidy is one, the index of the PL corresponds to the index of the allele.
When the ploidy is two, then the index of the PL value corresponds to a diploid genotype. Given a positive PL value, we want to determine which alleles are in the genotype associated with the PL value. Given a genotype $a/b$, the index of the associated PL value is $$\frac{b(b+1)}{2} + a.$$
Suppose the index of the positive PL value is $n$. To determine $b$, we can momentarily set $a = b$ and solve $$n = \frac{b(b+1)}{2} + b.$$
We have $$2n = b^2 + 3b$$ $$\implies 2n + \frac{9}{4} = b^2 + 3b + \frac{9}{4}$$ $$\implies 2n + \frac{9}{4} = \left( b + \frac{3}{2} \right)^2$$ $$\implies \sqrt{2n + \frac{9}{4}} = b + \frac{3}{2}$$ $$\implies b = \sqrt{2n + \frac{9}{4}} - \frac{3}{2}.$$
Since $a$ can be less than $b$, we take the ceiling: $$b = \left\lceil \sqrt{2n + \frac{9}{4}} - \frac{3}{2} \right\rceil .$$
Determining $a$ is then simple: $$a = n - \frac{b(b+1)}{2}.$$
References
LAA
field as well as the other local-allele genotype fields. The latest version, v4.5, was apparently finalized and published last week.