sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0
23 stars 5 forks source link

Local alleles #266

Closed Will-Tyler closed 2 weeks ago

Will-Tyler commented 3 weeks ago

Description

In this pull request, vcf2zarr computes a genotype-level, local alleles field, LAA, during the explode step unless specifically told not to. The LAA field is a list of one-based indices into the variant-level ALT field that indicates which alleles are relevant (local) for the current sample. The source of truth for the LAA field is the GT, AD, and PL genotype fields.

vcf2zarr does not try to compute the LAA field if the LAA field is already present in a genotype entry.

With these changes, vcf2zarr computes the LAA field by default. To prevent vcf2zarr from introducing the LAA field, the user can specify --local-alleles false when running the vcf2zarr explode CLI.

This pull request makes progress on #185.

Testing

I updated and added several unit tests. I am happy to add more tests.

I was not able to use the bcftools tag2tag plugin as discussed in #185 because generating local-allele fields is not actually implemented in the plugin yet. (See here.)

Discussion

Format of the call_LAA field

To simplify the format of the call_LAA field, vcf2zarr always assigns three-dimensional values to the LAA field, filling space with the fill value, -2, where necessary. If none of the entries use alternate alleles, then each entry's LAA is [-2]. When an entry does use alternate alleles, the alternate allele indices come first in sorted order followed by any necessary fill values. For example, [2, 5, 6, -2, -2] is a valid LAA value.

Deciding when to localize

The issue (#185) seems to focus on the PL field. vcf2zarr only introduces the local-allele fields if the PL field is present in the VCF file, and the user has not specified to not introduce local-allele fields.

Inferring which alleles are used in the PL field

When the ploidy is one, the index of the PL corresponds to the index of the allele.

When the ploidy is two, then the index of the PL value corresponds to a diploid genotype. Given a positive PL value, we want to determine which alleles are in the genotype associated with the PL value. Given a genotype $a/b$, the index of the associated PL value is $$\frac{b(b+1)}{2} + a.$$

Suppose the index of the positive PL value is $n$. To determine $b$, we can momentarily set $a = b$ and solve $$n = \frac{b(b+1)}{2} + b.$$

We have $$2n = b^2 + 3b$$ $$\implies 2n + \frac{9}{4} = b^2 + 3b + \frac{9}{4}$$ $$\implies 2n + \frac{9}{4} = \left( b + \frac{3}{2} \right)^2$$ $$\implies \sqrt{2n + \frac{9}{4}} = b + \frac{3}{2}$$ $$\implies b = \sqrt{2n + \frac{9}{4}} - \frac{3}{2}.$$

Since $a$ can be less than $b$, we take the ceiling: $$b = \left\lceil \sqrt{2n + \frac{9}{4}} - \frac{3}{2} \right\rceil .$$

Determining $a$ is then simple: $$a = n - \frac{b(b+1)}{2}.$$

References

coveralls commented 3 weeks ago

Coverage Status

coverage: 98.868% (-0.03%) from 98.895% when pulling 8547acecd33e3ddb4b32e88cfe3175feab761858 on Will-Tyler:local-alleles into 69ee0e82c03933ff06369764610168a4c8694642 on sgkit-dev:main.

coveralls commented 3 weeks ago

Coverage Status

coverage: 98.91% (+0.02%) from 98.895% when pulling 8547acecd33e3ddb4b32e88cfe3175feab761858 on Will-Tyler:local-alleles into 69ee0e82c03933ff06369764610168a4c8694642 on sgkit-dev:main.

Will-Tyler commented 2 weeks ago

Thanks for the feedback! Let me try to answer your questions here.

Why do we just count the 1-based ALT alleles? It's the same thing to count all alleles, zero based isn't it?

I assume we want the result of localizing genotype fields in vcf2zarr to be the same as if the genotype fields were already localized in the input VCF file according to the VCF specification. Therefore, the localized genotype fields produced by vcf2zarr should also follow the VCF specification.

The VCF specification dictates that if any local-allele genotype field is present, then the LAA field must be present. The LAA field is by definition a list of the one-based indices into the ALT variant field that indicates which alleles are observed in the sample at the variant.

I think staying consistent with the VCF specification helps keep the VCF Zarr representation simple. For example, if VCF Zarr data consumers encounter a call_LPL field, then they know to use the call_LAA field to determine which genotypes the values of the call_LPL field correspond to.

What's the reasoning for looking at AD as well?

The AD field might have positive values for alleles that are not called in the GT field. This is demonstrated in the LAA example in the v4.5 specification (page 14) reproduced here:

POS REF ALT FORMAT sample
1 G A,C,T,<*> GT:AD:PL 2/2:20,.,30,.,10:90,.,.,80,.,0,.,.,.,.,100,.,110,.,120

I think inferring which alleles are observed with the PL field as well as you suggested is a good idea. I will look into this.

Would you mind adding an example file based on the SVCR paper example?

I am happy to add more tests—thanks for the suggestion!

I think the application to PLs is the important thing here. Otherwise, we don't really care about creating LAA at all, so I think we should try adding in LPL pretty soon so we're sure we're going in the right direction.

Good to know—I wanted to create a pull request after implementing the LAA field to get feedback, make sure I was on the right track, and keep the pull request size manageable. I will modify these changes so that vcf2zarr does not introduce local-allele fields unless the PL field is present in the input VCF. Once the LAA field changes look good, I can add the LPL field either as part of this pull request or a new pull request.


As I was thinking about this, I realized that vcf2zarr should not compute and add the LAA field if the LAA field is already present in the input VCF. I need to make some changes for that.

coveralls commented 2 weeks ago

Coverage Status

coverage: 98.882% (-0.01%) from 98.895% when pulling 46cf7bbadd19d3219b129b031c6ab891339fa9a9 on Will-Tyler:local-alleles into 69ee0e82c03933ff06369764610168a4c8694642 on sgkit-dev:main.

coveralls commented 2 weeks ago

Coverage Status

coverage: 98.881% (-0.01%) from 98.895% when pulling 819820a1b4012ec35462ee0a94f8bbc8baad467e on Will-Tyler:local-alleles into 69ee0e82c03933ff06369764610168a4c8694642 on sgkit-dev:main.

coveralls commented 2 weeks ago

Coverage Status

coverage: 98.922% (+0.03%) from 98.895% when pulling 819820a1b4012ec35462ee0a94f8bbc8baad467e on Will-Tyler:local-alleles into 69ee0e82c03933ff06369764610168a4c8694642 on sgkit-dev:main.

Will-Tyler commented 2 weeks ago

I made some changes, but I'm not quite ready for a new review. I want to try simplifying the compute_LAA_field method. I will use the request a new review GitHub feature when I am ready.

coveralls commented 2 weeks ago

Coverage Status

coverage: 98.885% (+0.03%) from 98.855% when pulling 394b2ec35ff2e8dc24b6cf9603d192d34a96b22d on Will-Tyler:local-alleles into 3e2376866b205466b2c456e11ba345cc5ff18ace on sgkit-dev:main.

Will-Tyler commented 2 weeks ago

Were you planning on adding LPL in for this PR, or doing in a follow up? I'm happy either way.

In that case, I would like to add LPL in a follow-up PR. I think this one is large enough as is.

jeromekelleher commented 2 weeks ago

Great, let's merge. Can you rebase please?

Will-Tyler commented 2 weeks ago

Should be rebased now

Will-Tyler commented 2 weeks ago

Thank you for reviewing!

Le 10 juil. 2024 à 20:09, Jerome Kelleher @.***> a écrit :

Merged #266 https://github.com/sgkit-dev/bio2zarr/pull/266 into main.

— Reply to this email directly, view it on GitHub https://github.com/sgkit-dev/bio2zarr/pull/266#event-13461517797, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFTQUPJ4UGZBQ7MRGKUVBD3ZLWIHLAVCNFSM6AAAAABKKA6WJ2VHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJTGQ3DCNJRG43TSNY. You are receiving this because you were mentioned.

jeromekelleher commented 2 weeks ago

Thanks for contributing!