sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0
27 stars 7 forks source link

Local alleles - Add LPL field #273

Closed Will-Tyler closed 3 months ago

Will-Tyler commented 3 months ago

Overview

In this pull request, vcf2zarr computes a genotype-level, local alleles field, LPL, during the explode step unless specifically told not to. The LPL field is related to the LAA field. The LAA field is a list of one-based indices into the variant-level ALT field that indicates which alleles are relevant (local) for the current sample. The LPL field is a list of the Phred-scaled genotype likelihoods for the genotypes associated with the reference allele and the alleles given by the LAA field. The source of truth for the LPL field is the LAA field and the PL field. The PL field is a list of the Phred-scaled genotype likelihoods for all possible genotypes in the variant.

This pull request makes progress on #185. What remains is to prevent the creation and data storage of the PL field when vcf2zarr the local-allele fields are available.

Testing

I update and add unit tests based on the unit tests for the LAA field.

Here is the data from local_alleles.vcf.gz:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 01 02
chr1 3331 . T G . . . GT:DP:GQ:AD:PL 0/1:10:99:6,4:100,0,105 0/0:9:99:9,0:0,100,200
chr1 3335 . G C . . . GT:DP:GQ:AD:PL 0/0:11:99:11,0:0,100,200 1/1:7:22:0,7:154,22,0
chr1 3350 . A C,T . . . GT:DP:GQ:AD:PL 1/1:15:55:0,15,0:1002,55,1002,0,55,1002 0/2:12:99:7,0,5:154,154,0,154,102,102
chr1 3351 . A G,T,C . . . GT:LAA:AD 0/0:1,2,3:0,0,0,0 0/0:2,3:0,0,0,0
chr1 3352 . A G . . . GT:PL 0/0:30,30,30 0/0:30,60,0
chr1 3353 . A G,T,C . . . GT:PL 0:0,0,30,0 1:0,0,0,0
chr1 3354 . A G . . . GT:LAA:LPL:PL 0/0:1:30,30,30:40,40,40 0/0::30,30,30:40,40,40
chr1 3356 . G C,T . . . GT:LAA:PL 0/0:2:10,20,30,40,50,60 1/1:2:10,20,30,40,50,60
chr1 3357 . A G . . . GT:PL 0/0:30,30,30 0/0:30,.,0
chr1 3358 . A G . . . GT:PL 0/0:. 0/0:.
chr1 3359 . A G . . . GT:AD 0:20,10 1:0,20

These should cover the following cases:

References

jeromekelleher commented 3 months ago

I think the test failures are related to #258 - the specific numbers we're checking against here depend on the version of numpy. Looks like we need to explicitly pin a numpy version somewhere.

Will-Tyler commented 3 months ago

I'm not sure that the test failures depend on the NumPy version. When I change my environment to use Python 3.10, the tests pass. When I change the environment to use Python 3.11, I get the failures seen in the CI. In both cases, I used NumPy version 1.26.0.


EDIT: I believe the breaking change is in numcodecs 0.13.0, which includes a change to the zstd compression algorithm (release notes). When I use version 0.13.0, I get the test failure. When I use version 0.12.1, the same tests pass. Version 0.13.0 was released on July 12th (source), the day that I opened this PR.

coveralls commented 3 months ago

Coverage Status

coverage: 98.911% (+0.03%) from 98.886% when pulling 7e887570064ec46626a537485776daf45fe8612f on Will-Tyler:local-alleles into 4366ec1e08efef0c0110c40bc3edca6f6ae0a175 on sgkit-dev:main.

Will-Tyler commented 3 months ago

Should be rebased now with the numcodecs version unpinned. I'm excited to see how well this works in practice and happy to implement needed improvements that you identify. Thanks for reviewing!