Will-Tyler closed this pull request 4 months ago
I think the test failures are related to #258 - the specific numbers we're checking against here depend on the version of numpy. Looks like we need to explicitly pin a numpy version somewhere.
I'm not sure that the test failures depend on the NumPy version. When I change my environment to use Python 3.10, the tests pass. When I change the environment to use Python 3.11, I get the failures seen in the CI. In both cases, I used NumPy version 1.26.0.
EDIT: I believe the breaking change is in numcodecs 0.13.0, which includes a change to the zstd compression algorithm (release notes). When I use version 0.13.0, I get the test failure. When I use version 0.12.1, the same tests pass. Version 0.13.0 was released on July 12th (source), the day that I opened this PR.
Should be rebased now with the numcodecs version unpinned. I'm excited to see how well this works in practice and happy to implement needed improvements that you identify. Thanks for reviewing!
Overview
In this pull request, vcf2zarr computes a genotype-level local-alleles field, LPL, during the explode step unless specifically told not to. The LPL field is related to the LAA field. The LAA field is a list of one-based indices into the variant-level ALT field that indicates which alleles are relevant (local) for the current sample. The LPL field is a list of the Phred-scaled genotype likelihoods for the genotypes associated with the reference allele and the alleles given by the LAA field. The source of truth for the LPL field is the LAA field and the PL field. The PL field is a list of the Phred-scaled genotype likelihoods for all possible genotypes in the variant.

This pull request makes progress on #185. What remains is to prevent vcf2zarr from creating and storing the PL field when the local-allele fields are available.
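To illustrate how LPL relates to PL and LAA as described above, here is a minimal sketch for the diploid case. It is not the code from this pull request: the function names `genotype_index` and `local_pl` are hypothetical, and it assumes the PL values follow the standard VCF genotype ordering (genotype j/k with j <= k sits at index k(k+1)/2 + j).

```python
def genotype_index(j, k):
    """Index of the unordered diploid genotype j/k in VCF PL order (needs j <= k)."""
    return k * (k + 1) // 2 + j


def local_pl(pl, laa):
    """Select the PL entries for genotypes over REF plus the alleles in LAA.

    pl:  Phred-scaled likelihoods for all diploid genotypes, in VCF order.
    laa: one-based indices into ALT, so each value equals the VCF allele number.
    """
    alleles = [0] + sorted(laa)  # allele 0 is the reference
    return [
        pl[genotype_index(alleles[j], alleles[k])]
        for k in range(len(alleles))
        for j in range(k + 1)
    ]


# Two ALT alleles: PL order is 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
pl = [10, 20, 30, 0, 40, 50]
print(local_pl(pl, [2]))  # keeps the entries for 0/0, 0/2, 2/2
```

With only the second ALT allele local to the sample, three of the six PL values are kept, which is where the storage saving would come from on sites with many alternate alleles.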
Testing
I update and add unit tests based on the unit tests for the LAA field.
Here is the data from local_alleles.vcf.gz:

These should cover the following cases:
References