tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

Fix tests for lshmm 0.0.8 #2966

Closed szhan closed 3 months ago

szhan commented 3 months ago

Description

Update tests and requirements to use lshmm 0.0.7. Fixes #2962

PR Checklist:

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.63%. Comparing base (d3c59ba) to head (898c480).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2966 +/- ## ======================================= Coverage 89.63% 89.63% ======================================= Files 29 29 Lines 30176 30176 Branches 5877 5877 ======================================= Hits 27047 27047 Misses 1790 1790 Partials 1339 1339 ``` | [Flag](https://app.codecov.io/gh/tskit-dev/tskit/pull/2966/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2966/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `86.20% <ø> (ø)` | | | [lwt-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2966/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `80.78% <ø> (ø)` | | | [python-c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2966/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `88.72% <ø> (ø)` | | | [python-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2966/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `99.01% <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more.
jeromekelleher commented 3 months ago

Looks like some tests are failing with a new error:

248
        if not np.all(num_alleles > 1):
249
            err_msg = "Some sites have less than two distinct alleles."
250
>           raise ValueError(err_msg)
251
E           ValueError: Some sites have less than two distinct alleles.
szhan commented 3 months ago

The requirements that the first value of recomb. prob. array being zero and all sites in the input data must be variable are removed. See pre-release 0.0.8.

szhan commented 3 months ago

I suspect that the failing tests are caused by not excluding MISSING when counting the number of alleles to scale mutation rates, because all the tests in test_haplotype_matching.py pass when not including the queries with MISSING values.

jeromekelleher commented 3 months ago

I think we can skip the failing diploid - I'm not sure how soon the diploid model will actually be implemented, so let's not worry about it. Just mark the failing tests as skip.

szhan commented 3 months ago

I'm trying to understand how the emission probability is set in the tree-based implementations. In lshmm, when the query allele is MISSING, the emission probability is set to 1.0 instead of whatever the value is when the query and ref. alleles are equal. I think this is causing the failed tests.

jeromekelleher commented 3 months ago

I don't know either to be honest. Let's not get bogged down in it if it's just on the diploid side, there's no corresponding C code (and no plan to implement)

szhan commented 3 months ago

I've updated how the emission probability is set when the query allele is MISSING. All the haplotype matching tests are now passing.

szhan commented 3 months ago

I suspect the tests for the diploid case are failing because the fact that number of alleles can vary across sites is not being accounted for when computing the emission probability matrix. It should be fine if all the sites were biallelic, but some sites are probably invariant from the simulations.