tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

Two locus branch stats python prototype #2934

Closed lkirk closed 5 months ago

lkirk commented 5 months ago

I've re-opened a PR to avoid some issues with build caching. Please see the original discussion on #2912.

Currently, this algorithm creates a matrix of LD, performing a pairwise comparison of all trees in the tree sequence.

This implementation lacks windows/positions, sample sets and polarisation. The outputs of the code produce results in units of branch length, needing to be multiplied by mu^2 or divided by product of the total branch length of the two trees.

This algorithm works by keeping a running sum of the statistic between two trees, updating each time we encounter a branch addition or removal. The tricky part is that we have to remove or add LD contributed by samples that already existed or that will remain under a given node after the addition/removal of branches.

We include a validation against the original formulation of this problem, by including an implementation that was described in McVean

  1. The original formulation computing the covariance of tMRCAs of 2, 3, and 4 samples of individuals on the trees in question. This implementation has several limitations 1) it is very slow and 2) it does not work for trees that are decapitated, as certain samples do not have MRCAs.
lkirk commented 5 months ago

@jeromekelleher I'm still seeing issues with the build cache. Is there anything I can do to help here? I'm not in a huge rush for what it's worth, just wanted to close this out.

codecov[bot] commented 5 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.62%. Comparing base (3349fdd) to head (343a89b).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2934 +/- ## ========================================== - Coverage 89.63% 89.62% -0.01% ========================================== Files 29 29 Lines 30174 30174 Branches 5873 5873 ========================================== - Hits 27046 27043 -3 - Misses 1789 1791 +2 - Partials 1339 1340 +1 ``` | [Flag](https://app.codecov.io/gh/tskit-dev/tskit/pull/2934/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2934/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `86.20% <ø> (ø)` | | | [lwt-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2934/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `80.78% <ø> (ø)` | | | [python-c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2934/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `88.72% <ø> (ø)` | | | [python-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2934/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `99.00% <ø> (-0.03%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more. [see 2 files with indirect coverage changes](https://app.codecov.io/gh/tskit-dev/tskit/pull/2934/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev)
jeromekelleher commented 5 months ago

@benjeffery any advice here?

benjeffery commented 5 months ago

@benjeffery any advice here?

I'm in progress with this at #2935

jeromekelleher commented 5 months ago

We've pushed through a few changes in the CI setup, can you rebase now please @lkirk? Hopefully things will go a bit more smoothly now.

lkirk commented 5 months ago

Thank you @benjeffery for fixing that!

@jeromekelleher This is ready now, I chopped many more slow tests from this. I realized that there are only 2 cores being used on the test servers and I hadn't anticipated how the slow tests would compound in runtime. As written, things were taking ~1.5 hours and I've adjusted things so now only important tests are considered, adding about 2 minutes to the test runtime.

Does this seem reasonable?

jeromekelleher commented 5 months ago

2 minutes is still quite a bit. You could just use the naive version on a handful of small cases to verify things are lining up as expected, and defer doing tests on the full range of examples until the C version is in place?

benjeffery commented 5 months ago

Any slow tests should be marked with @pytest.mark.slow.

lkirk commented 5 months ago

@jeromekelleher Agreed, I've cut a couple more tests to reach a minimal subset and now the runtime is the same -- at least in the number of minutes, I did a quick check of the runtimes on main and compared to these and everything seems to be running in the same amount of time, within a minute at least. Slowest added test runs in ~.2 seconds on my relatively fast computer.

Here's the times from the latest run on main: image

So, it appears that the difference in runtime is sub-minute now. Is that reasonable?

If not, I can just mark as slow.

benjeffery commented 5 months ago

@mergifyio rebase

mergify[bot] commented 5 months ago

rebase

✅ Branch has been successfully rebased