tskit-dev / tskit

Population-scale genomics
MIT License
147 stars 70 forks source link

Two locus branch stats python prototype #2912

Closed lkirk closed 2 months ago

lkirk commented 3 months ago

Description

This is a python prototype for computing two-locus branch statistics. There are a number of things missing from this prototype, notably:

Right now, the code outputs the statistical results in units of branch length, (to get the true stats, one will need to multiply by $\mu^2$ or by the product of the total branch lengths of each compared tree).

Hopefully, I stopped before things got too complicated, though I did integrate this method with bit arrays (soon to be bitsets?). The initial commit for this PR (though undocumented) contains a working algorithm with python sets if that's easier to read.

I implemented this almost exactly as I plan to implement it in C, so we should discuss whether or not this iteration approach / state storage approach works or if there is a more modern way of doing things. I remember discussions about a more modern approach to tree diff iteration in C, but I don't recall where the example is. In any case, it will be easy to update to whatever iteration pattern is preferred.

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 86.65%. Comparing base (e6483fc) to head (1761e05).

:exclamation: Current head 1761e05 differs from pull request most recent head 850600c. Consider uploading reports for the commit 850600c to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2912 +/- ## ========================================== - Coverage 89.62% 86.65% -2.97% ========================================== Files 29 11 -18 Lines 30176 22934 -7242 Branches 5874 4255 -1619 ========================================== - Hits 27044 19874 -7170 + Misses 1793 1754 -39 + Partials 1339 1306 -33 ``` | [Flag](https://app.codecov.io/gh/tskit-dev/tskit/pull/2912/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2912/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `86.21% <ø> (ø)` | | | [lwt-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2912/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `80.78% <ø> (ø)` | | | [python-c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2912/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `88.72% <ø> (ø)` | | | [python-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2912/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `?` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more. [see 18 files with indirect coverage changes](https://app.codecov.io/gh/tskit-dev/tskit/pull/2912/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev)
jeromekelleher commented 3 months ago

Note - the specific comments I made are not particularly relevant given the high-level ones. I just made them as I was going, and thought they might be worth keeping anyway. Feel free to ignore.

lkirk commented 3 months ago

@jeromekelleher Thanks for taking a look. The TreeState object was just a way of storing state so that I could easily track and duplicate it between comparisons of trees. I wasn't exactly sure how to gather that data effectively, but tsk_tree_position_t is exactly what I was looking for. I can imagine replacing what I have with calls from that API.

EDIT: I read through your comment more carefully and saw the bit about the python tree positioning. I'll give that a try.

lkirk commented 3 months ago

@jeromekelleher I did a pass over the code and simplified things a bit. I'm still using my TreeState object for now because it makes copying the state easy. I've switched to using the TreePosition object (this is great, definitely makes edge addition/removal more fail-safe). I've also removed some redundant looping over parents. I tried to make compute_stat_update a bit easier to read, but the bitsets make it complicated.

What do you think about this? I'm imagining a C implementation that uses tsk_tree_position_next to iterate through trees, using the tsk_tree_position_t struct to do what I'm doing in this python prototype -- or is that not an API that I should be consuming? Were you thinking it would be better to use tsk_tree_next? I think it's nice that we can update the stat during edge removal, similar to tsk_treeseq_branch_general_stat.

jeromekelleher commented 3 months ago

What do you think about this? I'm imagining a C implementation that uses tsk_tree_position_next to iterate through trees, using the tsk_tree_position_t struct to do what I'm doing in this python prototype -- or is that not an API that I should be consuming? Were you thinking it would be better to use tsk_tree_next? I think it's nice that we can update the stat during edge removal, similar to tsk_treeseq_branch_general_stat.

I like it, nice and clean. I think your sketch sounds about right. Note that you can now find out the edges that have changes after calling tsk_tree_next, so that we don't end up duplicating quite so much logic around tree transitions. I would probably implement the tsk_treeseq_branch_general_stat using this now, if I was doing it again. However, sometimes you do need to update edge-by-edge, and so maintaining the parent is necessary.

As a side-note, I do want to expose the tsk_tree_position_t to Python at some point, and come up with an efficient way of doing these incremental style algorithms in numba as discussed in #2778. Ultimately, I would like to reach the point where doing stuff like these LD calculations can be achieved efficiently using numba, so that tree sequence algorithms don't have to be implemented in C (tree algorithms can totally be done in numba at the moment). This is a key thing I want to illustrate in the paper.

lkirk commented 3 months ago

What do you think about this? I'm imagining a C implementation that uses tsk_tree_position_next to iterate through trees, using the tsk_tree_position_t struct to do what I'm doing in this python prototype -- or is that not an API that I should be consuming? Were you thinking it would be better to use tsk_tree_next? I think it's nice that we can update the stat during edge removal, similar to tsk_treeseq_branch_general_stat.

I like it, nice and clean. I think your sketch sounds about right. Note that you can now find out the edges that have changes after calling tsk_tree_next, so that we don't end up duplicating quite so much logic around tree transitions. I would probably implement the tsk_treeseq_branch_general_stat using this now, if I was doing it again. However, sometimes you do need to update edge-by-edge, and so maintaining the parent is necessary.

As a side-note, I do want to expose the tsk_tree_position_t to Python at some point, and come up with an efficient way of doing these incremental style algorithms in numba as discussed in #2778. Ultimately, I would like to reach the point where doing stuff like these LD calculations can be achieved efficiently using numba, so that tree sequence algorithms don't have to be implemented in C (tree algorithms can totally be done in numba at the moment). This is a key thing I want to illustrate in the paper.

Great, I'm glad this is converging on something we're happy with. I'm going to clean this up and integrate it with the rest of the prototype code, adding some realistic tests along the way.

On the subject of the C sketch, I suppose there's a couple of things to consider:

  1. It might be more efficient to update as on the fly instead of using tsk_tree_next, but I'm not sure if any of those additional cycles will be relevant given the computational burden of intersecting sets. As we've discussed in the past, sometimes simple outweighs efficient. That said, I'd like to do a quick sketch with tree_next and tree_position_next to see if it really matters.

  2. I'd be interested in working with numba, but I think I'd like to produce an initial cut with C because I can do that quickly and I have a couple of other downstream analyses I'd really like to get to. That said, I'd be happy to follow up on this and help to re-implement some things in numba.

lkirk commented 3 months ago

Okay, things are shaping up here. A couple of notes:

  1. I'm going to do another pass in subsequent PRs to implement positions/sample sets/polarisation.
  2. The "naive_matrix" function is very slow and inefficient. I want to rip it out once I implement the C code. It's there right now as an orthogonal check to the current implementation. However, there are some caveats: 1) it can't cover all of the test cases because it requires MRCAs. 2) It's too slow to be run in the CI right now. I've capped the slowest test at 3 minutes (I hope that's acceptable amount of time. I can also just disable them altogether if that's preferable).

I think this is ready for another round of review, I'm removing the draft status.

lkirk commented 2 months ago

Great, thank you for taking a look. I've squashed, it's ready when you get the chance.

Oh, hm. I'm seeing some cache issues with the tests. -- I couldn't immediately see a way to rerun these.

jeromekelleher commented 2 months ago

Feel free to open a fresh PR if that's easier?

benjeffery commented 2 months ago

Feel free to open a fresh PR if that's easier?

For future reference - the cache is shared between PRs, so this usually doesn't help.