tskit-dev / tskit

Population-scale genomics
MIT License
147 stars 69 forks source link

Specify positions for branch LD matrix #2949

Closed lkirk closed 3 weeks ago

lkirk commented 1 month ago

Introduces the ability to subset the branch LD matrix with genomic positions. This is a little trickier than subsetting the LD matrix with sites because positions can map to duplicate tree indices. We still only allow unique, sorted positions as input. For instance, if we receive [0.1, 0.2, 0.3, 2.5, 2.8] from a user, it may map to [0, 0, 0, 2, 2]. To this end, I've repurposed the check_sites validation code and added some logic for dealing with index repeats.

For now, until I devise some simple optimizations for matrix filling, we're filling a square matrix entry by entry. If we are skipping trees with our specified positions, we must iterate over the entire range regardless because we're keeping a running sum of the statistic and we need the intermediate state regardless of whether or not we're keeping it.

The one tricky part of this implementation was making sure that seeking to an arbitrary position in the tree sequence works properly. We only want to add branches that are valid for the current tree, so I mask out branches that have been removed up to the position we start iterating over when pos.seek_forward is called. I'm not sure what the intended future of seek_forward is, I noticed that there's a warning about this function being in active development. So far, it's working for what's needed here.

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.63%. Comparing base (0913965) to head (a6becd8).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2949 +/- ## ======================================= Coverage 89.63% 89.63% ======================================= Files 29 29 Lines 30176 30176 Branches 5877 5877 ======================================= Hits 27047 27047 Misses 1790 1790 Partials 1339 1339 ``` | [Flag](https://app.codecov.io/gh/tskit-dev/tskit/pull/2949/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2949/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `86.20% <ø> (ø)` | | | [lwt-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2949/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `80.78% <ø> (ø)` | | | [python-c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2949/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `88.72% <ø> (ø)` | | | [python-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2949/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `99.01% <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more.
jeromekelleher commented 1 month ago

The one tricky part of this implementation was making sure that seeking to an arbitrary position in the tree sequence works properly. We only want to add branches that are valid for the current tree, so I mask out branches that have been removed up to the position we start iterating over when pos.seek_forward is called.

While it's not fully tested, I'm fairly sure the seek_forward code is correct. Here is the way it's intended to be used: https://github.com/tskit-dev/tskit/blob/998d710a4737d8794ab5e000c9440c5571cec9cc/python/tests/test_tree_positioning.py#L101

So, you detect which edges are not in the target tree, and skip over them.

However there is a subtle issue about how statistics are computed, which I'm not sure about when doing any form of skipping over. This revolves around how values are propagated up and down the tree. I think the answer is guaranteed to be correct, but you may end up in pathological cases in which the log-time guarantees are not present.

So - maybe it would be best to just naively accumulate your statistics without trying to skip trees/edges?

lkirk commented 1 month ago

@jeromekelleher thanks for pointing me to the proper usage. I have gone ahead and modified the code to properly consume the seek_forward result.

However there is a subtle issue about how statistics are computed, which I'm not sure about when doing any form of skipping over. This revolves around how values are propagated up and down the tree. I think the answer is guaranteed to be correct, but you may end up in pathological cases in which the log-time guarantees are not present.

So - maybe it would be best to just naively accumulate your statistics without trying to skip trees/edges?

Is the concern that the operations are not guaranteed to be performed in log time or that they will be incorrect? I'm certain when we initialize the first tree in a sequence, the stats end up being correct (given proper ordering). And, when we seek across a range, the stats are correct. I think we do want to skip over all trees prior to the range we're considering though, because computing these stats is quite expensive (branch by branch comparisons of samples under nodes being the bulk of the computation).

The algorithm initializes all edges in the first tree in the range, then only looks at subsequent diffs, keeping a running total (which is why we can't skip any trees in between the first and last tree in the range). This is (in our opinion) the simplest way to perform this operation, while performing stats computations on a minimal subset of trees.

I could be misinterpreting what you mean by "naively accumulate your statistics without trying to skip trees/edges". I assumed that meant iterating over the whole tree sequence. This will be quite slow for large tree sequences, especially when we're concerned with a minimal range within. I suppose you could truncate the tree sequence with ts.keep_intervals, but I'm not sure that'd be ideal for some scenarios.

In the C version, I think this will be more simple because I'll be using tsk_tree_seek, which will give me access to all of the edges in the current tree and the diffs as we move to the next tree.

jeromekelleher commented 1 month ago

I'm happy to merge this @lkirk, do you want to do some final tidy-up first?

lkirk commented 4 weeks ago

Hi @jeromekelleher sorry for my delayed response. I did a little cleanup, it's rebased and squashed now. Ready for merging.