tskit-dev / tskit

Population-scale genomics
MIT License
147 stars 69 forks source link

extend path algorithm (extend_edgesV2) #2938

Open hfr1tz3 opened 2 months ago

hfr1tz3 commented 2 months ago

New extension (pun intended) to the extend_edges algorithm now called extend_paths (subject to change). We noticed that certain examples within unary regions would not be extended simply with extend edges. We propose this algorithm to handle examples like the following: For edge path $a \to b \to c$ in tree $T_1$ and edge $a \to d \to c$ in $T_2$, assuming $d\notin T_1$ and $b\notin T_2$ and $t(b)>t(d)$ we should have extended path $a\to b \to d \to c$ in tree $T_2$. Unit tests still need to be built, and the algorithm needs to be cleaned.

We hypothesize that using this alongside extend edges could be the optimal strategy for minimizing edges in tree sequences.

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 86.65%. Comparing base (e7fdbe1) to head (4229bfd). Report is 14 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2938 +/- ## ========================================== - Coverage 89.63% 86.65% -2.98% ========================================== Files 29 11 -18 Lines 30174 22932 -7242 Branches 5873 4254 -1619 ========================================== - Hits 27046 19872 -7174 + Misses 1789 1754 -35 + Partials 1339 1306 -33 ``` | [Flag](https://app.codecov.io/gh/tskit-dev/tskit/pull/2938/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2938/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `86.20% <ø> (ø)` | | | [lwt-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2938/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `80.78% <ø> (ø)` | | | [python-c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2938/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `88.72% <ø> (ø)` | | | [python-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2938/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `?` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more. [see 18 files with indirect coverage changes](https://app.codecov.io/gh/tskit-dev/tskit/pull/2938/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev)
petrelharp commented 1 month ago

Proposal for the basis for a test: here is a definition of what should be extendable by this method; so we can check if there remain no extendable edges (when run until no further changes are made).

A right extendable path from $x$ to $y$ in a tree $T_k$ is a path $x <_k a_1 <_k \cdots <_k a_n <_k y$ (where $<_k$ denotes "below in tree $T_k$" that has the properties:

  1. $x <_{k+1} b1 <{k+1} \cdots <_{k+1} bm <{k+1} y$ for some $b_i$ (the endpoints of the path are in the next tree)
  2. for $1 < j < n$, $aj$ is not in $T{k+1}$, and for $1 < j < m$, $bj$ is not in $T{k}$ (none of the intermediate nodes from the two paths are in the other tree).

The nodes $a_i$ are right extendable, and the nodes $b_i$ are left extendable.

Why is this the right definition? Well, first it's clear we can't extend nodes that are in the next tree. Next, suppose that there is a $bj$ on the path from $x$ to $y$ in $T{k+1}$ that is also in $T_k$, but not on the path from $x$ to $y$. If $b_j <_k x$ then we can take $b_j$ as the other endpoint of the extendable path instead of $y$. If $b_j \nless_k x$, then that implies ancestry above $b_j$ was inherited along a different path back to the time of $x$, so we can't use inheritance from $x$ as a criteria for extending. A similar argument holds if $y <_k b_j$ or and $y \nless_k b_j$.

petrelharp commented 1 month ago

I've added two things:

  1. a simple check based on the previous comment for whether a ts is no longer extendable; so if our algorithm terminates before max_iter, then that test should agree; and
  2. an implementation of extend paths that just rebuilds the whole tree sequence after every tree, so is a lot simpler. (I called it "naive", but it's not actually simple enough to be 'naive'; I had to deal with quite a few fiddly things for it.) It's real slow, so can't be used on anything at all big.

Next step: compare it to the non-naive version.