tskit-dev / tskit

Population-scale genomics
MIT License
147 stars 70 forks source link

Support all-nodes matching in LSHMM #2861

Open jeromekelleher opened 9 months ago

jeromekelleher commented 9 months ago

Currently a WIP, but the idea is to support all-nodes matching a-la tsinfer in the LS HMM.

It turns out that this is significantly different to the "samples-only" approach that is currently implemented here, because the interpretation of the likelihoods per-tree node is different. For samples only (and especially the "leaf samples only" version we have currently implemented) the internal node likelihood values are really just for compression, and there's no actual semantics to their values. Then, using standard parsimony algorithms works well because the arbitrary choices made to do compression are perfectly OK.

However, when each node of the tree has a well-defined likelihood that we want to preserve, parsimony is much more difficult.

I think the main difference then is how we compress the likelihoods, and so for now I've just put in a quick hack to do both approaches by having different compression methods for each. We could probably just use the "tsinfer-like" compression approach for samples only, as well, but I wanted to keep the parsimony version around for a while in case it's much more performant.

Other than that, things seem to be working OK, but there are definitely some tricky issues to resolve along the way, which I'll keep plugging at. The plan is to provide an option to the API at the top-level though, as match_all_nodes or match_samples. I think this is flexible enough, given that you can simplify down to specific subsets of nodes if you want.

cc @szhan @astheeggeggs

codecov[bot] commented 7 months ago

Codecov Report

Merging #2861 (b3366cf) into main (813f4ed) will decrease coverage by 0.02%. The diff coverage is n/a.

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/tskit-dev/tskit/pull/2861/graphs/tree.svg?width=650&height=150&src=pr&token=7RoAMA56V2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev)](https://app.codecov.io/gh/tskit-dev/tskit/pull/2861?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) ```diff @@ Coverage Diff @@ ## main #2861 +/- ## ========================================== - Coverage 89.69% 89.68% -0.02% ========================================== Files 30 30 Lines 30159 30159 Branches 5860 5856 -4 ========================================== - Hits 27052 27048 -4 - Misses 1778 1780 +2 - Partials 1329 1331 +2 ``` | [Flag](https://app.codecov.io/gh/tskit-dev/tskit/pull/2861/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2861/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `86.09% <ø> (ø)` | | | [lwt-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2861/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `80.78% <ø> (ø)` | | | [python-c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2861/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `67.89% <ø> (ø)` | | | [python-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2861/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `98.90% <ø> (-0.03%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more. [see 3 files with indirect coverage changes](https://app.codecov.io/gh/tskit-dev/tskit/pull/2861/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) ------ [Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/tskit-dev/tskit/pull/2861?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://app.codecov.io/gh/tskit-dev/tskit/pull/2861?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). Last update [813f4ed...b3366cf](https://app.codecov.io/gh/tskit-dev/tskit/pull/2861?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev).