tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
170 stars 84 forks source link

modify checks arg lik #2232

Closed GertjanBisschop closed 6 months ago

GertjanBisschop commented 6 months ago

Slight modification to the polytomy check when computing likelihoods. There are a couple of test cases in the current test suite that include examples with multiple roots (within a single local tree). Thus far these did not fail as they were always limited to two roots at most. These checks could probably be a bit more sophisticated as there are many cases where a decapitated arg could easily be associated with a well-defined likelihood value. @JereKoskela, input welcome.

codecov[bot] commented 6 months ago

Codecov Report

Merging #2232 (cfe1d95) into main (71cdb16) will increase coverage by 0.00%. The diff coverage is 100.00%.

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/tskit-dev/msprime/pull/2232/graphs/tree.svg?width=650&height=150&src=pr&token=1uZQV1KMkU&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev)](https://app.codecov.io/gh/tskit-dev/msprime/pull/2232?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) ```diff @@ Coverage Diff @@ ## main #2232 +/- ## ======================================= Coverage 91.52% 91.52% ======================================= Files 20 20 Lines 11334 11337 +3 Branches 2302 2304 +2 ======================================= + Hits 10373 10376 +3 Misses 523 523 Partials 438 438 ``` | [Flag](https://app.codecov.io/gh/tskit-dev/msprime/pull/2232/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [C](https://app.codecov.io/gh/tskit-dev/msprime/pull/2232/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `91.52% <100.00%> (+<0.01%)` | :arrow_up: | | [python](https://app.codecov.io/gh/tskit-dev/msprime/pull/2232/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `98.70% <100.00%> (+<0.01%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files](https://app.codecov.io/gh/tskit-dev/msprime/pull/2232?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [msprime/likelihood.py](https://app.codecov.io/gh/tskit-dev/msprime/pull/2232?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#diff-bXNwcmltZS9saWtlbGlob29kLnB5) | `100.00% <100.00%> (ø)` | | ------ [Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/tskit-dev/msprime/pull/2232?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://app.codecov.io/gh/tskit-dev/msprime/pull/2232?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). Last update [71cdb16...cfe1d95](https://app.codecov.io/gh/tskit-dev/msprime/pull/2232?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev).
JereKoskela commented 6 months ago

It has been two years since I've really looked at this, but I don't think multiple roots should cause any problems in the likelihood function implementation. Was the point of your update in #2107 that the existing test was incorrectly picking up multi-root trees as error cases @GertjanBisschop? If so, then I think the right fix is to just tweak the test to not do that, without catching ARGs with multiple roots with a new error message. Or am I missing something?

GertjanBisschop commented 6 months ago

I was mainly unsure about what happens when we add in nodes by decapitating for example. Then we have a bunch of floating lineages essentially not ending with a common ancestor event. What is the likelihood of such a lineage given that the likelihood computation, from my understanding, checks whether given a coalescence and recombination rate what the next out of those two events is, and what the likelihood is associated with that.

JereKoskela commented 6 months ago

You're right: floating lineages ending in a degree 1 root would cause problems. Roots with degree 2 caused by common ancestor events are already handled correctly. Is there a good way to detect that distinction?

Edit: It would be pretty easy to add support for the situation where the ARG has just been deterministically stopped before all roots have been reached, causing all remaining extant lineages to end in degree 1 roots at exactly the same time. We'd just use the usual likelihood function until the last recombination or common ancestor node, and then include one more factor for the probability that nothing else happens until the end time. The only thing I'm not sure of is how easy it would be for the function to detect this case.

If there can be floating lineages ending at a variety of times for no apparent reason, then I'm not sure what the interpretation would be and we'd probably want to just return an error.

JereKoskela commented 6 months ago

If I'm reading the new test right, then I think there is an issue we need to solve before merging. It looks like this likelihood will return a multiple roots error for any ARG with more than one root. That will include nearly all msprime realisations, even with full_arg = True.

Before merging, we should get this to a stage where multiple common ancestor roots don't throw an error. Else we're rendering the likelihood function effectively unusable.

GertjanBisschop commented 6 months ago

Yep was about to say the same. I agree with @JereKoskela. Modifying the computation to say that we stopped before another event happened is the way to go. I'll create an issue.

GertjanBisschop commented 6 months ago

This will return an error in the presence of any local tree with more than one root. This root does not need to be the same across local trees, so the graph does not need to have a single root. Or am I misunderstanding your comment?

JereKoskela commented 6 months ago

Ok, that sounds better. I was misunderstanding. In that case let's merge and continue in the issue you just opened.