tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0

add tree fixation using newick string and TreeSequence object #2276

Open not-a-feature opened 5 months ago

not-a-feature commented 5 months ago

This PR adds an option (--ce-from-nwk) to provide a tree in Newick format that is respected during simulation (Hudson model).

The aim is to make sure that the given tree is present in the output while regular simulation (gene conversion and recombination) is still possible. The main idea is that each segment carries an origin set (initially only the id of its leaf), which is passed along and merged with the origin set of the other segment at each coalescence event. When a regular event occurs, the algorithm checks whether the selected segment is needed later; if so, the event is ignored.
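The origin-set bookkeeping described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `leaf_origins`, `merge_origins`, `is_needed_later` and the `pending` set are hypothetical names, and the rule used for "needed later" (the segment's origins are a proper subset of a fixed clade that has not been formed yet) is an assumption.

```python
from typing import FrozenSet, Iterable, Set

Origins = FrozenSet[str]


def leaf_origins(leaf_id: str) -> Origins:
    """At the start, a segment only knows the id of its own leaf."""
    return frozenset([leaf_id])


def merge_origins(a: Origins, b: Origins) -> Origins:
    """At a coalescence event the origin sets of both segments are merged."""
    return a | b


def is_needed_later(origins: Origins, pending_clades: Iterable[Origins]) -> bool:
    """Assumed rule: a segment is still needed if some fixed clade that has
    not been formed yet strictly contains its origins."""
    return any(origins < clade for clade in pending_clades)


# Hypothetical fixed tree ((A,B),(C,D)): clades that still have to be formed.
pending: Set[Origins] = {frozenset("AB"), frozenset("CD"), frozenset("ABCD")}

a, b = leaf_origins("A"), leaf_origins("B")
# A recombination or gene conversion event that selects A's segment would be
# ignored here, because the fixed coalescence of A and B is still outstanding.
assert is_needed_later(a, pending)

ab = merge_origins(a, b)
pending.discard(ab)  # the (A,B) coalescence has happened
assert is_needed_later(ab, pending)  # still needed for the root clade

abcd = merge_origins(ab, frozenset("CD"))
pending.discard(frozenset("CD"))
pending.discard(abcd)
assert not is_needed_later(abcd, pending)  # nothing left to protect
```

Using frozen sets keeps the merge a plain set union and makes the "is this lineage part of a clade we still owe" check a subset test.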

Changes:

The following structural changes will be made:

General

Segment class

Population class

Simulator class

Example:

The command

py algorithms.py --recombination-rate 0 --gene-conversion-rate 0 1 --sequence-length 10000 10 out --discrete --ce-events "((((A:0.005, B:0.005):0.01, (C:0.005, D:0.005):0.01):0.1,((E:0.005, F:0.005):0.01, (G:0.005, H:0.005):0.01):0.01):0.01, (I:0.005, J:0.005):0.01)" 

will produce the following tree:

[Figures: the resulting tree rendered with ts.draw_text() and with tskit_arg_visualizer]

Even though the first coalescence events (at nodes 10-14) should happen at the same time (0.005), they are slightly postponed because no two events can happen at the same time.
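One way to picture this postponement is the small sketch below. It is an assumption about the behaviour, not the PR's implementation: `postpone_ties` and its `epsilon` parameter are hypothetical, and the actual offset used by the simulator may differ.

```python
from typing import Iterable, List


def postpone_ties(times: Iterable[float], epsilon: float = 1e-9) -> List[float]:
    """Return the event times in increasing order, pushing any time that
    collides with (or falls behind) the previous one slightly later."""
    result: List[float] = []
    for t in sorted(times):
        if result and t <= result[-1]:
            # Hypothetical tie-break: shift the event just past the previous one.
            t = result[-1] + epsilon
        result.append(t)
    return result


# Five coalescences that nominally all happen at time 0.005.
requested = [0.005] * 5
actual = postpone_ties(requested)

# Every event keeps its order but gets a distinct, slightly later time.
assert all(earlier < later for earlier, later in zip(actual, actual[1:]))
```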

Todo / Open Tasks

codecov[bot] commented 5 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 91.30%. Comparing base (0b7e04d) to head (9fd4538).

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #2276   +/-   ##
=======================================
  Coverage   91.30%   91.30%
=======================================
  Files          20       20
  Lines       11873    11873
  Branches     2421     2421
=======================================
  Hits        10841    10841
  Misses        563      563
  Partials      469      469
```

| [Flag](https://app.codecov.io/gh/tskit-dev/msprime/pull/2276/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | |
|---|---|---|
| [C](https://app.codecov.io/gh/tskit-dev/msprime/pull/2276/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `91.30% <ø> (ø)` | |
| [python](https://app.codecov.io/gh/tskit-dev/msprime/pull/2276/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `98.71% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more.


jeromekelleher commented 5 months ago

This looks very cool @not-a-feature, thanks! I'll need a bit of time to digest, and I think it would probably help to have a call about it. @fbaumdicker - I assume you're involved here somewhere?

fbaumdicker commented 5 months ago

Yes, I am. @not-a-feature and I are incorporating a) mutation models representing gene presence-absence evolution, b) fixed clonal backbone phylogenies (this PR), and c) simulating gene transfer instead of conversion. A call would be great. Let's find a time in Slack.

not-a-feature commented 5 months ago

As discussed, loading the common ancestor events from a TreeSequence file is now supported. Currently it is only tested on well-defined trees without gene conversion or recombination.

Either use the --ce-from-ts option to provide a file path, or call Simulator(..., coalescent_events_ts) and provide a TreeSequence object directly.

It raises an error if it is used in combination with --ce-from-nwk.
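For illustration, one common way to get this kind of either/or behaviour on the command line is an argparse mutually exclusive group. The sketch below is an assumption about how it could be wired up, not the code in this PR: only the two option names come from this thread, while `build_parser`, the help strings, and the example file name are hypothetical. Note that argparse reports the conflict by printing a usage error and exiting rather than raising a regular exception.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that accepts at most one backbone source."""
    parser = argparse.ArgumentParser(prog="algorithms.py")
    group = parser.add_mutually_exclusive_group()
    group.add_argument(
        "--ce-from-nwk",
        help="fixed coalescence events given as a Newick string",
    )
    group.add_argument(
        "--ce-from-ts",
        help="fixed coalescence events loaded from a tree sequence file",
    )
    return parser


# Supplying a single source parses normally.
args = build_parser().parse_args(["--ce-from-ts", "backbone.trees"])
assert args.ce_from_ts == "backbone.trees"
assert args.ce_from_nwk is None

# Supplying both makes argparse print a usage error and exit.
try:
    build_parser().parse_args(
        ["--ce-from-nwk", "(A:1,B:1);", "--ce-from-ts", "backbone.trees"]
    )
except SystemExit:
    print("both options given: rejected")
```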

jeromekelleher commented 5 months ago

I was thinking more about providing this information as part of the initial state, rather than as something extrinsic. So, we provide the "backbone" of the simulation pre-done as some nodes and edges, and (importantly) these edges are not subsequently altered. Is this possible, or does that break the model?