tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
177 stars 88 forks source link

Class design for "selective sweep" concept. #1780

Open molpopgen opened 3 years ago

molpopgen commented 3 years ago

This issue is motivated by #1762

The current class for modeling the structured coalescent is msprime.SweepGenicSelection. This class handles the following scenario:

The init arguments that matter most are start/end frequencies and s.

Moving forwards, there are many more modeling possibilities:

So, at the very least, we want the init method to account for:

We also need a name, bearing in mind the assumptions as I understand them:

  1. The Hudson model (as opposed to DWTF or others).
  2. The trajectory applies to a single site and a mutation that arose in a single copy at some point in the past. (In other words, we are not allowing for the other flavor of sweeps from standing variation by which the beneficial allele was present in multiple copies of different evolutionary origin at the onset of selection. Nor are we allowing recurrent mutation to the beneficial allele during the "selective phase" of the simulation.)

I've likely missed a few things.

cc @andrewkern @jeromekelleher

molpopgen commented 3 years ago

One thing I forgot: allowing start == end along with some concept of "forever", allowing the strong balancing selection scenario of Kaplan, Darden, and Hudson?

andrewkern commented 3 years ago

this is great @molpopgen! i'm totally game to work on implementing these things.

while dealing with dominance is important eventually, i think the first thing to knock out is the selection from standing variation scenario to deal with the issue in #1762

all this should amount to is implementing drift/neutral trajectories which I have already done. I just need to finish up the testing and get a PR together.

another big priority for me is to get the population size changes done and incorporating the rejection sampling of trajectories that i've done in discoal.

For naming-- I'd suggest we set up models like "hard_sweep" , "recurrent_hard_sweep", "soft_sweep", etc. How does that strike you?

jeromekelleher commented 3 years ago

Sounds great to me @molpopgen, thanks a million for putting this together.

molpopgen commented 3 years ago

Naming is tricky because it is related to what's under the hood, at least a bit:

But all of the above is related to what the C would look like under all of this.

No matter what, we want the defaults to reflect common use cases, allowing you to do things like remake a figure from a 2002 paper in 1/2 screen of Python code.

andrewkern commented 3 years ago

No matter what, we want the defaults to reflect common use cases, allowing you to do things like remake a figure from a 2002 paper in 1/2 screen of Python code.

let's do it. want to review some code for me @molpopgen as I pull the bits together?

molpopgen commented 3 years ago

let's do it. want to review some code for me @molpopgen as I pull the bits together?

Yeah, definitely. I'm quite interested, because I had a near-complete diploSHIC workflow in place using this, until #1762 got in the way!

andrewkern commented 3 years ago

diploSHIC training is gonna fly once this is done.

molpopgen commented 3 years ago

diploSHIC training is gonna fly once this is done.

Yeah, in my little test work flows, it was about 30% faster. There's considerable file I/O + time spend in allel that dominates still.

andrewkern commented 3 years ago

would be great to side step that all using tskit stats....

molpopgen commented 3 years ago

would be great to side step that all using tskit stats....

Yes, the challenge there is that many of the stats are haplotype-y, and many such stats have an ad-hoc flavor to them. The whole set of D-like stats can be don using the existing code, but for many of the others, there's some thinking that needs doing.