Class design for "selective sweep" concept.

molpopgen commented 3 years ago

This issue is motivated by #1762

The current class for modeling the structured coalescent is msprime.SweepGenicSelection. This class handles the following scenario:

Given a start/end frequency (e.g., moving backwards in time so that end < start), we generate a trajectory using the conditional diffusion methods described in Coop and Griffiths (2004). The trajectory of a mutation with fitnesses 1, 1+s/2 and 1+s for the 3 diploid genotypes, conditional on it moving from frequency start to end.
That trajectory is associated with a genomic position.
Given that trajectory, we do the operations described here to generate the history of the two parts of the population--lineages linked to the site and those that have recombined away from it (or weren't linked in the first place if start frequency is low enough).

The init arguments that matter most are start/end frequencies and s.

Moving forwards, there are many more modeling possibilities:

Handling arbitrary dominance, which amounts to a new trajectory-making function.
Adding an option that guarantees that the model end at a frequency 1/(2N). (Not guaranteeing this is the cause of #1762.) We want this to be optional so that one could chain together multiple sweep models to do "stranger" things like stitch together trajectories where s and/or dominance vary over time. By default, we probably want this feature to be on, reflecting the most common anticipated use case.
Allowing for population size changes, and possibly migration of the beneficial allele, thus getting closer to feature parity with discoal.

So, at the very least, we want the init method to account for:

Dominance, via an h kwarg
Guaranteeing that the trajectory goes down to a very small frequency, a.k.a "1/2N".

We also need a name, bearing in mind the assumptions as I understand them:

The Hudson model (as opposed to DWTF or others).
The trajectory applies to a single site and a mutation that arose in a single copy at some point in the past. (In other words, we are not allowing for the other flavor of sweeps from standing variation by which the beneficial allele was present in multiple copies of different evolutionary origin at the onset of selection. Nor are we allowing recurrent mutation to the beneficial allele during the "selective phase" of the simulation.)

I've likely missed a few things.

cc @andrewkern @jeromekelleher

molpopgen commented 3 years ago

One thing I forgot: allowing start == end along with some concept of "forever", allowing the strong balancing selection scenario of Kaplan, Darden, and Hudson?

andrewkern commented 3 years ago

this is great @molpopgen! i'm totally game to work on implementing these things.

while dealing with dominance is important eventually, i think the first thing to knock out is the selection from standing variation scenario to deal with the issue in #1762

all this should amount to is implementing drift/neutral trajectories which I have already done. I just need to finish up the testing and get a PR together.

another big priority for me is to get the population size changes done and incorporating the rejection sampling of trajectories that i've done in discoal.

For naming-- I'd suggest we set up models like "hard_sweep" , "recurrent_hard_sweep", "soft_sweep", etc. How does that strike you?

jeromekelleher commented 3 years ago

Sounds great to me @molpopgen, thanks a million for putting this together.

molpopgen commented 3 years ago

Naming is tricky because it is related to what's under the hood, at least a bit:

We could have a single class, msprime.HudsonStructuredCoalescentWithASingleTrajectory. This is explicit, but there's a good chance of lots of kwargs, all possible subsets of which cause confusion. But the plus side is a single point of failure when it comes to checking input.
We probably want to avoid inheritance: class msprime.HardSweep(msprime._HudsonStructuredBlahBlahBlah). That gets us into the super() pickling mess and some other confusion.
So, we could have lots of individual "sweep model classes", with simple init functions, and each class holds an instance of the richer _HudsonStructuredThing.

But all of the above is related to what the C would look like under all of this.

No matter what, we want the defaults to reflect common use cases, allowing you to do things like remake a figure from a 2002 paper in 1/2 screen of Python code.

andrewkern commented 3 years ago

No matter what, we want the defaults to reflect common use cases, allowing you to do things like remake a figure from a 2002 paper in 1/2 screen of Python code.

let's do it. want to review some code for me @molpopgen as I pull the bits together?

molpopgen commented 3 years ago

let's do it. want to review some code for me @molpopgen as I pull the bits together?

Yeah, definitely. I'm quite interested, because I had a near-complete diploSHIC workflow in place using this, until #1762 got in the way!

andrewkern commented 3 years ago

diploSHIC training is gonna fly once this is done.

molpopgen commented 3 years ago

diploSHIC training is gonna fly once this is done.

Yeah, in my little test work flows, it was about 30% faster. There's considerable file I/O + time spend in allel that dominates still.

andrewkern commented 3 years ago

would be great to side step that all using tskit stats....

molpopgen commented 3 years ago

would be great to side step that all using tskit stats....

Yes, the challenge there is that many of the stats are haplotype-y, and many such stats have an ad-hoc flavor to them. The whole set of D-like stats can be don using the existing code, but for many of the others, there's some thinking that needs doing.

tskit-dev / msprime

Class design for "selective sweep" concept. #1780