tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
178 stars 87 forks source link

Decouple trajectory from sweeps #1813

Open jeromekelleher opened 3 years ago

jeromekelleher commented 3 years ago

I'm finding the time and population size conversions in the sweeps model super confusing, and I would really like to decouple the trajectory simulation from the sweep definition. I don't think this will be that hard, and it will definitely be worth it.

A sweep model is invoked at a particular time, and on a particular population. All it needs in order to compute the trajectory is a method that returns the size of that population at a time t (plus it's parameters, ploidy, etc?)

What does the trajectory simulator fundamentally need to know @andrewkern?

I want to get to a point where we can say

sweep = Sweep(position=L/2, trajectory_generator=GenicSelection(...))

It would be much easier if time was all in units of generations also - is there a particular reason why dt is currently in scaled coalescent units? (And, what ploidy does it assume?)

I feel like if we can agree on a definition of how to do this in Python using (e.g.) the DemographyDebugger to compute population sizes then we'll be in a much better position to start broadening out to support more types of sweep.

andrewkern commented 3 years ago

I'm finding the time and population size conversions in the sweeps model super confusing, and I would really like to decouple the trajectory simulation from the sweep definition. I don't think this will be that hard, and it will definitely be worth it.

yeah i agree. bear with my being a bit slow here-- i'm juggling a lot of tasks at the moment

A sweep model is invoked at a particular time, and on a particular population. All it needs in order to compute the trajectory is a method that returns the size of that population at a time t (plus it's parameters, ploidy, etc?)

What does the trajectory simulator fundamentally need to know @andrewkern?

yes a trajectory needs to be able to access the size of the population at time t, and the parameters of the associated trajectory type. the full parameter set that sweeps would need would be the start and end freqs, the selection coefficient (s; s=0 being the special case of drift), the frequency at which the allele first came under selection (call it f_0, e.g., f_0 = 1/2N for de novo beneficial mutations), and perhaps eventually the dominance coefficient.

I want to get to a point where we can say

sweep = Sweep(position=L/2, trajectory_generator=GenicSelection(...))

It would be much easier if time was all in units of generations also - is there a particular reason why dt is currently in scaled coalescent units? (And, what ploidy does it assume?)

so the dt is scaled in coalescent units because that is the timescale for the diffusion approximation through which we generate the trajectories, however you'll note that while i'm saving sweep times in coalescent units here when the actual ancestry simulation happens the times are converted to generations here

looking over this, this is way more confusing that it needs to be probably.

I feel like if we can agree on a definition of how to do this in Python using (e.g.) the DemographyDebugger to compute population sizes then we'll be in a much better position to start broadening out to support more types of sweep.