tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0

Save information about sweep in SweepGenicSelection model #2243

Open jeromekelleher opened 6 months ago

jeromekelleher commented 6 months ago

A few different people have been asking about how we could keep more information about sweeps in the SweepGenicSelection model. It's not entirely clear to me how we should do this, but here are some basic options:

  1. Write a mutation representing the advantageous allele to the mutations table (but what node is it at? Seems tricky to do in practice given that it's all stochastic).
  2. Add some unary nodes to mark the beginning and the end of the sweep. That is, when we move a lineage into label 1 here we create a node to track this event and add an edge, and when we move lineages back into label 0 here we add another edge.

Option 2 seems like the only viable approach to me, and fits in reasonably well with the additional_nodes APIs that are just about to drop in v 1.3. So, I guess we'd add a NodeType.LABEL_MIGRANT or something to keep track of this?
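To make option 2 concrete, here is a minimal sketch in plain Python of what recording a unary node and an edge on each label change could look like. It does not use msprime; `Tables`, `Lineage`, `move_lineage`, the stored `label` field and the bit value chosen for `LABEL_MIGRANT` are all hypothetical names and values made up for illustration.

```python
from dataclasses import dataclass, field
from enum import IntFlag


class NodeType(IntFlag):
    # Hypothetical flag values for illustration only; these are not
    # taken from msprime.
    SAMPLE = 1
    LABEL_MIGRANT = 1 << 23


@dataclass
class Tables:
    # Toy stand-ins for node and edge tables.
    nodes: list = field(default_factory=list)  # (flags, time, label)
    edges: list = field(default_factory=list)  # (left, right, parent, child)

    def add_node(self, flags, time, label):
        self.nodes.append((flags, time, label))
        return len(self.nodes) - 1


@dataclass
class Lineage:
    node: int     # node currently at the top of this lineage
    left: float
    right: float
    label: int    # 0 or 1, as in the sweep model's labels


def move_lineage(tables, lineage, new_label, time):
    """Record a unary node and an edge when a lineage changes label."""
    parent = tables.add_node(NodeType.LABEL_MIGRANT, time, new_label)
    tables.edges.append((lineage.left, lineage.right, parent, lineage.node))
    lineage.node = parent
    lineage.label = new_label


tables = Tables()
sample = tables.add_node(NodeType.SAMPLE, 0.0, 0)
lin = Lineage(node=sample, left=0.0, right=100.0, label=0)
move_lineage(tables, lin, new_label=1, time=5.0)   # lineage enters label 1
move_lineage(tables, lin, new_label=0, time=12.0)  # lineage returns to label 0
```

After the two moves the toy tables hold one sample node, two unary nodes, and two edges chaining them together, so the time the lineage spent in label 1 can be read off the node times.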

GertjanBisschop commented 6 months ago

Yes. I agree, option 2 is the way to go. There shouldn't be much in the way of doing this apart from defining a new NodeType and modifying msp_move_individual to record a node and the corresponding edges. The ability to update node flags (updating a recombinant node with the additional label_migrant flag) is already in place. The naming of the new NodeType might depend on the (distant) plans with the structured coalescent. Would we define a new node type for every structured coalescent model?
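As a rough illustration of what "updating a recombinant node with the additional label_migrant flag" could mean, the sketch below ORs a second bit flag onto an existing node's flags. It is plain Python rather than msprime code; `add_flag` is a hypothetical helper, and both bit values are chosen for illustration (`LABEL_MIGRANT` does not exist in msprime at the time of this discussion).

```python
from enum import IntFlag


class NodeType(IntFlag):
    # Illustrative bit positions; assumptions, not msprime's definitions.
    RECOMBINANT = 1 << 17
    LABEL_MIGRANT = 1 << 23  # hypothetical new node type


# Toy node flags column: node 0 is unflagged, node 1 is a recombinant node.
node_flags = [0, int(NodeType.RECOMBINANT)]


def add_flag(node_flags, node_id, flag):
    """Set an extra bit on an existing node without clearing the others."""
    node_flags[node_id] |= flag


add_flag(node_flags, 1, NodeType.LABEL_MIGRANT)

# Both events can now be tested for independently with a bitwise AND.
is_recombinant = bool(node_flags[1] & NodeType.RECOMBINANT)
is_label_migrant = bool(node_flags[1] & NodeType.LABEL_MIGRANT)
```

Because each node type occupies its own bit, a single node can mark both events and no second node is needed.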

jeromekelleher commented 6 months ago

I'm a bit unclear as to what the node type should represent to be honest, but then my understanding of the structured coalescent is pretty hazy. Let's see if others have thoughts.

molpopgen commented 6 months ago

If I understand what folks have been asking for, they wish to know if a node is in the FAVORED "deme" (the one carrying the beneficial allele) or the UNFAVORED "deme" (the wild-type). One could imagine setting a node flag to 1 to represent FAVORED when alleles move into that "deme" and to 0 when they move out of it?

andrewkern commented 6 months ago

if i follow, don't we already have this info in the segment label?

GertjanBisschop commented 6 months ago

You are right @andrewkern, we keep track of that information during the simulation, but it is not stored in the tree sequence/tables. What we want to achieve is indeed storing whether a node is in the FAVORED or UNFAVORED deme, but without overloading the migration concept, while keeping this general enough for future uses of the structured coalescent with, for example, inversions or sweeps with actual population structure.
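For illustration, if label changes were recorded as time-stamped events along a lineage, the FAVORED or UNFAVORED state at any time could be recovered afterwards with a simple lookup. This is a sketch in plain Python, not msprime code; `label_at`, the `transitions` list and the constants are hypothetical.

```python
from bisect import bisect_right

UNFAVORED, FAVORED = 0, 1


def label_at(transitions, time, initial=UNFAVORED):
    """Return the label a lineage carries at `time` (backwards in time).

    `transitions` is a list of (time, new_label) pairs sorted by time,
    one per recorded label-change node on the lineage.
    """
    times = [t for t, _ in transitions]
    i = bisect_right(times, time)
    return initial if i == 0 else transitions[i - 1][1]


# One lineage that enters the FAVORED background at time 5 and leaves at 12.
transitions = [(5.0, FAVORED), (12.0, UNFAVORED)]
before = label_at(transitions, 3.0)
during = label_at(transitions, 8.0)
after = label_at(transitions, 20.0)
```

Nothing here depends on migration records, so the same lookup would apply to any two-background model, such as an inversion.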

andrewkern commented 6 months ago

okay neat. many of the structured coalescent models are time inhomogeneous (e.g. a beneficial allele arises at some point, structuring the population from that point forward in time), so the idea of adding unary nodes suggested above is appealing to me to mark the potential beginning and end of such phases.

maybe the distinction to be made would be between structured coalescent that is induced by mutation (e.g. selective backgrounds, inversions, duplications) versus by migration among demes?