tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
172 stars 84 forks source link

Recapitating from an incomplete pedigree #1865

Closed jeromekelleher closed 2 years ago

jeromekelleher commented 2 years ago

Consider the following code:

import msprime
import msprime.pedigrees

N = 5
tables = msprime.pedigrees.sim_pedigree(
    population_size=N, random_seed=1234, end_time=10
)

tables.sequence_length = 100
ped_ts = msprime.sim_ancestry(
    initial_state=tables,
    model="wf_ped",
    random_seed=12345,
)
final_ts = msprime.sim_ancestry(
    initial_state=ped_ts, population_size=N, model="dtwf", random_seed=123456
)

print(final_ts.draw_text())

The trees we get are:

41.00┊                              110        ┊   
     ┊                        ┏━━━━━━┻━━━━━━┓  ┊   
10.00┊                        8             9  ┊   
     ┊                        ┃             ┃  ┊   
9.00 ┊                        ┃             ┃  ┊   
     ┊                        ┃             ┃  ┊   
8.00 ┊                        ┃             ┃  ┊   
     ┊                        ┃             ┃  ┊   
7.00 ┊                       37             ┃  ┊   
     ┊                ┏━━━━━━━┻━━━━━━━┓     ┃  ┊   
6.00 ┊                ┃               ┃     ┃  ┊   
     ┊                ┃               ┃     ┃  ┊   
5.00 ┊                ┃              50     ┃  ┊   
     ┊                ┃             ┏━┻━┓   ┃  ┊   
4.00 ┊                ┃             ┃   ┃   ┃  ┊   
     ┊                ┃             ┃   ┃   ┃  ┊   
3.00 ┊                ┃             ┃   ┃   ┃  ┊   
     ┊                ┃             ┃   ┃   ┃  ┊   
2.00 ┊               83             ┃   ┃   ┃  ┊   
     ┊      ┏━━━━━━━┳━┻━━━┳━━━━━┓   ┃   ┃   ┃  ┊   
1.00 ┊     93       ┃    98     ┃   ┃   ┃   ┃  ┊   
     ┊  ┏━━━╋━━━┓   ┃   ┏━┻━┓   ┃   ┃   ┃   ┃  ┊   
0.00 ┊ 100 107 108 101 102 104 109 105 106 103 ┊   
   0.00                                     100.00 

This tree is a valid sample from a Wright-Fisher process with N=5 because we've simulated the pedigree under this model, and then completed the simulation under the same model. The pedigree finished at time 10, and we had nodes 8 and 9. These were then picked up as the samples in a DTWF simulation starting from this point, and all was good.

However, as @hyanwong and @petrelharp pointed out, there's a problem when we have an incompete pedigree. Suppose we break 10% of the links in the pedigree, like this:


N = 5
tables = msprime.pedigrees.sim_pedigree(
    population_size=N, random_seed=1234, end_time=10
)

parents = tables.individuals.parents
parents[0::10] = -1
tables.individuals.parents = parents

tables.sequence_length = 100
ped_ts = msprime.sim_ancestry(
    initial_state=tables,
    model="wf_ped",
    random_seed=12345,
)
print(ped_ts.draw_text())

the tree sequence we get out of the pedigree looks like this:

10.00┊          9                              ┊   
     ┊          ┃                              ┊   
9.00 ┊          ┃                              ┊   
     ┊          ┃                              ┊   
8.00 ┊          ┃                              ┊   
     ┊          ┃                              ┊   
7.00 ┊          ┃                              ┊   
     ┊          ┃                              ┊   
6.00 ┊          ┃                              ┊   
     ┊          ┃                              ┊   
5.00 ┊          ┃                 50           ┊   
     ┊          ┃            ┏━━━━━┻━━━━┓      ┊   
4.00 ┊          ┃           67          ┃      ┊   
     ┊          ┃       ┏━━━━┻━━━┓      ┃      ┊   
3.00 ┊          ┃       ┃        ┃      ┃      ┊   
     ┊          ┃       ┃        ┃      ┃      ┊   
2.00 ┊         83       ┃       85      ┃      ┊   
     ┊      ┏━━━╋━━━┓   ┃     ┏━━┻━━┓   ┃      ┊   
1.00 ┊      ┃   ┃   ┃   ┃    92     ┃   ┃  90  ┊   
     ┊      ┃   ┃   ┃   ┃   ┏━┻━┓   ┃   ┃   ┃  ┊   
0.00 ┊ 100 101 104 108 102 105 107 106 109 103 ┊   
   0.00                                     100.00 

So, we have nodes 100, 9, and 90 popping out of the pedigree at different times. Then, if we recapitate as before we get

31.00┊                      112                ┊   
     ┊             ┏━━━━━━━━━┻━━━━━━━━━━┓      ┊   
15.00┊            111                   ┃      ┊   
     ┊    ┏━━━━━━━━┻━━━━━━━┓            ┃      ┊   
14.00┊   110               ┃            ┃      ┊   
     ┊  ┏━┻━┓              ┃            ┃      ┊   
10.00┊  ┃   ┃              ┃            9      ┊   
     ┊  ┃   ┃              ┃            ┃      ┊   
9.00 ┊  ┃   ┃              ┃            ┃      ┊   
     ┊  ┃   ┃              ┃            ┃      ┊   
8.00 ┊  ┃   ┃              ┃            ┃      ┊   
     ┊  ┃   ┃              ┃            ┃      ┊   
7.00 ┊  ┃   ┃              ┃            ┃      ┊   
     ┊  ┃   ┃              ┃            ┃      ┊   
6.00 ┊  ┃   ┃              ┃            ┃      ┊   
     ┊  ┃   ┃              ┃            ┃      ┊   
5.00 ┊  ┃   ┃             50            ┃      ┊   
     ┊  ┃   ┃        ┏━━━━━┻━━━━┓       ┃      ┊   
4.00 ┊  ┃   ┃       67          ┃       ┃      ┊   
     ┊  ┃   ┃   ┏━━━━┻━━━┓      ┃       ┃      ┊   
3.00 ┊  ┃   ┃   ┃        ┃      ┃       ┃      ┊   
     ┊  ┃   ┃   ┃        ┃      ┃       ┃      ┊   
2.00 ┊  ┃   ┃   ┃       85      ┃      83      ┊   
     ┊  ┃   ┃   ┃     ┏━━┻━━┓   ┃   ┏━━━╋━━━┓  ┊   
1.00 ┊  ┃  90   ┃    92     ┃   ┃   ┃   ┃   ┃  ┊   
     ┊  ┃   ┃   ┃   ┏━┻━┓   ┃   ┃   ┃   ┃   ┃  ┊   
0.00 ┊ 100 103 102 105 107 106 109 101 104 108 ┊   
   0.00                                     100.00 

We have picked up the three pedigree founders and simulated their history out to coalescence, as required. However, this is not a valid Wright-Fisher sample.

When we start the recapitation simulation, msprime sees that there is an sample at node 90 in generation 1, so it sets things up accordingly and starts simulating a DTWF where 90 is a sample and there is a diploid population of 5 individuals which it can descend from. However, this is not true, because the size of the population it's in is effectively smaller due to the existence of the pedigree individuals.

It seems like we should be able to correct for this by reasoning about the number of lineages that are in the trees we're recapitating from at each time. This would be tricky, probably, so it would be good to get people's thoughts on what's happening here.

Pinging @sgravel @LukeAndersonTrocme @apragsdale for thoughts

jeromekelleher commented 2 years ago

I guess there's a few possibilities:

  1. Don't allow recapitating with roots at different times, and work directly with incomplete ancestry we get back from the simulation-thru-pedigree. Not particularly satisfying, but you could probably still do a lot with the incomplete trees.
  2. If incomplete trees aren't enough, complete your pedigree under the defined demography model. Possible, but would require implementing a pedigree simulator that can deal with complex demography (see #1864). However, I guess this pedigree simulator would have to still solve the issues about how many degrees of freedom it has left, given the number of extant pedigree individuals in each population.
  3. Don't worry about it, it's all an approximation anyway.

I guess 3 is defensible if our population size N is large relative to the number of individuals in the pedigree at a given time point. I'm not sure this is true, though, in many of the cases where we have pedigrees (say, cattle?).

apragsdale commented 2 years ago

In the "naive" approach, even if we were to adjust the "effective" N by subtracting the number of lineages that are in the trees at each generation in the DTWF simulation until the most ancient root of the pedigree, we'd still be imposing structure between in-pedigree vs out-of-pedigree samples and their ancestors. Unless I'm overlooking something, you should get back a valid WF sample by allowing coalescence back into the pedigree, by setting probability of picking a node within the pedigree each generation based on the number of lineages in the pedigree that generation. That does sound painful to set up, however, especially with recombination changing number of lineages at a given time in adjacent trees. We'd need to be able to remove existing branches in the tables, add intermediate nodes, and reintroduce the resulting split edges. (Or am I way off base here with what the main issue with that is?)

There's also the issue that many (most?) pedigrees we'll want to work with aren't from a random-mating WF setting, maybe with overlapping generations or non-ultrametric(?) paths through the pedigree, so trying to fill in the missing data in the pedigree with a WF simulation would result in mismatched mating assumptions for coexisting individuals from the same population. Perhaps for that reason, I'd lean toward your option 1 that you listed.

jeromekelleher commented 2 years ago

Sounds about right to me @apragsdale, all very tricky.

hyanwong commented 2 years ago

Well reasoned, @jeromekelleher - thanks for picking this up and running with it. As @apragsdale points out, I think the inherent problem is that it is (very) unlikely in real data that a WF model will be a good approximation of the "missing" part of the pedigree (i.e. between the point of the youngest node with a missing pedigree link and the oldest node in the pedigree). However we fill in this data is likely to give a wrong answer, unless, as in your suggestion (2), we actually allow some sort of user-specified model.

I'm inclined to go with 1 until someone has the time to think about implementing 2.

Also, note that this isn't really about differential times of the oldest "roots" in the pedigree. It's really about whether the pedigree has any missing links in it, and how to deal with that. We could have all the lineages in the pedigree starting at the same time in the past, but if there are missing parts of the pedigree, we can't fill them in without some demographic model.

hyanwong commented 2 years ago

Also, in a tutorial could we give a Python example showing how to do the "naive" thing and recapitate part of the resulting tree sequence with a WF simulation, up to the point of the oldest node in the pedigree? That would then make a tree sequence suitable for a later round of full recapitation, I think? But it would hit the problems mentioned above, so we should hedge it all with text that says "this is a massive approximation, only valid for large populations where the reverse WF simulated lineages can't join into the higher parts of the pedigree".

LukeAndersonTrocme commented 2 years ago

I agree with what the others have said. One thought I have is to let the user know for each sample, what fraction of their ancestors are within or beyond the tree compared to other samples. This would allow users to be more weary of the simulations for individuals with less complete pedigrees. I don't know if this would be very useful, after all, I can imagine some careful pruning of the pedigree before simulations would take care of many of these issues. So in that regard, maybe a simple warning message would motivate people to prune their trees themselves to avoid some of these issues.

LukeAndersonTrocme commented 2 years ago

Regarding option 2, I agree this sounds like a lot of work, but I wanted to mention that at least for the BALSAC genealogy, we have some spatial information (town locations) and would be able to feed some fairly high resolution demographic parameters to plug into the model. If we know that a root is at a give place and genealogical depth, we could just plug in the known demographic parameters for that place at that time (i.e. we would know that 90% of people in this town, did not migrate, but the 10% who did came from townA)

sgravel commented 2 years ago

I am not too concerned with respecting Wright-Fisher proportions, for reasons others mentioned above.

I would go with option 3 for now, that is, explicitly state that we are modelling all missing parents as coming from one or more populations that are distinct from the genealogical population. It's an approximation, but it's one that is easy to explain and understand.

I think the correct thing to do in option 2) really depends on the structure of the missing data -- do we have random-ish missing links, or do the missing links represent large-scale demographic events (e.g., in Quebec, families moving to the US for a decade or two then coming back). I think the way to do 2) is definitely to pre-simulate genealogies as suggested above, so I feel like it would be an extension that would not add too much code refactoring.

hyanwong commented 2 years ago

I would go with option 3 for now, that is, explicitly state that we are modelling all missing parents as coming from one or more populations that are distinct from the genealogical population. It's an approximation, but it's one that is easy to explain and understand.

I agree that it is easy to explain. But perhaps we shouldn't build in the capability as a tweakable knob in the pedigree API, but simply provide recipes for how to do that using some tskit python, until we figure out what the correct solution actually is?

gregorgorjanc commented 2 years ago

Option 3 with a warning to provide a way to recapitate pedigrees?

Option 1 would be of course the other extreme.

Why is the large N in the population, compared to what is in the pedigree, a requirement? I never understood this part of the coalescent. As to what is the situation in reality, it will vary widely.

Case A) There are closed populations where all individuals are part of one or many pedigrees (many if we would maintain dis-joint groups of individuals that never mate) - so, in those cases there are no other individuals (from that population, but there will be other populations of course). I am not fully clear about the simulation steps up the pedigree. If one would like to allow for coalescence within the pedigree for the case A, could you step back in time and every generation check if there are any nodes that don't have a parent and allow coalescence into other available parents within a pedigree?

Case B) There will be cases where we know pedigree for only a small subset of a population - in those cases we could have a large N in pop and a small N in pedigree. Use option 3 in this case?

Case C) Everything in between cases A and B.

For each case multiple dis-joint pedigrees could represent different populations - like you showed in the meeting. There is an additional choice (in addition to a choice between the case A and B) - do you allow coalescence from one pedigree into another before all nodes coalesced within each pedigree (I think this makes sense for the case A, but less for the case B).

Yeah, real cases are hard!

sgravel commented 2 years ago

@hyanwong I think that drawing the founders from distinct population is an appropriate solution in at least a few important cases, and I would expect that this will be the most common use case. I also feel like letting the recapitation step occur separately will lead to more bugs and confusion. But it is also a valid approach.

@gregorgorjanc I think most useful cases will fall will be closer to A (largely complete genealogy, can broadly ignore missing links) or to B (lots of disjoint pedigrees, very sparse, can treat missing links as random). The cases in between (large connected very incomplete genealogies) exist, but they will just be quite hard to interpret in a simulation framework without having to specify a rather complex missingness model. I think Option 3 covers cases A and B pretty well.

petrelharp commented 2 years ago

I vote for option (3), with no warning, just clear documentation. I suspect that in applications it'll be more common that the pedigree represents a complete pedigree, in which case this behavior is correct (up to minor quibbles about the population size of the WF). And, we've even heard from someone who would want to use this in the case where different roots are at different times (they've got a simulator that simulates the pedigrees within outbreaks, that can be of different ages, with no shared history between them).

gregorgorjanc commented 2 years ago

@petrelharp while I agree with option 3, I would like to point that real pedigrees always have some missing data - the farther back in time you go, the more missingness there is. "Complete pedigrees" are very rare and actually an oxymoron, because you always run out of info about ancestors as you go back into the past;)

petrelharp commented 2 years ago

True!

apragsdale commented 2 years ago

For the case of a nearly complete pedigree, option 3 would introduce increased relatedness or inbreeding among individuals/samples from the shallow portion of the pedigree, since more of their ancestors would be found in the detached population with size smaller than N (possibly much smaller). So depending on the relatedness/inbreeding structure and the size of the pedigree, you could find recent inbreeding and relatedness artificially correlating with the pedigree depth of samples, right?

So while option 3 might be good enough for some scenarios, I imagine it could distort things quite a bit in others. Maybe this is a minor effect in most scenarios of interest, but perhaps deserves checking a bit more thoroughly?

jeromekelleher commented 2 years ago

Thanks for the input all, this is very helpful. I'll chew on it for a few days and come back with a proposal.

sgravel commented 2 years ago

@apragsdale, To clarify, I'll define three models:

The "remainder" model will refer to a model with the pedigree plus a remainder population with size Ne - N{pedigree}. To do this, you need to set up a migration rate between the two pedigree and the remainder population that are typically very high (m of order 1).

Because setting up these migrations correctly is hard, you could also ignore these migrations. I'll call this the "remainder without migration" model. I believe that this is what you are referring to and is indeed a bad model because of pathological increases in relatedness among the founders

The third option is the "founder" model, where each missing individual is taken to come from a distinct population of size Ne with no adjustment for size. That is the model I prefer, since 1-it works well when the genealogy corresponds to a founder population, and 2- even in the case of missing links, the errors are likely to be tolerable in most cases where genealogical simulations make sense.

I think in practice many genealogies will have a combination of missing links due to missing data within a population and missing links due to immigration. I think that the founder model does an acceptable job at both.

gregorgorjanc commented 2 years ago

@sgravel I like your model formulation and agree that the "founder" model should work well-ish enough.

jeromekelleher commented 2 years ago

Linking this to #2009 - we need to spell out the model here in the docs.

jeromekelleher commented 2 years ago

Closing this issue as I think the basic decision has been made, we just need to document it (#2009)