tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
174 stars 87 forks source link

multiple merger documentation and truncation point #1216

Closed eldonb closed 4 years ago

eldonb commented 4 years ago

In the documentation for multiple-merger coalescents, I would state that time is measured in units of N^{alpha - 1} generations, rather than Ne^{alpha - 1}, where N is the census population size ( unless your notation for census size is Ne ?). The standard notation for "effective size" is Ne, which in this case would be (proportional to) N^{alpha - 1}.

The truncation point in the incomplete Beta-coalescent will depend on the actual underlying population model, but assuming a simple version can be taken as K/( K + m(alpha)), where m(alpha) = 1 + (2^{1 - alpha}) / (alpha - 1) in the haploid case (for 1 < alpha < 2), where K > 0 is the proportionality constant of the cut-off (each individual can produce at most KN 'juveniles' in the population model). The point is as alpha -> 1 the truncation point decreases.
In the diploid case (working with the Xi-Beta(2-alpha, alpha, K) coalescent) one can take m(alpha) = 2 + (2^{alpha}3^{1 - alpha})/(alpha - 1) (this can vary depending on the underlying population model but we don't need to go into that here).

jeromekelleher commented 4 years ago

I think the confusion here is that "population size" can be specified through the parameter Ne to the simulate function. However, with the new sim_ancestry interface, we've dropped Ne and are using population_size instead.

So, I think your idea is a good one @eldonb. @JereKoskela, thoughts?

JereKoskela commented 4 years ago

I don't have strong views on N vs Ne. I tend to think of N as the census size, and Ne as the population size which one needs to plug in to match the observed diversity, but that's far from well-established. Do we need notation for population sizes and coalescence probabilities elsewhere in the documentation? I think the most important thing is to be consistent.

The current implementation of the truncation point is a little bit more naive than what @eldonb described: basically, the user inputs the compound parameter truncation_point = K/(K + m(alpha)) directly. The disadvantage is that it's less explicit, but the advantage is that the standard, untruncated Beta-coalescent can be represented exactly as truncation_point = 1. If we had the user specify K instead, then we would need something like K = inf as a default parameter to get K/(K + m(alpha)) = 1. @jeromekelleher, do you see any pitfalls with something like that?

I think I've also made a small mistake in the formula for m(alpha) in the haploid case. I'll fix that, and take the opportunity to make any other tweaks we decide upon here at the same time.

jeromekelleher commented 4 years ago

Do we need notation for population sizes and coalescence probabilities elsewhere in the documentation?

I guess we probably don't. We don't really need to get into the specifics of the model so much for non-multiple-mergers coalescents, as the details of the models are well known. It's probably better to use N rather than Ne if census population size is what we mean (and vice-versa, I guess).

If we had the user specify K instead, then we would need something like K = inf as a default parameter to get K/(K + m(alpha)) = 1.

I'm sure we could make it work. Either inf or DBL_MAX are used in quite a few places as default values, so this should be fine I think.

JereKoskela commented 4 years ago

A question @jeromekelleher: I think I've made all the necessary changes in implementation, but wanted to run the multiple merger statistical tests in verification.py before opening a PR. Running python3 verification.py -c BetaSFS results in AttributeError: module 'msprime' has no attribute 'simulator_factory' I see that simulator_factory was changed to _parse_simulate in #1190, but making the same switch in verification.py via ctrl+r results in AttributeError: module 'msprime' has no attribute '_parse_simulate' as well. Any idea what the problem might be?

jeromekelleher commented 4 years ago

Erp, my bad @JereKoskela - I forgot to update the stats tests when doing recent refactorings. I'll fix ASAP.

jeromekelleher commented 4 years ago

1220