time calibration - Githubissues

wsdewitt commented 4 years ago

Rerun 1KG using the "strict mask", where we have a total mask like mask = (strict mask) ∪ (repeat mask) ∪ (conservation mask).
After fitting mushi using the dimensionless time in the model, we'll need a method mushi.kSFS.calibrate(μ0), where μ0 = (site-wise rate from literature) * ((genome size) - (mask size)) fixes the total mutation rate at t=0

wsdewitt commented 4 years ago

It will also be good to normalize mutation rates by (masked) genomic ancestral triplet content. This will make the rate types more directory comparable as mutations per site per generation (rather than mutations per genome per generation)

wsdewitt commented 4 years ago

The current time calibration seems to be placing events (out-of-Africa and the TCC pulse) older than expected. For example, here's demography and TCC for CEU.

The current approach to time calibration works as follows:

Compute a masked size for each autosome that excludes repeats, conserved sites, sites outside the 1KG strict mask, and sites with low-confidence ancestral state assignment. The resulting autosomal masked genome size is computed as 2.2Gb.
multiply the above by 1.3e-8 (from Tychelle's paper) giving a mutation rate measured per genome per generation.
After performing inference in the dimensionless time (measured in generations), multiply times by a generation time estimate of 29 years for plotting.

Instead of fitting the demography, we can instead assume the Tennesen et al. demography for Europeans. Here's the fit to the total SFS:

This is puzzling because it doesn't fit the CEU SFS at all. The overall scale of the number of segregating sites (S) is wrong. The expected S under Tenessen (and the mutation rate computed as above) is 2,900,838, whereas the actual SFS has 9,392,857 variants.

wsdewitt commented 4 years ago

@kelleyharris Looking at the Speidel et al. paper I see demographies with OOA bottlenecks that start about 200 kya:

Although the text of the paper states that they find the TCC pulse 10-20kya (roughly consistent with your paper), if you look at the plot it starts almost 100kya:

The pulse peaks around 20kya, whereas yours started at 15kya (I think the way Leo estimates mutation spectrum changes is biased, but I digress)

If I proceed with MUSH inference assuming the Tennessen demography (which was assumed in the eLife paper), I get close to reproducing your timing of the TCC pulse without any parameter fiddling:

But the Tennessen demography is wildly mis-specified to the total SFS from 1KG data, and this is going to introduce erroneous time scaling. Mushi and Relate both jointly infer demography on the same data used to infer mutation spectrum histories, and their timescales of events (OOA and TCC) roughly agree.

wsdewitt commented 4 years ago

Here's another paper with a SMC++ plot with bottlenecks that are timed like ours (note: time is generations, so multiply by 29 in your head): https://www.nature.com/articles/s41588-018-0215-8

kelleyharris commented 4 years ago

Box 2 in the Narasimhan et al. South Asian ancestry paper contains a lot of detail about different migrations between Europe, East Asia, and South Asia. At the very bottom, they say that "Steppe ancestry in modern South Asians is primarily from males." This implies that if the steppe migration was what carried the europulse to south asia, there should be less europulse signal on the south asian X chromosome compared to the autosomes, a testable prediction. The europulse signal is in general attenuated on the X chromosome, but is definitely still present at least in Europeans. https://science.sciencemag.org/content/365/6457/eaat7487.long

wsdewitt commented 4 years ago

After some discussion, it seems that the older europulse timing has a good chance of being correct. There are a few things to check, and the big question will be how is it that East Asians have no TCC pulse, when they diverged from Europeans after the revised europulse began. The answer may be that the europulse was structured by population in ancient EUR (farmer Vs hunter-gatherer groups). Modern EUR are admixtures of all the ancient EUR groups, so whichever one ancient group harbored the TCC pulse, it is present in all modern EUR. But modern East Asians likely have affinity to one or another of these groups that existed as separate populations at the time of the EUR-EAS divergence.

To do:

[x] Quantify how certain we are about europulse timing: is it robust to regularization, e.g. L1 Vs L2?
[ ] Why is Tennesen so poorly mis-specified to the SFS? What mutation rate did they use? Does it fit better with a higher rate and/or shorter generation time?
[ ] The europulse is also present in South Asians. How does the timing of SAS/EUR admixture compare to the revised europulse timing? See Vagheesh Narasimhan paper on SAS/EUR admixture.
[ ] Ancient population affinities of 1KG individuals, especially EAS, may identify (or exclude) candidate ancient populations in which the europulse began. Does this concord with TCC fractions computed in ancient genomes from the Willerslev lab?

wsdewitt commented 4 years ago

^^ @kelleyharris Wow cool idea! I could pretty easily run Mushi to infer the TCC mutation history on X Vs autosomes in SAS (and also in other pops as a null)

kelleyharris commented 4 years ago

I think that'd be great! Given that European migration into the Americas was also sex-biased, we might also expect to see less mutation spectrum affinity between Europe and the Americans on the X chromosome.

kelleyharris commented 4 years ago

Figure 2c in the SGDP paper uses the MSMC to infer that the European/East Asian split took place between 19 and 37 kya https://www.nature.com/articles/nature18964

kelleyharris commented 4 years ago

At the same time, if you look at the cross-coalescence profile of the French/Han Chinese split, the cross-coalescence rate hovers around 90% for the whole period or 100-200 kya. That might support our population structure hypothesis (i.e. that a component of the European population separated from East Asians much earlier)

wsdewitt commented 4 years ago

I'm not used to reading these plots: does y=1 mean the two pops are panmictic? What's the nu=4.3e-10 on the time axis label?

kelleyharris commented 4 years ago

Yeah, y=1 means that at time t, a pair of lineages originating in the two separate populations is 100% as likely to coalesce as a pair of lineages originating in the same population. Good question about the x axis label though, i'm confused too. Usually the MSMC x axis label is just kya, same as the x axis label of PSMC plots.

kelleyharris commented 4 years ago

ah according to the supplement, nu is mu divided by the generation time, so it's just specifying a mutation rate estimate

wsdewitt commented 4 years ago

I wonder if the older timing of the TCC pulse would alter Guy Sella's theoretical conclusions excluding the possibility of a mutator allele being the cause.

wsdewitt commented 4 years ago

Checking off the first box on the todo list above, here is a demonstration of robustness of pulse timing to regularization. The choice boils down to how much L2 Vs L1 penalization is used on the time derivative. The rows in this plot are increasing L2 smoothness. The left column shows the 3-SFS data for TCC>TTC (dots ) and the fit (lines). The right column shows the inferred TCC component. The first row is maximally sharp (a very small L2 coefficient just to induce strong convexity), and the last row has too much L2 smoothness (as the fit is poor).

The timing of the start of the pulse seems to be fairly robust to the smoothness choice. If we exclude the last row due the poor fit, it seems to be starting ~80-120kya and ending ~3-10kya, with a less complete recovery in Finns.

kelleyharris commented 4 years ago

I'm not sure the altered timing would change Guy Sella's theoretical conclusions, since the point of his story was more that a genetic mutator that affected all TCC sites would likely be too deleterious to last for long. However, the result might change David Reich's conclusion that the pulse is too fast to have a genetic cause.

Is there already a plot showing that using the tennessen et al model yields the more recent and fast pulse timing?

wsdewitt commented 4 years ago

^Here's the inferred TCC pulse assuming the Tennessen demography, with the panels from left to right showing:

terrible fit to the total SFS
excellent fit to the relative SFS for TCC
more recent and rapid TCC pulse
For visual reference here's the one from H&P:

kelleyharris commented 4 years ago

that's great, i'm glad we have such a convincing and clear explanation of where the revision of the pulse timing likely comes from

wsdewitt commented 4 years ago

Yeah but I still don't know why the Tennessen demography fits the 1KG SFS so poorly, so it's not totally clear to me yet.

Another weird observation, here's the EUR Tennessen demography in stdpopsim, showing only one bottleneck ~23kya:

And here's figure 2b from Tennessen:

Why don't we see the OOA bottleneck at 51kya in stdpopsim?

Also I wonder if the Tennessen paper might have used a wrong mutation rate by using (site-wise rate) x (genome size) rather than (site-wise rate) x (exome size). The SM doesn't provide any detail about how they did this.

kelleyharris commented 4 years ago

Huh that does seem odd. The model seems correctly parameterized in the stdpopsim documentation. The bottleneck 23kya in your plot is down to a population size of 1,000, which is the correct "CEU pop size after EU/AS divergence." If I'm reading the axis scale right, the constant ancient pop size is 2100, which is supposed to be the constant population size of the out of Africa population after the split 51 kya. So what's missing is the oldest ancient population size epoch, which could explain why the predicted number of SNPs is too low. Do you want to flag this for the stdpopsim team?

wsdewitt commented 4 years ago

If I plot both the AFR and EUR populations in the stdpopsim model, I get this:

So maybe what I'm supposed to do is switch over to plotting from the AFR population at the time of OOA event (2,040 generations ago). It seems weird that that would be necessary, but maybe I'm not grokking msprime models?

kelleyharris commented 4 years ago

I guess the real question is whether simulating the European population on its own gives you the correct European model or the one that's plotted by default. It's probably worth forcing the simulation to be like you suggest and if that increases the heterozygosity we'll know this is a bug they need to fix

wsdewitt commented 4 years ago

I took a look at how the the PopSim Consortium did things for their preprint. This line in their repo for the paper's plots (hooray pre-print stage open sci!) shows that they generate the population size history plots using the coalescence_rate_trajectory() method in the msprime API, rather than the population_size_trajectory() method I had been using. When I make this change things look as expected.

Running mushi gives the following. It's still not quite giving enough variants (about half of what we observe).

wsdewitt commented 4 years ago

Although Tennessen et al. don't report the mutation rate they used, the Gutenkunst paper reports a rate of 2.35e-8, based on phylogenetic calibration. This is about 2-fold higher than the current more accurate estimates based on trio data, so if Tennessen et al. used a similar rate that would fix the problem of the scale of the SFS (total number of segregating sites) not being correct.

As an aside, the popsim API associates mutation rate with a genome (1.29e-8 for humans), but not with each demographic model for that genome. It seems like you'd want the latter since those demographies are inferred using assumptions about mutation rate (which differ by model, most using a significantly higher rate than 1.29e-8, it seems).

kelleyharris commented 4 years ago

Yes if Tennessen et al were in fact using twice as high a mutation rate that should take care of the discrepancy. If you use the Gutenkunset et al mutation rate to generate your figure (following the paper trail outlined above), then the story becomes that the Tennessen model fits the SFS ok and yields a recent pulse, but the mushi model fits even better and predicts a more ancient pulse?

wsdewitt commented 4 years ago

Yeah, if I use the Gutenkunst rate, the Tennesen demography fits better in terms of scale (but not great more generally speaking):

Here's the observed Vs predicted number of segregating sites (still undershooting a bit, but much closer):

If I condition on this demography (which evidently was inferred assuming a rate like 2.35e-8) and infer a mush using the more recent rate 1.3e-8 (I'm guessing that's analogous to what was done in the eLife paper) the TCC pulse looks like this:

So do we conclude that an older rate estimate was cryptically modifying the diffusion timescale in the eLife paper via the assumed demography?

kelleyharris commented 4 years ago

great, given that the tennessen et al model wasn't even fit to whole genome data the mediocrity of that sfs fit looks completely believable to me. that model was fit to genic data (not whole genome data) and genes have a lower density of segregating sites due to elevated background selection

kelleyharris commented 4 years ago

looking closely at publication dates, tennessen et al 2012 came out in july, while scally and durbin's "revising the human mutation rate" paper came out in december 2012. so it would have been surprisingly prescient for tennessen et al to use the lower trio-inferred human mutation rate

wsdewitt commented 4 years ago

Nailed it!

wsdewitt commented 4 years ago

this will get closed when #46 is merged (the notebook lives on that branch)

harrispopgen / mushi

time calibration #40

To do: