Use "real" coalescent simulation, instead of just the forward model

harrispopgen / mushi

[mu]tation [s]pectrum [h]istory [i]nference

https://harrispopgen.github.io/mushi/

MIT License


Use "real" coalescent simulation, instead of just the forward model #11

Closed wsdewitt closed 4 years ago

wsdewitt commented 5 years ago

Popsim

kamdh commented 5 years ago

In other words, you're testing our inference on a forward model that's more sophisticated? This will tell us whether the framework we are using is robust to model misspecification.

wsdewitt commented 5 years ago

Yep, those sims generate coalescent trees, then mutations on them, according to realistic demographies with complications like linkage and population structure.
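To make "trees first, then mutations" concrete, here is a minimal toy sketch in plain Python. It is not msprime or stdpopsim and has none of the realistic complications: it assumes a single non-recombining locus, constant population size, and time in coalescent units. The names `simulate_sfs`, `poisson`, and `theta` are made up for illustration.

```python
import random


def poisson(mean, rng):
    """Draw a Poisson variate by counting unit-rate exponential arrivals."""
    count, total = 0, rng.expovariate(1.0)
    while total < mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def simulate_sfs(n, theta, rng):
    """Simulate one Kingman coalescent tree for n samples, then drop
    mutations on its branches.

    theta / 2 is the mutation rate per unit of branch length. Returns the
    unfolded SFS as a list of length n - 1, where entry i - 1 counts the
    mutations carried by exactly i samples.
    """
    sfs = [0] * (n - 1)
    # Each lineage is tracked only by the number of samples beneath it.
    lineages = [1] * n
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time until the next coalescence among k lineages.
        t = rng.expovariate(k * (k - 1) / 2)
        # Mutations on each lineage during this interval are Poisson.
        for size in lineages:
            sfs[size - 1] += poisson(theta / 2 * t, rng)
        # Merge a uniformly chosen pair of lineages.
        i, j = rng.sample(range(k), 2)
        merged = lineages[i] + lineages[j]
        lineages = [s for idx, s in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return sfs


rng = random.Random(0)
one_tree = simulate_sfs(10, 5.0, rng)  # SFS from a single tree: noisy
```

Under this toy model the expected count in frequency class i is theta / i, so averaging many independent replicates should approach that curve, while any single tree is much noisier.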

wsdewitt commented 5 years ago

A first place to look may be the starting kit for ∂a∂i (Gutenkunst et al.), which includes some simulated eta(t) and SFS.

When we start trying to invert mutation spectrum evolution (see #12) we'll want to look at:

Kelley has access to a beta version of SLiM (forward model simulator) that simulates mutation spectra.
Some code from Jeff Spence (emailed) shows how to simulate time-dependent mutation spectra over msprime coalescent trees.

kamdh commented 5 years ago

Before jumping into this, I'd be sure to test the inference method where the model is correct more thoroughly, to be sure everything is working.

wsdewitt commented 5 years ago

Adding a few items on this issue from today's mushi chat:

[ ] we want to simulate the effect of GC-biased gene conversion, and see if we can separate it from real signatures (see #14)
[ ] population structure as misspecification
[ ] how do errors in our belief about $\eta$ affect our $\mu$ inference?

wsdewitt commented 5 years ago

When we simulate with msprime (linkage disequilibrium misspecification) we see funky outlier points in the SFS. These are due to deep branches in the coalescent that don’t get rearranged, so we effectively sample only one tree there, which gives the same frequency for all mutations on that branch. If we sampled more trees, the outliers would smooth out into the neighboring frequency classes. One way to deal with this is to coarsely bin the SFS at higher frequencies (as done in fastNeutrino).
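A rough sketch of what such coarse binning could look like. The scheme here (keep the first few frequency classes separate, then let bin widths grow geometrically) is only an assumption for illustration; it is not claimed to be what fastNeutrino or mushi actually does, and `make_bins` and `bin_counts` are hypothetical names.

```python
def make_bins(n_classes, keep, width=2):
    """Return half-open (start, stop) index ranges covering range(n_classes).

    The first `keep` frequency classes each get their own bin; after that,
    bin sizes grow geometrically by a factor of `width`, so high-frequency
    classes are pooled together.
    """
    bins = [(i, i + 1) for i in range(min(keep, n_classes))]
    start, size = keep, width
    while start < n_classes:
        stop = min(start + size, n_classes)
        bins.append((start, stop))
        start, size = stop, size * width
    return bins


def bin_counts(sfs, bins):
    """Pool SFS entries within each bin by summing them."""
    return [sum(sfs[a:b]) for a, b in bins]


# 19 frequency classes (n = 20 haplotypes), keeping the first 5 unbinned.
bins = make_bins(19, keep=5)
# bins == [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 7), (7, 11), (11, 19)]
```

Pooling leaves the total number of segregating sites unchanged; it only gives up resolution at high frequencies, where single-tree outliers would otherwise dominate.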

wsdewitt commented 5 years ago

I think binning is pretty straightforward in terms of the PRF log likelihood, since the expectation of a bin will be the sum of the expectations of the elements in the bin.
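A minimal sketch of that idea, assuming independent Poisson counts per frequency class (the PRF model). A sum of independent Poisson variables is again Poisson with the summed mean, so the binned likelihood has the same form as the unbinned one. The function names and the `(start, stop)` bin representation are hypothetical, not mushi's actual API.

```python
import math


def poisson_loglik(counts, expected):
    """Sum of independent Poisson log-probabilities."""
    return sum(
        x * math.log(m) - m - math.lgamma(x + 1)
        for x, m in zip(counts, expected)
    )


def binned_loglik(sfs, expected_sfs, bins):
    """PRF log-likelihood after pooling frequency classes.

    `bins` is a list of half-open (start, stop) index ranges. Each bin's
    observed count and expectation are the sums over its elements.
    """
    pooled_counts = [sum(sfs[a:b]) for a, b in bins]
    pooled_expected = [sum(expected_sfs[a:b]) for a, b in bins]
    return poisson_loglik(pooled_counts, pooled_expected)


sfs = [3, 1, 2]
expected_sfs = [2.5, 1.5, 1.0]
# Keep the first class separate and pool the last two.
ll = binned_loglik(sfs, expected_sfs, [(0, 1), (1, 3)])
```

With one bin per frequency class this reduces to the ordinary unbinned log-likelihood, which is a handy sanity check.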

wsdewitt commented 5 years ago

PR #22 includes an implementation of frequency binning as sketched above. Here's an example output plot showing the bins as vertical lines.

wsdewitt commented 4 years ago

Implemented using msprime and stdpopsim in test-msprime.ipynb