Enforcing flatness beyond the coalescent horizon

harrispopgen / mushi

[mu]tation [s]pectrum [h]istory [i]nference

https://harrispopgen.github.io/mushi/

MIT License

24 stars 5 forks source link

Enforcing flatness beyond the coalescent horizon #8

Closed wsdewitt closed 5 years ago

wsdewitt commented 5 years ago

We should heavily penalize the derivative as we approach the coalescent horizon. Here's the expected time to the most recent common ancestor of a sample of n haplotypes, assuming a constant population size eta_0. It's just a sum of the expected intercoalescent times. The coalescent has a lot of variance so we'd want to go out a few sigma beyond the expected T_MRCA. That's also straightforward: the variance is just the sum of the variances of the exponential intercoalescent times.

wsdewitt commented 5 years ago

Although it would be more correct to compute this inhomogeneously by integrating eta(t) for each interval, it might be sufficient to just use the best fit constant population size.

kamdh commented 5 years ago

I like your notebook, but how is it that you compute the horizon zone? What is $T_\sigma$?

kamdh commented 5 years ago

Also, you can probably just do asymptotics and assume n >> 1, and work with the leading order terms

wsdewitt commented 5 years ago

Definitely should get another look, but the idea is that the intercoalescent times are independent exponential rvs, so both the expectations and variances are additive over the interval expectations and variances. Then T_sigma is just sqrt of the var. It would be more correct to compute quantiles, but I'm just doing +/- 2sigma

wsdewitt commented 5 years ago

Also I should have tagged the commit to this issue: b4edd7e8f4f44eeb894d6023d3e42c5f835d21ad

kamdh commented 5 years ago

I don't think the ramp should necessary stop after the horizon time, it should keep increasing.

kamdh commented 5 years ago

Btw, T_expectation is 1.3e4 and 2*T_sigma is 1.4e4

wsdewitt commented 5 years ago

^^ why? ^Yeah so it's left bounded at t=0 in that case

kamdh commented 5 years ago

Before the horizon, we have some hope of seeing changes in eta(t), so we can smooth some but not too much. Near the horizon, we want to increase the smoothing, and this should keep increasing so that the function becomes increasingly smoother as time goes on. Really, we want to ensure that it's constant in the end. So my weights for this would be something constant plus a ramp that starts before the horizon and increases forever. Maybe a nonlinear ramp.

kamdh commented 5 years ago

The other thing to figure out is why we cannot just parametrize the max time as something close to the horizon and then solve a smaller problem. Something about how we have formulated the problem gives us those weird jumps.

wsdewitt commented 5 years ago

^^My intuition is that everything substantially beyond the horizon should be substantially but equally penalized, the idea being that all of that region is equivalent terra incognita for our SFS. It doesn't have to keep getting smoother once we're decidedly in that zone.

kamdh commented 5 years ago

Really, if we want it to be totally flat, it should be infinitely weighted so that the diffs cannot be nonzero.

wsdewitt commented 5 years ago

Um, how's that different from truncating our time grid and using infinite=True as currently implemented?

wsdewitt commented 5 years ago

Commit 0acc3e91 implements the penalty ramp as a linear function of the CDF of the TMRCA under the best fitting constant population size. The CDF is given in Wakeley, eqn 3.39, implemented in dement.py with DemEnt.tmrca_cdf(). The minimum and maximum penalties (lambda_diff_min and lambda_diff_max) are parameters to the inversion function DemEnt.invert().

The penalty ramp is visualized in the inversion plot (blue curve/axis below), which also gives a nice indication of where the (approximate) coalescent horizon is wrt our demography. Here's a demonstration using an exponential growth demography (note log axes).

A possible revision could be to use the inferred demography to compute the CDF, instead of the MLE constant demography. This will require an inhomogeneous analog to Wakeley's eqn 3.39.