wsdewitt closed this issue 5 years ago
Although it would be more correct to compute this inhomogeneously by integrating eta(t) over each interval, it might be sufficient to just use the best-fit constant population size.
I like your notebook, but how do you compute the horizon zone? What is $T_\sigma$?
Also, you can probably just do asymptotics, assume n >> 1, and work with the leading-order terms.
Definitely should get another look, but the idea is that the intercoalescent times are independent exponential RVs, so both the expectation and the variance of T_MRCA are additive over the intervals. Then $T_\sigma$ is just the square root of the variance. It would be more correct to compute quantiles, but I'm just using +/- 2 sigma.
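A minimal numeric sketch of that calculation (function names here are hypothetical, not from the repo), using the fact that the interval with k lineages is exponential with mean 2/(k(k-1)) in coalescent time units of eta_0 generations:

```python
import math

def tmrca_moments(n, eta0=1.0):
    """Expectation and variance of T_MRCA for n haplotypes under a
    constant population size eta0 (time measured in eta0 generations)."""
    # while k lineages remain, the waiting time is Exp with mean 2 / (k (k - 1))
    means = [2 / (k * (k - 1)) for k in range(2, n + 1)]
    expectation = eta0 * sum(means)                     # expectations add
    variance = eta0 ** 2 * sum(m ** 2 for m in means)   # Var of Exp(mean m) is m^2
    return expectation, variance

def horizon(n, eta0=1.0, n_sigma=2):
    """Approximate coalescent horizon: E[T_MRCA] + n_sigma * T_sigma."""
    e, v = tmrca_moments(n, eta0)
    return e + n_sigma * math.sqrt(v)
```

For n = 2 this reduces to a single Exp(1) interval, so the expectation and variance are both eta_0 (in squared units for the latter).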
Also I should have tagged the commit to this issue: b4edd7e8f4f44eeb894d6023d3e42c5f835d21ad
I don't think the ramp should necessarily stop after the horizon time; it should keep increasing.
Btw, T_expectation is 1.3e4 and 2*T_sigma is 1.4e4
^^ why?

^ Yeah, so it's left-bounded at t=0 in that case.
Before the horizon, we have some hope of seeing changes in eta(t), so we can smooth some, but not too much. Near the horizon, we want to increase the smoothing, and this should keep increasing so that the function becomes increasingly smooth as time goes on. Really, we want to ensure that it's constant in the end. So my weights for this would be something constant plus a ramp that starts before the horizon and increases forever. Maybe a nonlinear ramp.
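To illustrate, here's one possible shape for those weights (purely a sketch: the function name, lam_min, slope, and where the ramp starts are all made up, and the quadratic growth is just one nonlinear choice):

```python
import numpy as np

def diff_penalty_weights(t, t_horizon, lam_min=1.0, slope=1.0, start_frac=0.5):
    """Constant base penalty plus a ramp that starts before the horizon
    and keeps increasing without bound (quadratically, in this sketch)."""
    t = np.asarray(t, dtype=float)
    t_start = start_frac * t_horizon          # ramp begins before the horizon
    ramp = np.clip(t - t_start, 0.0, None)    # zero before t_start, grows after
    return lam_min + slope * ramp ** 2        # never plateaus past the horizon
```

Before t_start the weight is flat at lam_min, and beyond the horizon it keeps growing, pushing the diffs toward zero at late times.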
The other thing to figure out is why we cannot just parametrize the max time as something close to the horizon and then solve a smaller problem. Something about how we have formulated the problem gives us those weird jumps.
^^My intuition is that everything substantially beyond the horizon should be substantially but equally penalized, the idea being that all of that region is equivalent terra incognita for our SFS. It doesn't have to keep getting smoother once we're decidedly in that zone.
Really, if we want it to be totally flat, it should be infinitely weighted so that the diffs cannot be nonzero.
Um, how's that different from truncating our time grid and using infinite=True, as currently implemented?
Commit 0acc3e91 implements the penalty ramp as a linear function of the CDF of the TMRCA under the best-fitting constant population size. The CDF is given in Wakeley, eqn 3.39, and implemented in dement.py with DemEnt.tmrca_cdf(). The minimum and maximum penalties (lambda_diff_min and lambda_diff_max) are parameters to the inversion function DemEnt.invert().
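For reference, a self-contained sketch of that CDF in coalescent units (the actual DemEnt.tmrca_cdf() may use a different scaling convention): Wakeley's eqn 3.39 is Tavare's alternating sum over the number of ancestral lineages.

```python
import math

def tmrca_cdf(n, t):
    """P(T_MRCA <= t) for a sample of n haplotypes under a constant
    population size, with t in coalescent units (Wakeley eqn 3.39)."""
    total = 1.0  # k = 1 term of the sum
    for k in range(2, n + 1):
        falling = math.prod(n - i for i in range(k))  # n (n-1) ... (n-k+1)
        rising = math.prod(n + i for i in range(k))   # n (n+1) ... (n+k-1)
        total += ((-1) ** (k - 1) * (2 * k - 1)
                  * falling / rising
                  * math.exp(-k * (k - 1) * t / 2))
    return total
```

Sanity check: for n = 2 this collapses to 1 - exp(-t), the CDF of a single Exp(1) coalescence time.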
The penalty ramp is visualized in the inversion plot (blue curve/axis below), which also gives a nice indication of where the (approximate) coalescent horizon is with respect to our demography. Here's a demonstration using an exponential growth demography (note the log axes).
A possible revision could be to use the inferred demography to compute the CDF, instead of the MLE constant demography. This will require an inhomogeneous analog to Wakeley's eqn 3.39.
We should heavily penalize the derivative as we approach the coalescent horizon. Here's the expected time to the most recent common ancestor of a sample of n haplotypes, assuming a constant population size eta_0. It's just a sum of the expected intercoalescent times. The coalescent has a lot of variance so we'd want to go out a few sigma beyond the expected T_MRCA. That's also straightforward: the variance is just the sum of the variances of the exponential intercoalescent times.
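Written out, with the intercoalescent time $T_k$ while $k$ lineages remain distributed as $\mathrm{Exp}\big(k(k-1)/(2\eta_0)\big)$:

```latex
\mathbb{E}[T_{\mathrm{MRCA}}]
  = \sum_{k=2}^{n} \mathbb{E}[T_k]
  = \sum_{k=2}^{n} \frac{2\eta_0}{k(k-1)}
  = 2\eta_0\left(1 - \frac{1}{n}\right),
\qquad
\mathrm{Var}[T_{\mathrm{MRCA}}]
  = \sum_{k=2}^{n} \left(\frac{2\eta_0}{k(k-1)}\right)^{2}.
```

The telescoping sum for the expectation follows from $\frac{2}{k(k-1)} = \frac{2}{k-1} - \frac{2}{k}$, and independence of the $T_k$ makes the variances add.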