Variational distributions for mutation ages

nspope commented 1 year ago

The "variational_gamma" algorithm produces gamma approximations for posterior distributions of node ages. It'd also be nice to have a way to get gamma approximations for posterior distributions of mutation ages. This would nicely encode the uncertainty in mutation age due to uncertainty in parent/child ages and the location of the mutation on the branch, in a single package.

It's pretty straightforward to do this: the typical EP target has the form,

$$p(t_i | \alpha_i, \beta_i) p(t_j | \alpha_j, \betaj) \ell(y{ij} |(t_i - tj) \mu{ij})$$

where $t_i, tj$ are parent/child ages; $\alpha, $\beta$ are variational parameters; and $y{ij}, \mu_{ij}$ are mutation count / mutation rate for the edge. Here $p(y | a, b)$ is a gamma PDF with shape/rate $a,b$; and $\ell(y | \lambda)$ is a Poisson PMF with rate $\lambda$.

Given this target, if a given mutation is equally likely to show up anywhere on the branch, then the distribution of its age $x$ is:

$$f(x | \alpha, \beta) \propto \int_0^\infty \int_0^{t_i} \frac{\mathbb{I}[t_j < x < t_i]}{t_i - t_j } p(t_i | \alpha_i, \beta_i) p(t_j | \alpha_j, \betaj) \ell(y{ij} | (t_i - tj) \mu{ij}) dt_j dt_i$$

It's fairly straightforward to calculate moments / logarithmic moments of $x$, and then do gamma moment matching to get the variational approximation (the moments are linear combinations of the ${_2}F_1$ used for the regular EP updates, with integer perturbations to arguments). This is done only once, when the regular EP routine has finished, and gives shape/rate parameters for each mutation.

As a proof of concept, I simulate 100 diploids on 1 Mb and use the true ARG for dating, then plot posterior mean of mutation age vs true mutation age:

Of course, this doesn't give a sense for how well the variational posterior encodes uncertainty. To get at that, I calculate posterior coverage across a range of quantiles; e.g. if I take the $0.25$ and $0.75$ quantiles for each variational posterior, the true age should lie within these quantiles approximately $50 \%$ of the time. If the variational posterior is a good match to the true posterior, then the coverage should be roughly 1-to-1 across all quantiles:

So, the approximation looks quite good in the case where the true ARG topology is known. (It deteriorates noticably for tsinfer'd ARGs -- trying to figure out why.) I see this being useful for two reasons: (1) people are often interested in mutation ages rather than ancestor ages; (2) for assessing the accuracy of dating, working with mutations is nice because they map straightforwardly between true/inferred ARGs and can be stratified by # of descendants.

nspope commented 1 year ago

Here's the same figures / simulation, but with tsinfer'd ARG -- notice the posterior coverage drops (the posterior is too concentrated). I'm a bit puzzled by this, but it could be due to polytomies -- when there's polytomies, the total edge area subtended by a node is greater than when trees are fully resolved, which will impact the posterior variance more than the mean.

nspope commented 1 year ago

but it could be due to polytomies

This doesn't actually seem to be the case -- downsampling edges in each tree to 2/parent doesn't really change anything (aside from a slight decrease in RMSE). It seems like there's just more error overall when using the tsinfer'd sequences, but the uncertainty estimates (variance of posterior) are generally of the same magnitude. Not sure this can be addressed by the dating algorithm alone.

hyanwong commented 1 year ago

I wonder how this would do with topologies inferred using Relate (and then dated using tsdate). Also, I wonder how it does with tsinferred tree sequence if unary nodes are kept in the tree sequence (or added using extend_edges)?

nspope commented 1 year ago

if unary nodes are kept in the tree sequence (or added using extend_edges)

That's a good question -- however, talking to @petrelharp, extend_edges likely won't place mutations that don't have ages, because otherwise there's no way to know where specific mutations fall above/below unary nodes on an edge. But maybe there's a multi-step procedure or augmented EP update that'd work.

hyanwong commented 1 year ago

Ah, true about extend_edges, but tsinfer will, I think, place mutations either above or below unary nodes, depending on the inferred ancestral haplotype. Actually I should check on this, as if it it's not doing that, it would explain why we sometimes do worse when tsdating unsimplified tsinfer output.

petrelharp commented 1 year ago

Ah - what information is there that tsinfer could be using to place mutations above or below unary nodes? I can't think of how it that would be identifiable - the mutation above/below the node corresponds to "did that ancestral haplotype carry that mutation"; but since it's unary, either way everyone who has inherited from that ancestral haplotype has got the mutation.

hyanwong commented 1 year ago

The ancestor building algorithm infers an ancestral haplotype, which could include a site at which a mutation has occurred, and that regions of haplotype could end up being unary in the resulting tree sequence. In other words, information about the haplotype state is ported from adjacent sites. I could imagine how this could result in a haplotype with a mutation above a unary node, although I admit that I'm struggling to come up with a simple example.

I'm pretty sure this does happen in complex inferences, though, especially where we allow mismatch (i.e. sequencing error). Would it be enlightening if I were to run some relatively simple inferences until I found an example?

nspope commented 1 year ago

Even if we can't place mutations precisely on edges separating unary nodes, these still carry information that can be used in the EP update. For example, if edge $i \rightarrow j$ is broken by unary node $k$, then the local EP target is,

$$\mathcal{G}a(t_i | \alpha_i, \beta_i) \times \mathcal{G}a(t_j | \alpha_j, \betaj) \times \mathcal{P}o(y{ij} | \lambda = (t_i - tj) \mu{ij}) \times \mathbb{I}[t_j < t_k < t_i] \mathcal{G}a(t_k | \alpha_k, \beta_k)$$

In words, the same Poisson likelihood is used as for the non-unary update (e.g. corresponding to the total edge length from non-unary $i$ to non-unary $j$), but we know that the unary node $k$ must be intermediate between $i$ and $j$ and we have some external information on the age of $k$ encoded in a gamma distribution.

Integrating this is tedious, but it ultimately turns into a hypergeometric series (of a more complicated form than the ${_2}F_1$ used in the typical case). This'll be a bit of work to put together but is probably the proper way to handle unary nodes.

petrelharp commented 1 year ago

I'm pretty sure this does happen in complex inferences, though, especially where we allow mismatch (i.e. sequencing error). Would it be enlightening if I were to run some relatively simple inferences until I found an example?

I don't doubt that tsinfer does put mutations on these haplotypes - this makes sense - I'm just wondering if we should assign any weight to the way in which it has done this? I think my argument above says that no, there isn't any possible way it coudl distinguish?

hyanwong commented 1 year ago

I don't doubt that tsinfer does put mutations on these haplotypes - this makes sense - I'm just wondering if we should assign any weight to the way in which it has done this? I think my argument above says that no, there isn't any possible way it coudl distinguish?

Aha. Yes, that is possible. I can't see a way that it could distinguish this either, and am struggling to come up with a counterexample, but I've been wrong on this sort of thing before. The interaction between the distribution of mutations at adjacent sites can be quite complex.

If we can prove that there is no information from adjacent sites that can be used to place the mutation either above or below the node, it would explain some of the problems we have with adding unary nodes (see #288 for context).

hyanwong commented 5 months ago

I think the original issue described here has been addressed nw, right @nspope ? Should we close this issue?

There is some discussion here about how to date mutations above locally unary nodes: I guess we could open another issue about that though?

nspope commented 5 months ago

Yup let's close -- I think unary nodes are off the table for now.

tskit-dev / tsdate

Variational distributions for mutation ages #312