harrispopgen / mushi

[mu]tation [s]pectrum [h]istory [i]nference
https://harrispopgen.github.io/mushi/
MIT License

Markov random field priors #17

Closed: wsdewitt closed this issue 4 years ago

wsdewitt commented 5 years ago

With our current elastic net (L1 and L2) penalization we will not be able to recover histories that mix pulses and smooth features; we can only recover one or the other. This motivates thinking about more richly parameterized prior distributions on the history.

One approach is to use a Markov random field (MRF) prior. The idea is to penalize the L2 norm of the first differences, but with a different penalty weight lambda_i for each first difference i. The weight will be large where the curve is smooth, and small where it is pulsey.
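Concretely, the penalty term would look something like this (just a NumPy sketch; the function name and toy values are made up, not anything in the mushi code):

```python
import numpy as np

def weighted_l2_first_diff(z, lam):
    """Penalty sum_i lam_i * (z_{i+1} - z_i)^2 for a history z on the time grid."""
    dz = np.diff(z)            # first differences, length len(z) - 1
    return np.sum(lam * dz**2)

# toy example: small weight where we expect a pulse, large weight where we expect smoothness
z = np.array([1.0, 1.1, 1.2, 5.0, 5.1, 5.2])
lam = np.array([10.0, 10.0, 0.01, 10.0, 10.0])
print(weighted_l2_first_diff(z, lam))
```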

Of course, this doubles the number of parameters we need to infer (the history and the lambdas), but we shrink the lambdas by specifying a hyperprior under which they are drawn iid from a horseshoe distribution (a continuous analogue of the spike-and-slab). Adding to this a prior on the boundary value z_1, we have a fully specified prior.

Below is a probabilistic graphical model (in the notation of the manuscript) of what I have in mind. Another advantage of this model is that it is differentiable, so it would be amenable to fully Bayesian posterior inference using Hamiltonian Monte Carlo.

[attached scan of the graphical model]
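In case the scan is hard to read, the hierarchy is roughly the following (approximate notation; the manuscript's symbols and the exact horseshoe parameterization may differ):

```latex
% approximate notation; the manuscript's symbols and parameterization may differ
\begin{aligned}
  z_1 &\sim \pi_0
    && \text{prior on the boundary value} \\
  \lambda_i &\overset{\text{iid}}{\sim} \text{horseshoe},
    && i = 1, \dots, m - 1 \\
  z_{i+1} - z_i \mid \lambda_i &\sim \mathcal{N}\!\left(0, \lambda_i^{-1}\right)
    && \text{large } \lambda_i \Rightarrow \text{smooth},\ \text{small } \lambda_i \Rightarrow \text{pulse} \\
  \text{data} \mid z &\sim \text{coalescent likelihood}
\end{aligned}
```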

Andy Magee sent me some relevant papers on his horseshoe work with Vladimir and others:

- Coalescent MRF models: https://arxiv.org/pdf/1808.04401.pdf
- The first horseshoes-for-MRFs paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5942601/ This one has a nice figure that shows how a Laplace MRF behaves almost exactly like a Gaussian MRF, which is to say that in Bayesian models we can't really see an L1 vs L2 difference.

Skimming it, I think the first paper has most of what I was remembering Jim saying about how to translate penalized likelihood into Bayesian priors. Also, in the category of "results I don't think exist in publications but that might be useful": I've been told that a Cauchy MRF model often behaves rather like a horseshoe MRF. So if you're looking for a simpler functional form to penalize your local lambdas, perhaps you could use a Cauchy? Bonus: if x ~ Cauchy(0, s), then 1/x ~ Cauchy(0, 1/s), so this should work regardless of whether you use lambda or 1/lambda.
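That reciprocal property is easy to sanity-check numerically; a throwaway scipy check (nothing to do with the mushi code):

```python
import numpy as np
from scipy import stats

# if x ~ Cauchy(0, s), then 1/x ~ Cauchy(0, 1/s): quick Monte Carlo check
rng = np.random.default_rng(0)
s = 2.5
x = stats.cauchy(scale=s).rvs(size=200_000, random_state=rng)
ks = stats.kstest(1 / x, stats.cauchy(scale=1 / s).cdf)
print(ks.pvalue)  # a large p-value is consistent with 1/x ~ Cauchy(0, 1/s)
```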

wsdewitt commented 5 years ago

An example of this problem

[attached figure]

kamdh commented 5 years ago

1. "MRF" is weird terminology for me, but I'll go with it. To be clear, I'm going to say "L2 of first differences", "L1 of first differences", or "L2 of 2nd differences".
2. For the example figure, what happens if you do L2 on 1st differences? Or L2 on 2nd differences? Can you get the smooth part?
3. Going fully Bayesian is a fairly significant methodological complication to what we've planned. It moves us away from a more straightforward loss + regularization optimization approach, which is a nice place to be, especially with the linear forward model that we have. We can do some analysis there (SVD of the operator).
4. Maybe a simpler approach is to work on some non-convex sparsity penalties. There is a way to implement L0 on 1st differences (we can compute its prox, even though it is non-convex), which is a way to get truly piecewise constant trajectories. (L1 introduces some bias and can sometimes give you lots of small differences rather than one big difference.) I'm not entirely sure, but there may be a way to do L0 on 2nd differences. Other Lp "norms" for p < 1 are sometimes used too.
5. Really, the model you'd want, if you want to allow for both jumps and smoothness in between, is a sort of mixture prior on the differences from time point to time point: sometimes it jumps, but otherwise it's smooth. That would be something like a Gaussian + heavy-tailed mixture distribution, where the Gaussian has larger mixture weight but small variance, and the heavy-tailed component has smaller weight but large variance (see the sketch after this list). The only simulated data close to this is the "broken exponential" example in the arxiv paper you linked to, but that's a very specific form. If you look at their figure 3, you'll actually see that none of the methods fit it very well. We should dig more into the literature to see if there are regularizations that give these effects.
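Rough sketch of the mixture idea in 5), just to be concrete (the weights and scales are made up):

```python
import numpy as np
from scipy import stats

def mixture_neg_log_prior(dz, w=0.95, sigma_smooth=0.05, scale_jump=2.0):
    """-log prior of first differences dz_i ~ w * N(0, sigma_smooth^2) + (1 - w) * Cauchy(0, scale_jump)."""
    dens = (w * stats.norm.pdf(dz, scale=sigma_smooth)
            + (1 - w) * stats.cauchy.pdf(dz, scale=scale_jump))
    return -np.sum(np.log(dens))

# a history with mostly tiny steps and one jump is far more probable under the mixture
# than one that spreads the same total change evenly over all steps
z_jump = np.array([1.0, 1.02, 1.01, 4.0, 4.01, 4.02])
z_spread = np.linspace(1.0, 4.02, 6)
print(mixture_neg_log_prior(np.diff(z_jump)),    # smaller (more probable)
      mixture_neg_log_prior(np.diff(z_spread)))  # larger (less probable)
```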

To conclude, I think it's worth thinking about other priors, but I would prefer to stick to regularization-based approaches, at least for this paper. Doing MCMC adds a lot of complications compared to doing optimization like we are, especially convex optimization.

Especially since this is a whole new framework (mutation rate inference), we don't have to try and do "all the things".

wsdewitt commented 5 years ago

1) "MRF" terminology is from hierarchical modeling literature, so a prior on \mu(t), rather than our penalized likelihood setting 🤷‍♀ 2) Yeah if I trade L1 for L2 we do better on the smooth part and worse on the pulse. I haven't tried penalizing 2nd diffs. 3) Not suggesting we do Bayesian inference now. The MRF priors can motivate our likelihood penalization, and EM would be a natural choice for computing the maximum marginal likelihood of the data (integrating out the local scales). 4) Not sure I see L0+1st diff helping this particular issue, since we'd not want truly piecewise constant 5) The MRF prior on first differences, with a horseshoe hyperprior, does allow for "jumps & smoothness in between". Is your point that using a mixture instead of the horseshoe will do better? I also notice that the prior on 1st diffs conditioned on the local scales is gaussian, which is tantamount to local L2 penalties. I wonder if this could instead be done with local L1 penalties.

Another place to look might be trend filtering, which puts an L1 penalty on the second differences, resulting in piecewise linear solutions.
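For illustration, trend filtering on a toy denoising problem looks something like this (using cvxpy for convenience; this is not our forward model or solver):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
n = 100
t = np.linspace(0, 1, n)
truth = np.where(t < 0.5, 1 + 2 * t, 2 - 4 * (t - 0.5))  # piecewise linear with a kink
y = truth + 0.1 * rng.standard_normal(n)

D2 = np.diff(np.eye(n), n=2, axis=0)  # (n - 2) x n second-difference operator
beta = cp.Variable(n)
lam = 1.0
# L1 on second differences --> piecewise linear fit
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - beta) + lam * cp.norm1(D2 @ beta)))
problem.solve()
print(beta.value[:5])
```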

kamdh commented 5 years ago

> The MRF prior on first differences, with a horseshoe hyperprior, does allow for "jumps & smoothness in between"

Can you show me this working/point to a figure?

wsdewitt commented 5 years ago

I'd say the broken exponential in fig. 3 is performing well: the true curve is within the credible interval at all times, even if the posterior median smooths the corner a bit.

kamdh commented 5 years ago

Eh, the solid line is still smoothing out the bumps. I agree the error bars look better, but our method will just be giving us one curve. So smoothing out the bumps might be something we have to deal with.

I don't think we should be stressing about this too much.