jamesonquinn opened this issue 5 years ago
@fehiepsi Long ago we discussed a few different Laplace autoguides, then created AutoLaplace. Do you remember whether we prototyped anything like @jamesonquinn's suggestion?
One important difference between my suggestion and AutoLaplace that may not have been evident from the above is that the fitted φ is not necessarily a (local) MAP, but rather a value that (locally) maximizes the ELBO for the guide as a whole. This is especially important for things like variance hyperparameters, which often have a pathological MAP at 0 (leading to an ELBO with infinite "energy" [first term] but also negative-infinite "entropy" [second term]).
This makes using the term "Laplace" a bit deceptive for my proposal, since it's not actually a valid Laplace approximation unless φ is a local MAP, though in most cases it will be close to being one.
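As a concrete (hypothetical) illustration of the variance-hyperparameter point above, consider a toy hierarchical Pyro model whose joint density is unbounded as the group scale goes to 0: a Delta/MAP fit can be pulled toward that degenerate point, while a guide with nonzero spread pays an entropy penalty there, so a maximum-ELBO fit is not drawn to it in the same way. The model and names below are illustrative only, not taken from this issue.

```python
import torch
import pyro
import pyro.distributions as dist

# Toy example: the joint density p(tau, mu, y) is unbounded as tau -> 0 with
# mu -> 0, so the joint "MAP" of (tau, mu) sits at the degenerate point tau = 0,
# even though the posterior itself is well behaved.
def model(y):
    tau = pyro.sample("tau", dist.HalfCauchy(1.0))        # variance-like hyperparameter
    with pyro.plate("groups", y.shape[0]):
        mu = pyro.sample("mu", dist.Normal(0.0, tau))      # group effects
        pyro.sample("obs", dist.Normal(mu, 1.0), obs=y)
```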
@fritzo To my knowledge, the Laplace approximation is the MVN guide generated at the MAP point. Previously, we discussed whether we should do the Laplace approximation in constrained or unconstrained space, and finally decided that we should go with unconstrained space to generate valid samples.
@jamesonquinn I guess you want to propose a method to make the Laplace approximation stable (to deal with the case where the Hessian is singular)? In that case, I think there are some other, better choices than the Laplace approximation. For example, AutoMVN is one of them, but it might be slow for high-dimensional problems. If you are working with a high-dimensional problem, then AutoLowRankMVN is a pretty good choice in my opinion. Its number of parameters is O(kN), while AutoMVN has O(N^2) parameters. What do you think?
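For reference, a minimal usage sketch of the two guides mentioned above, assuming the current pyro.infer.autoguide class names (the shorthand names in this thread differ) and a placeholder model:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer.autoguide import AutoMultivariateNormal, AutoLowRankMultivariateNormal

def model():
    # placeholder model with an N = 100 dimensional latent
    pyro.sample("z", dist.Normal(torch.zeros(100), 1.0).to_event(1))

# The full-rank guide learns an N x N scale_tril (O(N^2) parameters); the
# low-rank guide learns an N x k factor plus a diagonal (O(kN) parameters).
full_guide = AutoMultivariateNormal(model)                       # "AutoMVN"
low_rank_guide = AutoLowRankMultivariateNormal(model, rank=5)    # "AutoLowRankMVN", k = 5
```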
To be clear: I'm coding this myself for my specific use case, so the request for an autoguide is more as something I think would be generally useful for others than for my own use.
You say "(to deal with the case where the Hessian is singular)". I'd say that though it does deal with problems in the Hessian, that's not the main advantage of my suggestion. The advantages I'd foreground are:
That said, at least for my own case, I think this will work better than any of the existing autoguide options. Here's why, in each case:
AutoLaplace: since this is two separate steps (first find MAP using a simple SVI procedure, then find Laplace approximation around that), there is no guarantee that the resulting distribution is optimal in even the weakest sense (that is, locally, as compared to some parametric family of alternatives). For example, variance hyperparameters will have a pathological MAP at zero, and that could be an attractor for AutoLaplace, while my proposal would avoid that (the pathological MAP would have bad entropy and thus not be a maximum for the ELBO).
AutoDiagonalNormal: This would not account for covariances induced by the data, a serious issue with the problem domain we're using it for.
AutoMVN: Impractical in our case because of dimensionality.
AutoLowRankMVN: Though I don't fully understand how this works, my initial impression is that it's basically a kind of compromise between the above two options, and that there's no more reason to expect it to be the best of both worlds than the worst of both. In contrast, the Laplace family idea I've posted above will naturally create a guide tailored to the problem space, with a low number of parameters.
AutoDelta and AutoDiscreteParallel are obviously inappropriate here.
To my knowledge, the Laplace approximation is fast because it just does MLE inference and computes scale_tril from the Hessian once before sampling, so sampling is also fast. But I know that the Laplace approximation has singularity issues when the parameters are correlated. In addition, the Laplace approximation just looks at the second derivative at the MAP point, so it can't capture the covariance of the whole posterior. So to get a better approximation of the posterior, we can use AutoMVN. But AutoMVN is slow because a lot of parameters are involved. It would be interesting to see how your Laplace autoguide works and resolves those issues. I guess your M_matrix has some special features which make sampling theta (and maybe log_prob computation too) fast despite its distribution being an MVN, so the inference can be faster than AutoMVN. Could you please make a PR which includes your proposed ideas so we can have a closer look at your ideas?
the Laplace approximation just looks at the second derivative at the MAP point, so it can't capture the covariance of the whole posterior
While my proposal would center on a maximum-ELBO point and not a MAP point, you're right that it wouldn't capture the covariance of the whole posterior. It's intended for cases where the posterior is probably largely unimodal, and the local curvature at the maximum-ELBO point is probably a good estimate of the precision matrix of the posterior. In other words, it MAY work in cases where the model is generally made up of exponential family distributions; but it probably will NOT work well in cases where the model includes mixture distributions, neural nets, or other such complications.
Could you please make a PR which includes your proposed ideas so we can have a closer look at your ideas?
You mean, as an example? Doing it as an autoguide would be beyond my abilities right now.
You mean, as an example?
Yup, if possible. ^^ I think you don't have to make an autoguide implementation; adding an example is already great! That would make it easier for us to incorporate your ideas into an implementation in the autoguide module.
I think other devs will have a better understanding of this FR, so I also want to hear more opinions.
@fehiepsi I think I understand @jamesonquinn's proposal pretty well (we discussed offline); let me sketch this up quick to save @jamesonquinn some time...
OK, so if I understand correctly, the idea is to use a user-provided covariance/precision/scale_tril estimate. Since we eventually need a Cholesky decomposition, let's assume wlog that the user can provide a scale_tril estimate. Then we should be able to implement this feature with a lightweight subclass of AutoMultivariateNormal:
```python
import torch
from torch.distributions import biject_to
import pyro
import pyro.distributions as dist
# import path may vary by Pyro version (older releases use pyro.contrib.autoguide)
from pyro.infer.autoguide import AutoMultivariateNormal

class AutoMultivariateNormalKnownCovariance(AutoMultivariateNormal):
    def __init__(self, model, scale_tril_fn):
        super(AutoMultivariateNormalKnownCovariance, self).__init__(model)
        self.scale_tril_fn = scale_tril_fn

    def get_posterior(self, *args, **kwargs):
        loc = pyro.param("{}_loc".format(self.prefix),
                         lambda: torch.zeros(self.latent_dim))
        # constrained value of each latent site, keyed by site name
        values = {site["name"]: biject_to(site["fn"].support)(unconstrained_value)
                  for site, unconstrained_value in self._unpack_latent(loc)}
        scale_tril = self.scale_tril_fn(**values)
        # TODO transform this via jacobian(values), possibly using torch.autograd.grad
        return dist.MultivariateNormal(loc, scale_tril=scale_tril)
```
The remaining TODO concerns the transform between constrained and unconstrained spaces. I think this shouldn't be too difficult, since we have (x, y) pairs in the values dict and we can simply run torch.autograd.grad on those to construct a Jacobian.
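A minimal sketch of that Jacobian construction; the helper name and the row-by-row loop are mine, not part of the existing autoguide API:

```python
import torch

def jacobian(y, x):
    """Build dy/dx one row at a time with torch.autograd.grad.

    Assumes y was computed from x, a tensor with requires_grad=True.
    Illustrative only; a real implementation might vectorize this.
    """
    y_flat = y.reshape(-1)
    rows = []
    for i in range(y_flat.numel()):
        (row,) = torch.autograd.grad(y_flat[i], x, retain_graph=True, create_graph=True)
        rows.append(row.reshape(-1))
    return torch.stack(rows)

# e.g. for one latent site: x = unconstrained slice of loc, y = biject_to(support)(x)
```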
That pretty much covers the basics, but I have a few caveats.
scale_tril_fn needs access to the model. Are you imagining that the function passed in would have a self argument first and would get the model via self.model, or should you add the model as an explicit parameter when calling the function?
scale_tril_fn needs access to the model
I guess you could use either functools.partial(fn, model) or define the model as a class with a scale_tril_fn method and then pass model.scale_tril_fn to the autoguide.
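For example, using the functools.partial approach together with the subclass sketched above (the scale_tril_fn body here is a hypothetical stub):

```python
import functools
import torch

def scale_tril_fn(model, **values):
    # hypothetical user-supplied estimate: compute a Cholesky factor of the
    # desired covariance from the model and the constrained latent values
    dim = sum(v.numel() for v in values.values())
    return torch.eye(dim)  # stub

# model and AutoMultivariateNormalKnownCovariance as defined earlier in this thread
guide = AutoMultivariateNormalKnownCovariance(model, functools.partial(scale_tril_fn, model))
```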
I'm trying to make a simple demo, and I got an error. The somewhat-stripped-down version of the code that gives the error is at https://forum.pyro.ai/t/multiple-sample-sites-error-help/921
Requesting feature: "Laplace family autoguides"
The existing "AutoLaplaceApproximation" uses a Delta guide to find the MAP of θ, then creates a Laplace approximation around that point. But the resulting Laplace approximation is not optimal in any sense beyond being centered at that point estimate; either term of its ELBO can be arbitrarily bad.
The feature request I'm making here is for a "Laplace family autoguide". Say we have a model p(θ|x), where θ is the vector of model parameters and x is the data. The guide q_{φ,ψ} would be a multivariate normal with mean φ and precision set to be the observed information of p(θ|x) at θ = φ, with an adjustment parametrized by ψ to ensure that it is positive definite. That is:
q_{φ,ψ} = 𝒩[ φ, (ℐ_θ(φ)+M_ψ)⁻¹],
where
ℐ_θ(φ) := -H[log p(θ|x)] | evaluated at φ,
and M_ψ is a positive definite matrix such that (ℐ+M_ψ) is also positive definite. (The optimal choice of M_ψ is not obvious and would need to be left up to the user; I will discuss one possible choice briefly at the end of this feature request.)
In practice, this would mean that the guide has to make direct calls to the model. Something like:
(Note that this pseudocode is drawn from a case where data is treated in an unusual way, so may need to be adjusted to condition on data normally.)
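A minimal sketch of such a guide under simplifying assumptions: a single unconstrained latent vector theta of dimension D, a user-supplied log_joint(theta, x) wrapping the model's log density, and a plain diagonal correction standing in for the M_ψ construction from the appendix. The function and argument names are mine.

```python
import torch
import pyro
import pyro.distributions as dist
from torch.autograd.functional import hessian

def laplace_family_guide(x, log_joint, D):
    phi = pyro.param("phi", torch.zeros(D))
    psi = pyro.param("psi", torch.ones(D), constraint=dist.constraints.positive)

    # observed information at phi: negative Hessian of the model's log density
    # (detached, so the curvature itself is not differentiated through)
    info = -hessian(lambda t: log_joint(t, x), phi.detach())

    # simple positive correction; the appendix proposes a logsumexp construction instead
    precision = info + torch.diag(psi)

    return pyro.sample("theta", dist.MultivariateNormal(phi, precision_matrix=precision))
```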
This is the basic idea. To make it more useful, there are several possible extensions: sequential sampling, amortizing, and subsampling.
Appendix: making the precision matrix positive definite
In order to ensure that the covariance matrix of q is positive definite, we're adding the positive definite matrix M to the observed information ℐ. We construct M as a function of both ℐ and some set of parameters ψ.
One possible construction is to have one positive parameter ψ_i per dimension of ℐ, then define:
M_ii := ψ_i * logsumexp[0, -ℐ_ii/ψ_i, -(abs(ℐ_ii) - Σ_{j≠i} abs(ℐ_ij))/ψ_i]
This construction ensures that ℐ+M_ψ has positive diagonal entries and is strongly diagonally dominant; these two conditions in turn ensure that it is positive definite. By including each ψ_i in the set of parameters over which the ELBO is maximized, we also allow the fitted variational posterior estimate q to have higher curvature than the observed information, even in cases where the observed information is in fact positive definite. This would deal with cases where the true posterior has high kurtosis and a Laplace approximation would thus badly overestimate the tail probability.
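A small sketch of this construction; the function name and vectorized form are mine:

```python
import torch

def diag_correction(info, psi):
    """M_psi from the construction above: info is the observed information
    (a square matrix), psi a vector of positive per-dimension parameters."""
    diag = torch.diagonal(info)
    off_diag_abs = info.abs().sum(-1) - diag.abs()   # sum_{j != i} |I_ij|
    args = torch.stack([
        torch.zeros_like(diag),
        -diag / psi,
        -(diag.abs() - off_diag_abs) / psi,
    ])
    m_diag = psi * torch.logsumexp(args, dim=0)      # smooth max of the three terms
    return torch.diag(m_diag)                        # M_psi is diagonal
```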
It may be possible to improve this construction by using a single ψ parameter for multiple "similar" dimensions of ℐ, but discussing this possibility is beyond the scope of this feature request. Constructions using other criteria for positive definiteness are also possible and we hope will be a subject of further research.