pyro-ppl / pyro

Deep universal probabilistic programming with Python and PyTorch
http://pyro.ai
Apache License 2.0

[FR] Laplace family autoguides #1817

Open jamesonquinn opened 5 years ago

jamesonquinn commented 5 years ago

Requesting feature: "Laplace family autoguides"

The existing "AutoLaplaceApproximation" uses a delta guide to find the MLE of x, then creates a Laplace approximation around that MLE. But the resulting Laplace approximation is optimal in no way except its MLE. Either term of its ELBO can be arbitrarily bad.

The feature request I'm making here is for a "Laplace family autoguide". Say we have a model with posterior p(θ|x), where θ is the vector of model parameters and x is the data. The guide q_{φ,ψ} would be a multivariate normal with mean φ and precision set to the observed information of p(θ|x) at θ = φ, plus an adjustment parametrized by ψ to ensure that the precision is positive definite. That is:

q_{φ,ψ} = 𝒩[ φ, (ℐ_θ(φ)+M_ψ)⁻¹],

where

ℐ_θ(φ) := -H[log p(θ|x)] | evaluated at φ,

and M_ψ is a positive definite matrix such that (ℐ+M_ψ) is also positive definite. (The optimal choice of M_ψ is not obvious and would need to be left up to the user; I will discuss one possible choice briefly at the end of this feature request.)

In practice, this would mean that the guide has to make direct calls to the model. Something like:

    import pyro
    import pyro.distributions as dist
    from pyro import poutine
    from pyro.ops import hessian

    # Condition the model on the current mean estimate phi and compute the
    # negative log-joint; its Hessian at phi is the observed information.
    hessCenter = pyro.condition(model, phi_dict)
    trace1 = poutine.trace(hessCenter)
    trace2 = trace1.get_trace(*args, **kwargs)
    loss = -trace2.log_prob_sum()
    thetaParts = [v for k, v in phi_dict.items()]
    H = hessian.hessian(loss, thetaParts)
    M_matrix = calculateM(H, psi_scale_params)   # calculateM should be chosen by the user, as explained in appendix

    # Sample theta as a single auxiliary MVN site centered at phi with precision H + M.
    # (catToOneVector is a user helper that concatenates the parts into one vector.)
    thetaMean = catToOneVector(thetaParts)
    theta = pyro.sample('theta',
                        dist.MultivariateNormal(thetaMean, precision_matrix=H + M_matrix),
                        infer={'is_auxiliary': True})

(Note that this pseudocode is drawn from a case where the data is treated in an unusual way, so it may need to be adjusted to condition on data in the usual way.)

This is the basic idea. To make it more useful, there are several possible extensions: sequential sampling, amortization, and subsampling.

Appendix: making the precision matrix positive definite

In order to ensure that the covariance matrix of q is positive definite, we're adding the positive definite matrix M to the observed information ℐ. We construct M as a function of both ℐ and some set of parameters ψ.

One possible construction is to have one positive parameter ψ_i per dimension of ℐ, take M_ψ to be diagonal, and define:

M_ii := ψ_i · logsumexp[ 0, −ℐ_ii/ψ_i, −(|ℐ_ii| − Σ_{j≠i} |ℐ_ij|)/ψ_i ]

This construction ensures that ℐ+M_ψ has positive diagonal entries and is strictly diagonally dominant; these two conditions in turn ensure that it is positive definite. By including each ψ_i in the set of parameters over which the ELBO is maximized, we also allow the fitted variational posterior estimate q to have higher curvature than the observed information, even in cases where the observed information is in fact positive definite. This would deal with cases where the true posterior has high kurtosis and a Laplace approximation would therefore badly overestimate the tail probability.
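
As a rough illustration, here is a minimal sketch of this construction, assuming the observed information is available as a square tensor I_mat and ψ as a positive vector psi (the function name and signature are illustrative; this is the role calculateM plays in the pseudocode above):

    import torch

    def calculate_M(I_mat, psi):
        # Diagonal M whose entries softly dominate max(0, -I_ii, sum_{j!=i}|I_ij| - |I_ii|),
        # aiming (as described above) to give I_mat + M a positive diagonal and make it
        # strictly diagonally dominant, hence positive definite.
        diag = I_mat.diagonal()
        off_diag_sums = I_mat.abs().sum(-1) - diag.abs()   # sum_{j != i} |I_ij|
        candidates = torch.stack([torch.zeros_like(diag),
                                  -diag / psi,
                                  -(diag.abs() - off_diag_sums) / psi])
        m_diag = psi * torch.logsumexp(candidates, dim=0)
        return torch.diag(m_diag)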

It may be possible to improve this construction by using a single ψ parameter for multiple "similar" dimensions of ℐ, but discussing this possibility is beyond the scope of this feature request. Constructions using other criteria for positive definiteness are also possible and we hope will be a subject of further research.

fritzo commented 5 years ago

@fehiepsi Long ago we discussed a few different Laplace autoguides, then created AutoLaplace. Do you remember whether we prototyped anything like @jamesonquinn's suggestion?

jamesonquinn commented 5 years ago

One important difference between my suggestion and AutoLaplace that may not have been evident from the above is that the fitted φ is not necessarily a (local) MAP, but rather a value that (locally) maximizes the ELBO for the guide as a whole. This is especially important for things like variance hyperparameters, which often have a pathological MAP at 0 (leading to an ELBO with infinite "energy" [first term] but also negative-infinity "entropy" [second term]).

This makes using the term "Laplace" a bit deceptive for my proposal, since it's not actually a valid Laplace approximation unless φ is a local MAP, though in most cases it will be close to being one.
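
As a hypothetical illustration of the variance-hyperparameter pathology mentioned above (not code from this discussion): in a small hierarchical model, the log-joint diverges to +∞ as the group-level scale goes to 0 with the group effects at 0, so a delta/MAP fit can collapse onto scale 0 even though negligible posterior mass sits there.

    import pyro
    import pyro.distributions as dist

    def model(data):
        # Group-level scale hyperparameter; the log-joint -> +inf as tau -> 0
        # with all z -> 0, which is exactly the pathological MAP at zero.
        tau = pyro.sample("tau", dist.HalfCauchy(1.0))
        with pyro.plate("groups", data.shape[0]):
            z = pyro.sample("z", dist.Normal(0.0, tau))
            pyro.sample("obs", dist.Normal(z, 1.0), obs=data)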

fehiepsi commented 5 years ago

@fritzo To my knowledge, the Laplace approximation is the MVN guide generated at the MAP point. Previously, we discussed whether we should do the Laplace approximation in constrained or unconstrained space and finally decided to go with unconstrained space so as to generate valid samples.

@jamesonquinn I guess you want to propose a method to make the Laplace approximation stable (to deal with the case where the Hessian is singular)? In that case, I think there are better choices than the Laplace approximation. For example, AutoMVN is one of them, but it might be slow for high-dimensional problems. If you are working with a high-dimensional problem, then AutoLowRankMVN is a pretty good choice in my opinion: it has O(kN) parameters, while AutoMVN has O(N^2). What do you think?
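
For reference, a rough sketch of how the two guides being compared are typically instantiated (the import path is an assumption and differs across Pyro versions, e.g. pyro.contrib.autoguide in older releases; model stands for the user's model and rank=10 is an arbitrary choice):

    from pyro.infer.autoguide import (AutoMultivariateNormal,
                                      AutoLowRankMultivariateNormal)

    # Full-covariance guide: learns an N x N scale_tril, O(N^2) parameters.
    full_guide = AutoMultivariateNormal(model)

    # Low-rank guide: learns a rank-k factor plus a diagonal, O(kN) parameters.
    low_rank_guide = AutoLowRankMultivariateNormal(model, rank=10)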

jamesonquinn commented 5 years ago

To be clear: I'm coding this myself for my specific use case, so the request for an autoguide is more as something I think would be generally useful for others than for my own use.

You say "(to deal with the case the hessian is singular)?". I'd say that though it does deal with problems in the Hessian, that's not the main advantage of my suggestion. The advantages I'd foreground are:

That said, at least for my own case, I think this will work better than any of the existing autoguide options. Here's why, in each case:

AutoLaplace: since this works in two separate steps (first find the MAP using a simple SVI procedure, then build a Laplace approximation around it), there is no guarantee that the resulting distribution is optimal even in the weakest sense (that is, locally, compared to some parametric family of alternatives). For example, variance hyperparameters will have a pathological MAP at zero, and that could be an attractor for AutoLaplace, while my proposal would avoid it (the pathological MAP would have bad entropy and thus not maximize the ELBO).

AutoDiagonalNormal: This would not account for covariances induced by the data, a serious issue with the problem domain we're using it for.

AutoMVN: Impractical in our case because of dimensionality.

AutoLowRankMVN: Though I don't fully understand how this works, my initial impression is that it's basically a compromise between the above two options, and that there's no more reason to expect it to be the best of both worlds than the worst of both. In contrast, the Laplace family idea I've posted above will naturally create a guide tailored to the problem space, with a low number of parameters.

AutoDelta and AutoDiscreteParallel are obviously inappropriate here.

fehiepsi commented 5 years ago

To my knowledge, the Laplace approximation is fast because it just does MAP inference and computes scale_tril from the Hessian once before sampling, so the sampling will also be fast. But I know that the Laplace approximation has singularity issues when the parameters are correlated. In addition, the Laplace approximation only looks at the second derivative at the MAP point, so it can't capture the covariance of the whole posterior. To get a better approximation of the posterior, we can use AutoMVN, but AutoMVN is slow because a lot of parameters are involved. It would be interesting to see how your Laplace autoguide works and resolves those issues. I guess your M_matrix has some special features which make sampling theta (and maybe log_prob computation too) fast even though its distribution is an MVN, so inference could be faster than AutoMVN. Could you please make a PR that includes your proposed ideas so we can have a closer look?

jamesonquinn commented 5 years ago

the Laplace approximation only looks at the second derivative at the MAP point, so it can't capture the covariance of the whole posterior

While my proposal would center on a maximum-ELBO point and not a MAP point, you're right that it wouldn't capture the covariance of the whole posterior. It's intended for cases where the posterior is probably largely unimodal, and the local curvature at the maximum-ELBO point is probably a good estimate of the precision matrix of the posterior. In other words, it MAY work in cases where the model is generally made up of exponential family distributions; but it probably will NOT work well in cases where the model includes mixture distributions, neural nets, or other such complications.

Could you please make a PR that includes your proposed ideas so we can have a closer look?

You mean, as an example? Doing it as an autoguide would be beyond my abilities right now.

fehiepsi commented 5 years ago

You mean, as an example?

Yup, if possible. ^^ I don't think you have to make an autoguide implementation. Adding an example is already great! That would make it easier for us to incorporate your ideas into an implementation in the autoguide module.

I think other devs may have a better understanding of this FR, so I'd like to hear more opinions too.

fritzo commented 5 years ago

@fehiepsi I think I understand @jamesonquinn's proposal pretty well (we discussed offline); let me sketch this up quick to save @jamesonquinn some time...

fritzo commented 5 years ago

OK, so if I understand correctly, the idea is to use a user-provided covariance/precision/scale_tril estimate. Since we eventually need a Cholesky decomposition, let's assume WLOG that the user can provide a scale_tril estimate. Then we should be able to implement this feature with a lightweight subclass of AutoMultivariateNormal:

import torch
import pyro
import pyro.distributions as dist
from torch.distributions import biject_to
from pyro.infer.autoguide import AutoMultivariateNormal  # pyro.contrib.autoguide in older versions

class AutoMultivariateNormalKnownCovariance(AutoMultivariateNormal):
    def __init__(self, model, scale_tril_fn):
        super(AutoMultivariateNormalKnownCovariance, self).__init__(model)
        self.scale_tril_fn = scale_tril_fn

    def get_posterior(self, *args, **kwargs):
        # Learnable mean in the unconstrained space.
        loc = pyro.param("{}_loc".format(self.prefix),
                         lambda: torch.zeros(self.latent_dim))
        # Map the unconstrained mean to constrained values at each sample site.
        values = {site["name"]: biject_to(site["fn"].support)(unconstrained_value)
                  for site, unconstrained_value in self._unpack_latent(loc)}
        # User-provided scale_tril estimate evaluated at the current mean.
        scale_tril = self.scale_tril_fn(**values)
        # TODO transform this via jacobian(values), possibly using torch.autograd.grad
        return dist.MultivariateNormal(loc, scale_tril=scale_tril)

The remaining TODO concerns the transform between constrained and unconstrained spaces. I think this shouldn't be too difficult, since we have (x,y) pairs in the values dict and we can simply run torch.autograd.grad on those to construct a Jacobian.
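
For what it's worth, a rough sketch of building such a Jacobian with torch.autograd.grad, assuming a single constrained tensor y computed from the unconstrained loc x (this helper is hypothetical, not part of Pyro's API):

    import torch

    def jacobian(y, x):
        # Build the Jacobian dy/dx row by row with torch.autograd.grad;
        # y must have been computed from x inside the autograd graph.
        rows = []
        for y_i in y.reshape(-1):
            (grad_i,) = torch.autograd.grad(y_i, x, retain_graph=True)
            rows.append(grad_i.reshape(-1))
        return torch.stack(rows)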

jamesonquinn commented 5 years ago

That pretty much covers the basics, but I have a few caveats.

fritzo commented 5 years ago

scale_tril_fn needs access to the model

I guess you could use either functools.partial(fn, model) or define the model as a class with a scale_tril_fn method and then pass model.scale_tril_fn to the autoguide.
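
A hypothetical sketch of both options, assuming the AutoMultivariateNormalKnownCovariance sketch above; all names and the placeholder bodies are illustrative:

    import functools

    # Option 1: bind the model to a free function with functools.partial
    # (assumes `model` is already defined elsewhere).
    def scale_tril_from_model(model, **values):
        ...  # compute the observed information at `values`, return its Cholesky factor

    guide = AutoMultivariateNormalKnownCovariance(
        model, functools.partial(scale_tril_from_model, model))

    # Option 2: make the model a class that exposes a scale_tril_fn method.
    class MyModel:
        def __call__(self, *args, **kwargs):
            ...  # pyro.sample statements go here

        def scale_tril_fn(self, **values):
            ...  # same computation, with access to the model via self

    model = MyModel()
    guide = AutoMultivariateNormalKnownCovariance(model, model.scale_tril_fn)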

jamesonquinn commented 5 years ago

I'm trying to make a simple demo, and I got an error. The somewhat-stripped-down version of the code that gives the error is at https://forum.pyro.ai/t/multiple-sample-sites-error-help/921