stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Switch to turn Jacobian on/off for optimization #2473

Closed bob-carpenter closed 1 year ago

bob-carpenter commented 6 years ago

Summary:

Add an option to the services to include or exclude the log Jacobian determinant of the unconstrained-to-constrained transform.

Description

Right now, we automatically disable the Jacobian for optimization, which produces posterior mode estimates on the constrained scale. This is the maximum a posteriori (MAP) estimate most users are going to expect.

It would be nice to also be able to compute posterior mode estimates on the unconstrained scale. The primary motivation is Laplace approximations on the unconstrained scale, which can be sampled from and then mapped back to the constrained scale.
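
In symbols, the intended workflow would be something like the following (a sketch; eta denotes the unconstrained parameters, theta = g(eta) the constrained ones, and J_g the Jacobian of g):

\hat{\eta} = \arg\max_{\eta} \left[ \log p(g(\eta) \mid y) + \log \left| \det J_g(\eta) \right| \right],
\qquad
\Sigma = \left[ -\nabla^2_{\eta} \left( \log p(g(\eta) \mid y) + \log \left| \det J_g(\eta) \right| \right) \right]^{-1}_{\eta = \hat{\eta}},

followed by drawing \eta^{(s)} \sim \mathrm{N}(\hat{\eta}, \Sigma) and reporting \theta^{(s)} = g(\eta^{(s)}).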

This will have to start by adding a boolean argument apply_jacobian to the service function lbfgs() in lbfgs.hpp for turning the Jacobian off or on. The default value should be false.

Then, the logic will have to be traced down through the optimizer, which is an instance of optimization::BFGSLineSearch<Model, optimization::LBFGSUpdate<> > (confusingly defined in bfgs.hpp, not in bfgs_linesearch.hpp).

From the line search, the flag has to get passed down to where the call to log_prob_propto is instantiated, which is now hard-coded as log_prob_propto<false>. That will have to be done through the class ModelAdaptor, which should get a template parameter for applying the Jacobian.

I'm not sure how high up that template parameter should go. At some point, the boolean variable in the services call will have to be used in a conditional to determine which version of log_prob_propto to call:

if (apply_jacobian)
  log_prob_propto<true>(...);
else
  log_prob_propto<false>(...);

because we can't use a dynamic variable as a template parameter.
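
Here is a minimal, self-contained sketch of that plumbing, using toy stand-in types rather than the actual Stan classes (ToyModel and this simplified ModelAdaptor are placeholders), just to show how the runtime flag meets the compile-time template parameter:

#include <cmath>
#include <iostream>

// Toy stand-in for the model class. The real model's
// log_prob_propto<Jacobian>(...) takes parameters on the unconstrained
// scale; here there is a single positive parameter theta = exp(eta) with
// unnormalized constrained-scale log density -theta, so the Jacobian
// adjustment for the transform is +eta.
struct ToyModel {
  template <bool Jacobian>
  double log_prob_propto(double eta) const {
    double lp = -std::exp(eta);
    if (Jacobian)
      lp += eta;
    return lp;
  }
};

// Stand-in for ModelAdaptor with the proposed extra template parameter, so
// that the inner log_prob_propto call stays a compile-time instantiation.
template <class Model, bool Jacobian>
struct ModelAdaptor {
  const Model& model_;
  explicit ModelAdaptor(const Model& model) : model_(model) {}
  double operator()(double eta) const {
    return model_.template log_prob_propto<Jacobian>(eta);
  }
};

// Stand-in for the services layer: the runtime apply_jacobian flag is
// branched on exactly once to pick which instantiation gets used.
template <class Model>
double objective(const Model& model, double eta, bool apply_jacobian) {
  if (apply_jacobian)
    return ModelAdaptor<Model, true>(model)(eta);
  return ModelAdaptor<Model, false>(model)(eta);
}

int main() {
  ToyModel model;
  std::cout << objective(model, 0.5, false) << " "
            << objective(model, 0.5, true) << "\n";
}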

Current Version:

v2.17.1

mbrubake commented 6 years ago

Adding the Jacobian here won't give MAP, because optimization is already doing MAP without the Jacobian. The current behavior would only be maximum likelihood if there were no priors specified in the model.

What exactly is the use case for this feature? I'm hard pressed to understand why someone would want this, as the results depend on exactly how we define the constraining transformations, which could be changed.

bob-carpenter commented 6 years ago

@avehtari requested this feature, so maybe he can comment.

It's a bit opaque to users what's going on now, so one thing we can do is provide better doc. I tried in a case study, but I think I need to try harder.

If I'm thinking about this right, what we have is:

  • no Jacobian: MAP for the constrained model
  • Jacobian: MAP for the unconstrained model

I agree that what most users are going to expect is what we have now. I'm guessing Aki may want the latter in order to do a Laplace approximation of the unconstrained posterior, draws from which would then be translated back to the constrained space as an approximate Bayesian method.
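
To make the difference concrete, here's a toy one-parameter illustration (not tied to any particular model): take sigma ~ lognormal(0, 1) with unconstrained parameter eta = log(sigma). Then

\text{no Jacobian:}\quad \hat{\sigma} = \arg\max_{\sigma > 0} \left[ -\log \sigma - \tfrac{1}{2} (\log \sigma)^2 \right] = e^{-1},

\text{with Jacobian:}\quad \hat{\eta} = \arg\max_{\eta} \left[ -\tfrac{1}{2} \eta^2 \right] = 0 \;\Rightarrow\; \sigma = e^{\hat{\eta}} = 1.

Same model, two different point estimates, depending on which scale the mode is taken on.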

avehtari commented 6 years ago
  • no Jacobian: MAP for the constrained model
  • Jacobian: MAP for the unconstrained model

Good summary, which would be good to have in the case study, too.

I agree that what most users are going to expect is what we have now.

I'm not the only one expecting the Jacobian. It seems to be what machine learning people expect.

I'm guessing Aki may want the latter in order to do a Laplace approximation of the unconstrained posterior, draws from which would then be translated back to the constrained space as an approximate Bayesian method.

Yes. The posterior is often closer to a normal distribution in the unconstrained space. This will be especially helpful when we get, for example, compound GLVM+Laplace functions. In these cases it's possible that marginal likelihood evaluations are costly, but the marginal posterior is low dimensional and close to normal in the unconstrained space. This is what INLA, GPstuff, etc. use.

mbrubake commented 6 years ago

Right, ok, I get this. My main concern was the MLE vs MAP framing in the original issue description. Perhaps the flag could be something about "optimize posterior in constrained space" vs "optimize posterior in unconstrained space".

I can see why some ML people would expect this behaviour, but I still think it's the wrong behaviour for most "normal" (i.e., non-ML) users so we want to make sure the language/description of the option is super clear.

bob-carpenter commented 6 years ago

Thanks, @mbrubake, I'll clarify in the issue---it was my sloppy language.

avehtari commented 6 years ago

I can see why some ML people would expect this behaviour, but I still think it's the wrong behaviour for most "normal" (i.e., non-ML) users so we want to make sure the language/description of the option is super clear.

I did mention INLA, and INLA is definitely non-ML (and full of normal :)

Perhaps the flag could be something about "optimize posterior in constrained space" vs "optimize posterior in unconstrained space".

This is a very good proposal.

ksvanhorn commented 6 years ago

Let me chime in here as another user. Sometimes I have to settle for MAP estimates for predictive tasks because I can't afford the computational burden of full MCMC, and VB is still experimental. I'm not trying to do a Laplace approximation, I just want a better point estimate for prediction. In this case I prefer a MAP estimate on the unconstrained space, since I assume that the posterior will be less skewed, and hence the MAP estimate will be more representative of the posterior.

bob-carpenter commented 6 years ago

Thanks for chiming in. It would be interesting to compare the differences from Andrew's perspective of taking MAP plus Hessian to be a kind of approximate Bayes. There are two different ways to do this (on the constrained or the unconstrained scale).

Stan doesn't give you an easy way of calculating the Hessian on the constrained space yet. That's another feature we'd like to be able to get to in order to provide traditional error approximations.

We'll definitely get to this feature.

avehtari commented 6 years ago

It would be interesting to compare the differences from Andrew's perspective of taking MAP plus Hessian to be a kind of approximate Bayes.

With the new distributional approximation diagnostics (https://arxiv.org/abs/1802.02538) we can also find out when the posterior is simple enough (normal enough) that MAP is good, and using PSIS we can then also correct the estimates. It's not going to be common that the normal approximation at the mode is good enough for IS, but in those specific cases the user would certainly be happy if they can be confident that there is no need to run MCMC for better accuracy.
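
As a toy illustration of that correction step (plain self-normalized importance sampling here, without the Pareto smoothing, and not Stan code), using a normal approximation at the mode as the proposal:

#include <cmath>
#include <iostream>
#include <random>

// Unnormalized log target on the unconstrained scale, chosen so the exact
// posterior is non-normal but the importance weights stay bounded.
double log_target(double eta) {
  return -0.5 * eta * eta - 0.25 * eta * eta * eta * eta;
}

int main() {
  // Laplace approximation at the mode eta = 0: the negative Hessian of
  // log_target there is 1, so the approximation is N(0, 1).
  const double mode = 0.0;
  const double approx_sd = 1.0;

  std::mt19937 rng(1234);
  std::normal_distribution<double> proposal(mode, approx_sd);

  const int S = 100000;
  double sum_w = 0.0, sum_w_eta2 = 0.0;
  for (int s = 0; s < S; ++s) {
    double eta = proposal(rng);
    // Log proposal density up to a constant (the constant cancels when the
    // weights are self-normalized below).
    double log_q = -0.5 * (eta - mode) * (eta - mode) / (approx_sd * approx_sd);
    double w = std::exp(log_target(eta) - log_q);
    sum_w += w;
    sum_w_eta2 += w * eta * eta;
  }

  std::cout << "E[eta^2] under the normal approximation: "
            << approx_sd * approx_sd << "\n"
            << "Importance-sampling corrected E[eta^2]:  "
            << sum_w_eta2 / sum_w << "\n";
}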

mbrubake commented 6 years ago

One consideration worth mentioning in this context is that the results of "MAP in the unconstrained space" will depend on exactly how we parameterize the constraints. That means that results of this mode could change if the parameterization was changed under the hood. This has already happened once when we changed how unit_vector was parameterized. It's probably not going to happen a lot, but it's not inconceivable.

bob-carpenter commented 6 years ago

MAP with the Jacobian on the unconstrained scale is easy to explain formally. I agree with Marcus that it's going to be a lot harder to get across conceptually. I was never really happy with the case study in which I tried to explain these things, but that's the kind of content I'd be aiming for in the manual, only done more clearly.

The unit_vector thing was a bug fix---it never worked properly until the most recent update from Ben.

WardBrian commented 1 year ago

Closed by https://github.com/stan-dev/stan/pull/3152