stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Embedded Laplace approximation #755

Closed charlesm93 closed 5 months ago

charlesm93 commented 6 years ago

Summary:

Want to create a Laplace approximation for the following distributions:

Description:

Doing a Laplace approximation with the current algebraic solver (which uses a dogleg solver) is prohibitively slow. Instead we could use a Newton line-search method, coupled with some analytical derivatives. This line-search method would be specialized and therefore not exposed to the Stan language, but the approximator would be. Aki and Dan have a prototype in Stan, which uses autodiff. The C++ implementation would naturally be more optimized.
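To make the idea concrete, here is a generic sketch (my illustration, not the specialized solver proposed here) of a Newton iteration with backtracking line search for the mode of log p(y | theta) - 0.5 * theta' K^{-1} theta. The callables grad_log_lik and neg_hessian are hypothetical stand-ins for the analytical derivatives mentioned above.

#include <Eigen/Dense>
#include <functional>

Eigen::VectorXd newton_mode(
    const Eigen::MatrixXd& K,
    const std::function<double(const Eigen::VectorXd&)>& log_lik,
    const std::function<Eigen::VectorXd(const Eigen::VectorXd&)>& grad_log_lik,
    const std::function<Eigen::MatrixXd(const Eigen::VectorXd&)>& neg_hessian,
    Eigen::VectorXd theta, int max_iter = 100, double tol = 1e-8) {
  Eigen::LLT<Eigen::MatrixXd> K_llt(K);
  // Objective: log p(y | theta) - 0.5 * theta' K^{-1} theta (Gaussian prior).
  auto objective = [&](const Eigen::VectorXd& t) {
    Eigen::VectorXd Kinv_t = K_llt.solve(t);
    return log_lik(t) - 0.5 * t.dot(Kinv_t);
  };
  for (int i = 0; i < max_iter; ++i) {
    Eigen::VectorXd grad = grad_log_lik(theta) - K_llt.solve(theta);
    // Newton direction: (W + K^{-1})^{-1} grad, with W the negative Hessian of log p(y | theta).
    Eigen::MatrixXd H = neg_hessian(theta) + K.inverse();
    Eigen::VectorXd step = H.llt().solve(grad);
    // Backtracking line search: halve the step until the objective improves.
    double alpha = 1.0;
    const double obj_old = objective(theta);
    while (alpha > 1e-8 && objective(theta + alpha * step) < obj_old)
      alpha *= 0.5;
    theta += alpha * step;
    if ((alpha * step).norm() < tol)
      break;
  }
  return theta;
}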

Current Version:

v2.17.0

bob-carpenter commented 6 years ago

Is the only use case you anticipate through the Stan language? Is the idea to have approximate distributions, like poisson_approx or something?

This issue eventually needs to specify what the signatures are of any functions that are being proposed for the math library.

@charlesm93: I assigned this to you for v3.0.0 (that's where I'm putting things we definitely want to do but not necessarily for the next release). It helps to add labels to classify the issue.

avehtari commented 6 years ago

This is part of the path towards INLA speed for Gaussian latent variable models in Stan. The anticipated first use case is compound functions for GLMs with Laplace integration over the weights. Issue #493 describes the compound function

y ~ logistic_regression(x, beta);

where speed-up is obtained with analytic gradients. With Laplace approximation it would be possible to add something like

y ~ logistic_regression_la(x, sigma);

which would assume a Gaussian prior on beta with sd sigma, and would use a Laplace approximation to integrate over beta. Instead of logistic, the first ones to implement would be normal, Poisson, binomial, and negative binomial likelihoods. Eventually there will be different structured Gaussian priors, such as hierarchical, Markov random field, and Gaussian process priors. Ping @dpsimpson

wds15 commented 6 years ago

@avehtari this sounds great... shouldn't we have a different sigma for the intercept and for the other regressors? I would appreciate that, as I'm not fond of data-driven priors, which are essentially what you end up with if I can only give a single sigma and hence have to transform my data. Maybe it's too early for these comments, but they're relevant from my perspective.

avehtari commented 6 years ago

shouldn't we have a different sigma for the intercept and all other regressors?

Sure. I didn't say that sigma would be a scalar. We'll make a more detailed proposal soon.

SteveBronder commented 4 years ago

Copy-pasted from an email:

One other little thing if you want to squeeze out the last little bits of speed. Note that Stan Math functions always return an actual matrix, whereas one of the main benefits of Eigen is composing multiple expressions together so that at compile time they are written out very efficiently. If you are working with matrices of doubles, you should use Eigen methods for faster code. Here's a trick you can do with things like third_diff():

template <typename T>
Eigen::Matrix<T, Eigen::Dynamic, 1>
third_diff(const Eigen::Matrix<T, Eigen::Dynamic, 1>& theta) const {
  // Evaluated temporaries: each is reused several times below.
  Eigen::VectorXd exp_theta = exp(theta);
  Eigen::VectorXd one = Eigen::VectorXd::Ones(theta.size());
  // Note the auto type!! These stay as lazy Eigen expressions.
  auto numerator = exp_theta.cwiseProduct(exp_theta - one);
  auto denominator = (one + exp_theta).array().sqrt().matrix().cwiseProduct(one + exp_theta);
  // The expressions are fused into a single pass when the result is built here.
  return n_samples_.cwiseProduct((numerator.array() / denominator.array()).matrix());
}

So the main bit here is that numerator and denominator being auto means they are a wacky expression type like ColWiseProd<CWiseSubtract<Matrix, Matrix>, Matrix> that is not actually evaluated at that line. So when you pass those into the return line, the expression for the result is generated in full. (You could also just slap the code for numerator and denominator into the return line, but then it starts becoming a bit of a mouthful.) But you should only do this in places where you use the expression once; otherwise it's better to evaluate it into a temporary, like you are currently doing with one. Here's an example that shows, on the rhs, the weird expression types that numerator, denominator, and the return value get:

https://godbolt.org/z/7bE5n1
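For what it's worth, here is a stripped-down sketch of the same trade-off (my illustration, not code from the branch): the evaluated ArrayXd temporary is reused, while the auto expressions are each used exactly once and get fused on return.

#include <Eigen/Dense>

Eigen::VectorXd fused_once(const Eigen::VectorXd& theta) {
  // Evaluated into a temporary because it is reused several times below.
  Eigen::ArrayXd exp_theta = theta.array().exp();
  // Lazy expressions: each is used exactly once, so nothing is materialized here.
  auto numerator = exp_theta * (exp_theta - 1.0);
  auto denominator = (1.0 + exp_theta).sqrt() * (1.0 + exp_theta);
  // One fused loop evaluates the whole ratio directly into the returned vector.
  return (numerator / denominator).matrix();
}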

SteveBronder commented 4 years ago

Also, writing the Laplace vari using reverse_pass_callback() and var<matrix> for the covariance should be very efficient:

https://github.com/stan-dev/math/blob/try-laplace_approximation2/stan/math/laplace/laplace_pseudo_target.hpp#L29
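For reference, a minimal sketch of that pattern (my illustration, not the actual laplace_pseudo_target code): compute the value with plain double arithmetic, then register a callback that propagates the adjoint back into the var<matrix> during the reverse pass.

#include <stan/math/rev.hpp>

// Toy example: trace of a var<matrix> covariance argument.
inline stan::math::var trace_of(const stan::math::var_value<Eigen::MatrixXd>& K) {
  // Value computed once with double arithmetic; no per-scalar autodiff nodes.
  stan::math::var out = K.val().trace();
  stan::math::reverse_pass_callback([K, out]() mutable {
    // d trace(K) / dK = I, so only the diagonal of K's adjoint is touched.
    K.adj().diagonal().array() += out.adj();
  });
  return out;
}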

dpsimpson commented 4 years ago

This is so cool that this is possible now!

But this is the wrong Laplace branch. The good one is here https://github.com/stan-dev/math/tree/try-laplace_approximation2/stan/math/laplace


SteveBronder commented 4 years ago

Whoops, good catch! I just searched for a Laplace issue and this was the nearest-looking one.

charlesm93 commented 4 years ago

Good catch @dpsimpson. The reason we have two branches is that the first branch was written when we submitted the manuscript on the adjoint-differentiated Laplace approximation (with the code discussed in the paper). The second branch came after discussions on how to improve the code before StanCon, but we didn't want to break "backward compatibility" with the paper.

We get to update the manuscript before its publication, so I'll link to the new branch and delete the old one.

charlesm93 commented 2 years ago

I deleted a bunch of files which I used for my research but which wouldn't make it into production code. The commit with those changes and the commit prior to it, should we need to retrieve those files, are

commit 031a81b8bcc9bfef3dd5c81f791895086db33765 (HEAD -> experimental/laplace, origin/experimental/laplace)
Author: Charles Margossian <charlesm@metrumrg.com>
Date:   Sun Dec 5 19:23:31 2021 -0500

    Delete research tests which are / will not be unit tests.

commit b2e4b2fdfc5a8a0b638eebacb075a7d2e89e572a
Author: Charles Margossian <charlesm@metrumrg.com>
Date:   Sun Dec 5 19:11:08 2021 -0500

    make laplace_marginal_bernoulli_logit_lpmf use general likelihood functor.

both dated December 5th. One file we might want to pull back out at some point is the draft for the negative binomial likelihood.

Currently, every unit test under test/unit/math/laplace/ passes (even though some tests are incomplete). I haven't touched the rng tests yet. These files use the analytical derivatives of the likelihood that I worked out in early prototypes of the embedded Laplace. These will be replaced with automatic differentiation, which is a tad more efficient and requires much less coding for every new likelihood we add. There is also a unit test for the user-specified likelihood, even though this feature is more experimental. I have already switched to autodiff for the function that computes the approximate marginal likelihood. Once the rng files are updated, we can delete a good chunk of code under laplace_likelihood_bernoulli_logit and laplace_likelihood_poisson_log.

Since we might want to further compare autodiff and analytical derivatives, I think it's fine to keep this code in for now.
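For reference, a minimal sketch of what relying on autodiff rather than hand-coded likelihood derivatives could look like, using Stan Math's mixed-mode hessian functor. The poisson_log_lik functor below is a hypothetical illustration, not the code on the branch.

#include <stan/math/mix.hpp>
#include <Eigen/Dense>

// Hypothetical likelihood functor: Poisson log-likelihood with log link,
// dropping the constant lgamma(y + 1) term. Written once in templated form
// so any order of autodiff can be applied to it.
struct poisson_log_lik {
  Eigen::VectorXd y_;  // observed counts
  template <typename T>
  T operator()(const Eigen::Matrix<T, Eigen::Dynamic, 1>& theta) const {
    T lp = 0;
    for (int i = 0; i < theta.size(); ++i)
      lp += y_(i) * theta(i) - stan::math::exp(theta(i));
    return lp;
  }
};

// Usage sketch: the gradient and Hessian of the likelihood come from autodiff,
// with no hand-coded second or third derivatives needed for each new likelihood.
void likelihood_derivatives(const Eigen::VectorXd& y, const Eigen::VectorXd& theta) {
  double lp;
  Eigen::VectorXd grad;
  Eigen::MatrixXd hess;
  stan::math::hessian(poisson_log_lik{y}, theta, lp, grad, hess);
}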

charlesm93 commented 2 years ago

For any future modifications, the unit tests that currently pass should be:

All of these tests check for consistency between (first-order reverse-mode) autodiff and finite differences. For most tests I have benchmarks from GPstuff, but where indicated in the code a benchmark is missing.

charlesm93 commented 2 years ago

Todo list for myself:

  [ ] write a primer design doc (per @SteveBronder's request)
  [ ] complete unit tests for existing functions using benchmarks from GPstuff
  [ ] work out unit tests for rng functions

EDIT: updated 01/16.

dpsimpson commented 2 years ago

Let me know if you need help!


charlesm93 commented 2 years ago

What would be the best way to test the rng? I can try to come up with limiting cases for each example, but I'm wondering if there is a general approach. How was this done for INLA and GPstuff?

dpsimpson commented 2 years ago

It's hard to work out how to test this as it currently stands: the function does three things: computes the Laplace approximation (tested elsewhere), computes the mean and covariance matrix (not tested), and does the multi-normal rng (the function is tested; this specific call is not). My temptation (which might not be very smart) would be to separate the computation of the mean and the covariance into an inline function and test it separately. Then you can test the rest of it in the same way that, say, multi_normal_prec_rng is tested.

wds15 commented 2 years ago

Ripping code apart into testable pieces is not at all a dumb thing to do, from what I've learned.

dpsimpson commented 2 years ago

The other option is to do a probabilistic test that runs the rng 1000+ times and computes the mean and covariances, but that feels like it would be a very expensive test (and it would occasionally fail). I can't think of another way to test a multivariate distribution without direct access to that mean and covariance matrix.

dpsimpson commented 2 years ago

Actually, there is one other possibility, which depends on how the rng argument works. If you can control its state, you could compare the output of laplace_base_rng to that of multi_normal_rng with an analytically computed mean and covariance. That would probably be the sensible thing to do, because they should be (floating-point) identical.

bob-carpenter commented 2 years ago

Ripping code apart into testable pieces is not at all a dumb thing to do, from what I've learned.

There are at least three concerns that need to be balanced:

  1. efficiency
  2. code readability and maintainability
  3. testability

I think @wds15 just means that (1) is usually opposed to (2) and (3).

wds15 commented 2 years ago

A different approach I took for one of my R packages, where I wanted to run SBC, is to proxy the test results from the SBC run. So what I do is:

  1. Run SBC on a cluster and store the final histograms needed to calculate the chi-square statistic. This is stored with a time stamp.
  2. The unit test then a) checks that the time stamp of the stored SBC run is no more than half a year older than the runtime of the test, and b) tests for uniformity via a chi-square test, allowing some failures according to what I would expect.

This way I can have a massive SBC simulation whose key results are part of the test evidence, while ensuring that the SBC outputs are not too old. That's not ideal, but it's a compromise. Maybe a more sophisticated version of this would be useful here as well?
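A stripped-down sketch of the proxy test idea (my illustration in C++, not the actual R package code): check that the stored SBC histogram is recent enough, then compute a chi-square statistic against uniform expected counts.

#include <chrono>
#include <numeric>
#include <vector>

bool sbc_proxy_test_passes(const std::vector<int>& rank_histogram,
                           std::chrono::system_clock::time_point stored_at,
                           double chi_square_critical) {
  using namespace std::chrono;
  // (a) The stored SBC results must be no more than roughly half a year old.
  auto age = system_clock::now() - stored_at;
  if (age > hours(24 * 183))
    return false;
  // (b) Chi-square statistic of the rank histogram against a uniform distribution.
  double total = std::accumulate(rank_histogram.begin(), rank_histogram.end(), 0.0);
  double expected = total / rank_histogram.size();
  double chi_sq = 0.0;
  for (int observed : rank_histogram)
    chi_sq += (observed - expected) * (observed - expected) / expected;
  return chi_sq < chi_square_critical;
}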

dpsimpson commented 2 years ago

I think that's too heavy for such a simple function. Equality of a matrix and a vector, OR equality of a draw made with the same random seed, is enough to show correctness. (Actually, this has to happen twice because there's a factored and a non-factored version.)


charlesm93 commented 2 years ago

I ended up running with one of @dpsimpson's suggestions and implemented it in laplace_poisson_log_rng_test.cpp. I took a limiting case with a diagonal prior covariance and worked out the derivative of the objective function. This is then passed to algebra_solver to obtain the mean. The covariance of the Laplace approximation can be calculated by hand (as a function of the mode).
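For reference, in the standard textbook form (my summary, not necessarily the exact expressions used in the test), the benchmark mode maximizes the penalized log-likelihood under the Gaussian prior with covariance K, and the approximate covariance is built from the negative Hessian of the log-likelihood at the mode:

\hat{\theta} = \arg\max_{\theta} \left[ \log p(y \mid \theta) - \tfrac{1}{2}\, \theta^{\top} K^{-1} \theta \right],
\qquad
\Sigma_{\text{Laplace}} = \left( K^{-1} + W \right)^{-1},
\quad
W = -\nabla^{2}_{\theta} \log p(y \mid \theta) \big|_{\theta = \hat{\theta}}.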

I then ran a bunch of simulations (1e6) and checked estimators of the mean, variance, and covariance. On a two-group problem, this test takes 8 seconds to run. It's not ideal, but it's not horrendous.

I'm not sure how to control the rng state. Both laplace_poisson_log_rng and multi_normal_rng take an rng argument of type

boost::random::mt19937

but I couldn't fix this state. I'm assuming it's possible, in which case we can implement Dan's second proposition. It's a bit of work to get the analytical results, but only a bit, and I think it's good to have unit tests which rely on analytical calculations.

charlesm93 commented 2 years ago

Thanks @SteveBronder for showing me how to control the random seed. Using this, I was able to produce the same sample using the embedded Laplace and multi_normal_rng. The mean and covariance for the latter are calculated (semi-)analytically.

  boost::random::mt19937 rng;
  rng.seed(1954);
  Eigen::MatrixXd theta_pred
    = laplace_poisson_log_rng(sums, n_samples, covariance_function, phi,
                              x_dummy, d0, di0, theta_0, rng);

  // Compute exact mean and covariance.
  Eigen::VectorXd theta_root
    = algebra_solver(stationary_point(), theta_0, phi, d0, di0);
  Eigen::MatrixXd K_laplace
    = laplace_covariance(theta_root, phi);

  rng.seed(1954);
  Eigen::MatrixXd theta_benchmark
    = multi_normal_rng(theta_root, K_laplace, rng);

  double tol = 1e-3;
  EXPECT_NEAR(theta_benchmark(0), theta_pred(0), tol);
  EXPECT_NEAR(theta_benchmark(1), theta_pred(1), tol);

Is this enough? I think this checks the boxes, i.e. making sure the rng function generates a sample from the correct approximating normal distribution.

SteveBronder commented 2 years ago

Seems good! Is this all up on a branch somewhere?

charlesm93 commented 2 years ago

Yes. experimental/laplace.

SteveBronder commented 2 years ago

Just making a note that we removed an experimental feature to flip between diagonal and non-diagonal covariance estimates in commit 6b97f1f7f30c1d230755232cdc8fb68201554d4d. We also removed the do_line_search argument there (and just use max_line_search > 0 to decide whether to run the line search).

charlesm93 commented 2 years ago

I'm going to work on the following update:

Update: ... and job done.

Next up: update the notebook and the examples therein (and done!)

charlesm93 commented 2 years ago

Up next, I want to tackle the sparse covariance case. Of interest is the scenario where the Hessian and the covariance are both sparse, as is the case in a model from our BLS colleagues. In this scenario the B-matrix (the one matrix on which we perform a Cholesky decomposition) also becomes sparse. If B is diagonal, all operations become O(n), as opposed to O(n^3). More generally, if m is the largest block size between the Hessian and the covariance, the complexity of the method would be O(nm^2).
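As a sketch of why the block structure helps (my illustration, not the proposed implementation): with a block-diagonal B, the Cholesky factor can be computed block by block, so roughly n/m blocks at O(m^3) each gives O(nm^2) overall instead of one O(n^3) factorization.

#include <Eigen/Dense>
#include <vector>

std::vector<Eigen::MatrixXd> block_cholesky(const std::vector<Eigen::MatrixXd>& B_blocks) {
  std::vector<Eigen::MatrixXd> L_blocks;
  L_blocks.reserve(B_blocks.size());
  for (const auto& B_k : B_blocks) {  // roughly n / m blocks
    Eigen::MatrixXd L_k = Eigen::LLT<Eigen::MatrixXd>(B_k).matrixL();  // O(m^3) per block
    L_blocks.push_back(L_k);
  }
  return L_blocks;
}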

Before implementing this at a production level, I want to build a proof of concept. My idea would be to add an optional argument:

charlesm93 commented 5 months ago

Moving this to #3065