stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License
733 stars 183 forks source link

Non-centered parameterization of normal_log density #77

Closed syclik closed 8 years ago

syclik commented 9 years ago

From @rtrangucci on February 17, 2015 22:32

Add a normal_ncp_log density so we're not adding 3x the number of vars to var_stack than we need to. Currently:

#include <iostream>
#include <stan/agrad/rev/matrix.hpp>
#include <stan/agrad/rev.hpp>
#include <stan/prob/distributions.hpp>
#include <stan/math/matrix/add.hpp>

int main ()
{
  using stan::agrad::var;

  double sigma = 1.15;
  double mu = 1.1;
  double theta_cp = 1.5;

  Eigen::Matrix<var,Eigen::Dynamic,1> v_theta_cp(6);
  var v_mu(mu);
  var v_sigma(sigma);

  v_theta_cp << theta_cp, theta_cp, theta_cp, theta_cp, theta_cp, theta_cp;

  std::cout << "Instantiate 8 vars: v_theta_cp(6), mu, sigma \n" << std::endl;

  stan::agrad::print_stack(std::cout);

  std::cout << "Calling normal_log\n" << std::endl;

  var lp = stan::prob::normal_log(v_theta_cp, v_mu, v_sigma); 

  std::cout << "Called normal_log \n" << std::endl;

  stan::agrad::print_stack(std::cout);

  stan::agrad::recover_memory();

  std::cout << "Called recover_memory() \n" << std::endl;

  stan::agrad::print_stack(std::cout);

  double tau = 1.15;
  double alpha = 1.1;
  double eta = 1.5;

  Eigen::Matrix<var,Eigen::Dynamic,1> v_eta(6);
  var v_tau(tau);
  var v_alpha(alpha);

  v_eta << eta, eta, eta, eta, eta, eta;

  std::cout << " Instantiate 8 vars: v_tau, v_eta(6), v_alpha \n" << std::endl;

  stan::agrad::print_stack(std::cout);

  Eigen::Matrix<var,Eigen::Dynamic,1> v_theta = stan::math::add(v_alpha, stan::agrad::multiply(v_tau,v_eta));

  std::cout << "Called transformed params \n" << std::endl;

  stan::agrad::print_stack(std::cout);

  var lp_m = stan::prob::normal_log(v_eta,0,1);

  std::cout << "Called Matt trick \n" << std::endl;

  stan::agrad::print_stack(std::cout);

  stan::agrad::recover_memory();

  return 0;
}

Output:

Instantiate 8 vars: v_theta_cp(6), mu, sigma 

STACK, size=8
0  0x7f8c19800000  1.1 : 0
1  0x7f8c19800018  1.15 : 0
2  0x7f8c19800030  1.5 : 0
3  0x7f8c19800048  1.5 : 0
4  0x7f8c19800060  1.5 : 0
5  0x7f8c19800078  1.5 : 0
6  0x7f8c19800090  1.5 : 0
7  0x7f8c198000a8  1.5 : 0
Calling normal_log

Called normal_log 

STACK, size=9
0  0x7f8c19800000  1.1 : 0
1  0x7f8c19800018  1.15 : 0
2  0x7f8c19800030  1.5 : 0
3  0x7f8c19800048  1.5 : 0
4  0x7f8c19800060  1.5 : 0
5  0x7f8c19800078  1.5 : 0
6  0x7f8c19800090  1.5 : 0
7  0x7f8c198000a8  1.5 : 0
8  0x7f8c19800140  -6.71515 : 0
Called recover_memory() 

STACK, size=0
 Instantiate 8 vars: v_tau, v_eta(6), v_alpha 

STACK, size=8
0  0x7f8c19800000  1.15 : 0
1  0x7f8c19800018  1.1 : 0
2  0x7f8c19800030  1.5 : 0
3  0x7f8c19800048  1.5 : 0
4  0x7f8c19800060  1.5 : 0
5  0x7f8c19800078  1.5 : 0
6  0x7f8c19800090  1.5 : 0
7  0x7f8c198000a8  1.5 : 0
Called transformed params 

STACK, size=20
0  0x7f8c19800000  1.15 : 0
1  0x7f8c19800018  1.1 : 0
2  0x7f8c19800030  1.5 : 0
3  0x7f8c19800048  1.5 : 0
4  0x7f8c19800060  1.5 : 0
5  0x7f8c19800078  1.5 : 0
6  0x7f8c19800090  1.5 : 0
7  0x7f8c198000a8  1.5 : 0
8  0x7f8c198000c0  1.725 : 0
9  0x7f8c198000e8  1.725 : 0
10  0x7f8c19800110  1.725 : 0
11  0x7f8c19800138  1.725 : 0
12  0x7f8c19800160  1.725 : 0
13  0x7f8c19800188  1.725 : 0
14  0x7f8c198001b0  2.825 : 0
15  0x7f8c198001d8  2.825 : 0
16  0x7f8c19800200  2.825 : 0
17  0x7f8c19800228  2.825 : 0
18  0x7f8c19800250  2.825 : 0
19  0x7f8c19800278  2.825 : 0
Called Matt trick 

STACK, size=21
0  0x7f8c19800000  1.15 : 0
1  0x7f8c19800018  1.1 : 0
2  0x7f8c19800030  1.5 : 0
3  0x7f8c19800048  1.5 : 0
4  0x7f8c19800060  1.5 : 0
5  0x7f8c19800078  1.5 : 0
6  0x7f8c19800090  1.5 : 0
7  0x7f8c198000a8  1.5 : 0
8  0x7f8c198000c0  1.725 : 0
9  0x7f8c198000e8  1.725 : 0
10  0x7f8c19800110  1.725 : 0
11  0x7f8c19800138  1.725 : 0
12  0x7f8c19800160  1.725 : 0
13  0x7f8c19800188  1.725 : 0
14  0x7f8c198001b0  2.825 : 0
15  0x7f8c198001d8  2.825 : 0
16  0x7f8c19800200  2.825 : 0
17  0x7f8c19800228  2.825 : 0
18  0x7f8c19800250  2.825 : 0
19  0x7f8c19800278  2.825 : 0
20  0x7f8c19800300  -12.2636 : 0

Copied from original issue: stan-dev/stan#1302

syclik commented 9 years ago

From @bob-carpenter on February 18, 2015 5:14

This is great you're tracking stray vars.

What is normal_ncp? Does it have a formula? My only guess is non-centered-parameterization.

I do think it woudl be nice ot have a standard (unit) normal (hardcoded normal(0,1)) and maybe a centered normal (hardcoded normal(0,sigma)).

syclik commented 9 years ago

From @bob-carpenter on February 18, 2015 5:15

Also, whenever you create an issue, you should add labels for it and provide some milestone. I added the labels here, but I don't know how urgent this is.

syclik commented 9 years ago

From @rtrangucci on February 18, 2015 5:34

@bob-carpenter Apologies about the ambiguity in the issue. I'll edit the issue text above, but you're correct, normal_ncp is a non-centered parameterization of the normal density. It occurred to me that the gains from using the matt trick were mitigated by the number of extra vars created by the addition and multiplication in transformed parameters. Given the importance of the ncp in hierarchical models, I figured this might have a tangible speedup on the MRP models we're working on. I'll work out derivatives and a functional form, and add it to the issue as well.

syclik commented 9 years ago

From @bob-carpenter on February 18, 2015 5:42

Have you figured out how somebody can write

alpha ~ normal_ncp(mu, sigma);

and have it use the non-centered parameterization directly? I'm not even sure how that's possible.

I assume we're talking about the centered:

parameters { real alpha; ... model { alpha ~ normal(mu,sigma);

vs. the non-centered:

parameters { real alpha_raw; ...

transformed parameters { real alpha; alpha <- mu + sigma * alpha_raw; ...

model { alpha_raw ~ normal(0,1);

How can you do that with just density function?

On Feb 18, 2015, at 4:34 PM, Rob Trangucci notifications@github.com wrote:

@bob-carpenter Apologies about the ambiguity in the issue. I'll edit the issue text above, but you're correct, normal_ncp is a non-centered parameterization of the normal density. It occurred to me that the gains from using the matt trick were mitigated by the number of extra vars created by the addition and multiplication in transformed parameters. Given the importance of the ncp in hierarchical models, I figured this might have a tangible speedup on the MRP models we're working on. I'll work out derivatives and a functional form, and add it to the issue as well.

— Reply to this email directly or view it on GitHub.

syclik commented 9 years ago

That's what we were shooting for. We haven't figured it out yet, but if we can do it, I think it'll be cool.

On Wednesday, February 18, 2015, Bob Carpenter notifications@github.com wrote:

Have you figured out how somebody can write

alpha ~ normal_ncp(mu, sigma);

and have it use the non-centered parameterization directly? I'm not even sure how that's possible.

I assume we're talking about the centered:

parameters { real alpha; ... model { alpha ~ normal(mu,sigma);

vs. the non-centered:

parameters { real alpha_raw; ...

transformed parameters { real alpha; alpha <- mu + sigma * alpha_raw; ...

model { alpha_raw ~ normal(0,1);

How can you do that with just density function?

  • Bob

On Feb 18, 2015, at 4:34 PM, Rob Trangucci <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

@bob-carpenter Apologies about the ambiguity in the issue. I'll edit the issue text above, but you're correct, normal_ncp is a non-centered parameterization of the normal density. It occurred to me that the gains from using the matt trick were mitigated by the number of extra vars created by the addition and multiplication in transformed parameters. Given the importance of the ncp in hierarchical models, I figured this might have a tangible speedup on the MRP models we're working on. I'll work out derivatives and a functional form, and add it to the issue as well.

— Reply to this email directly or view it on GitHub.

— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1302#issuecomment-74815927.

syclik commented 9 years ago

From @betanalpha on February 18, 2015 9:20

I don’t think it can be done because the CP vs NCP implementations have fundamentally different state spaces. One possible intermediate step is a cholesky-product-add, which is a marginal improvement, or something like

y ~ normal_ncp(alpha_raw, mu, sigma);

which does the product-add internally. But then this would prevent the centered parameters from being saved as transformed parameters.

On Feb 18, 2015, at 5:57 AM, Daniel Lee notifications@github.com wrote:

That's what we were shooting for. We haven't figured it out yet, but if we can do it, I think it'll be cool.

On Wednesday, February 18, 2015, Bob Carpenter notifications@github.com wrote:

Have you figured out how somebody can write

alpha ~ normal_ncp(mu, sigma);

and have it use the non-centered parameterization directly? I'm not even sure how that's possible.

I assume we're talking about the centered:

parameters { real alpha; ... model { alpha ~ normal(mu,sigma);

vs. the non-centered:

parameters { real alpha_raw; ...

transformed parameters { real alpha; alpha <- mu + sigma * alpha_raw; ...

model { alpha_raw ~ normal(0,1);

How can you do that with just density function?

  • Bob

On Feb 18, 2015, at 4:34 PM, Rob Trangucci <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

@bob-carpenter Apologies about the ambiguity in the issue. I'll edit the issue text above, but you're correct, normal_ncp is a non-centered parameterization of the normal density. It occurred to me that the gains from using the matt trick were mitigated by the number of extra vars created by the addition and multiplication in transformed parameters. Given the importance of the ncp in hierarchical models, I figured this might have a tangible speedup on the MRP models we're working on. I'll work out derivatives and a functional form, and add it to the issue as well.

— Reply to this email directly or view it on GitHub.

— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1302#issuecomment-74815927.

— Reply to this email directly or view it on GitHub.

syclik commented 9 years ago

From @bob-carpenter on February 18, 2015 10:38

On Feb 18, 2015, at 8:20 PM, Michael Betancourt notifications@github.com wrote:

I don’t think it can be done because the CP vs NCP implementations have fundamentally different state spaces.

That's why I didn't think it was possible.

One possible intermediate step is a cholesky-product-add, which is a marginal improvement, or something like

y ~ normal_ncp(alpha_raw, mu, sigma);

I'm confused --- where did y come from?

Don't we need (mu + sigma * alpha_raw) as a regression coefficient?

If we start rolling up the non-centered priors with the likelihoods, maybe we should just go all the way to implementing non-centered parameterizations of GLMs.

which does the product-add internally. But then this would prevent the centered parameters from being saved as transformed parameters.

You could save them as generated quantities. It'd be a bit redundant, but much more efficient.

syclik commented 9 years ago

From @rtrangucci on February 18, 2015 21:56

Yeah, upon more thought last night and working through the derivatives, the likelihood would need to be integrated somehow (which is what both @bob-carpenter @betanalpha said above). Given the number of vars created above, I think this would be a large enough improvement in speed that it would be worth the added effort of creating a non-centered parameterization of glms.

syclik commented 9 years ago

From @rtrangucci on March 6, 2015 21:4

Repackaging what @betanalpha suggested above:

As a first cut (before we get ragged arrays in the language), we should implement the following:

In Stan parlance

data {
  int<lower=1> N; // number of observations
  int<lower=1> G; // number of groups
  vector[N] y; // observations
  int<lower=1, upper=G> group_id[N] // group assignment IDs
}
paramaters {
  real<lower=0> tau;
  real<lower=0> sigma;
  real alpha;
  vector[G] std;
}
model {
  y ~ normal_hier_ncp(alpha, std, group_id, tau, sigma);

  tau ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);
  alpha ~ normal(0, 1);
  std ~ normal(0, 1);
}

The next implementation with ragged arrays should be:

data {
  int<lower=1> N; // number of observations
  int<lower=1> G; // number of groups
  int<lower=1> K; // number of 
  vector[N] y; // observations
  int group_id[N,K]; // group assignment IDs
  int num_per_group[K]; // number of random effects group
}
paramaters {
  real<lower=0> tau[K];
  real<lower=0> sigma;
  real alpha;
  real std[K,num_per_group];
}
model {
  y ~ normal_hier_ncp(alpha, std, group_id, tau, sigma);

  tau ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);
  alpha ~ normal(0, 1);
  std ~ normal(0, 1);
}

I think this would allow for faster sampling of hierarchical models where there are random effects for many different factors. If this were to work, a bernoulli logit would be the logical next step.

syclik commented 9 years ago

From @bob-carpenter on March 7, 2015 4:21

I think we need better array composition rather than more specialized normal densities. Once we have more than a couple, they'll be impossible to sort out for users (at least users like me). The problem here is that in multilevel models we might have multiple such indices floating around for multiple groups.

What I'm suggesting is that we need an operation like this:

int ii[N]; vector[I] y;

and then an operation

y[ii] that will return y[ii[1]], y[ii[2]], ...

Once I get some time to work on matrix slicing, we'll have just that. In the interim, we could write a function

vector compose(vector y, int[] ii) { vector[size(ii)] y_ii; for (n in 1:size(ii) y_ii[n] <- y[ii[n]]; return y_ii; }

On Mar 7, 2015, at 8:04 AM, Rob Trangucci notifications@github.com wrote:

Repackaging what @betanalpha suggested above:

As a first cut (before we get ragged arrays in the language), we should implement the following:

In Stan parlance

data { int N; // number of observations int G; // number of groups vector[N] y; // observations int group_id[N] // group assignment IDs } paramaters { real tau; real sigma; real alpha; vector[G] std; } model { y ~ normal_hier_ncp(alpha, std, group_id, tau, sigma);

tau ~ cauchy(0, 2.5); sigma ~ cauchy(0, 2.5); alpha ~ normal(0, 1); std ~ normal(0, 1); }

The next implementation with ragged arrays should be:

data { int N; // number of observations int G; // number of groups int K; // number of vector[N] y; // observations int group_id[N,K]; // group assignment IDs int num_per_group[K]; // number of random effects group } paramaters { real tau[K]; real sigma; real alpha; real std[K,num_per_group]; } model { y ~ normal_hier_ncp(alpha, std, group_id, tau, sigma);

tau ~ cauchy(0, 2.5); sigma ~ cauchy(0, 2.5); alpha ~ normal(0, 1); std ~ normal(0, 1); }

I think this would allow for faster sampling of hierarchical models where there are random effects for many different factors. If this were to work, a bernoulli logit would be the logical next step.

— Reply to this email directly or view it on GitHub.

syclik commented 8 years ago

Closing. @rtrangucci remembers that it would make sense with ragged arrays, but not now.