katezhangwen / Efficiency-of-marginalising-over-discrete-latent-parameters

4 stars 2 forks source link

efficiency of Stan code #3

Open bob-carpenter opened 2 years ago

bob-carpenter commented 2 years ago

There are a few things you can do to speed up the Stan code.

[Dawid-Skene]

Line 29 can just be:

vector[K] log_p_z = log(pi);  // vectorized

Line 33-38 would be more efficient as:

{
  vector[K] log_theta[J, K] = log(theta);  //  log only once
  for (n in 1:N)
    log_p_z[ii[n], ] += log_theta[jj[n], , y[n]];  // vectorized
}

Lines 45--50 is much more efficient rewritten with vectorization:

for (k in 1:K)
  theta[ , k] ~ dirichlet(beta[k]);

Note that I've changed beta so that it doesn't vary by j. Having a different prior for each rater probably doesn't make sense (and it's not used according to the paper), so I'd suggest line 20 be replaced with:

vector<lower=0>[K] beta[K];

Finally, given that you're not doing anything with the log likelihood outside of transformed parameters, I'd suggest rewriting transformed parameters to be this:

vector[I] log_lik;
{
  ... everything there used to be, updated as above ...
}
for (i in 1:I)
  log_lik[i] = log_sum_exp(log_p_z[i]);

and then in the model, you can just vectorize the addition directly,

target += log_lik;

Then you can just remove the generated quantities block.

There's much less I/O this way, and I/O is time-consuming, especially on clusters. But if you want the log_p_z you need to save them.

Mixtures

This code's simpler, but can still be optimized.

First, you can just drop line 19---it's a no-op because dirichlet(1) is uniform. If you insist on including this, the rep_vector definition should go in generated quantities so it's not built each model loop.

The main thing to optimize is the for loop in 20--24, which can be redone with vectorization as (moving the declaration on line 16 into the loop for enhanced readability):

vector[3] log_mixing_p = log(mixing_p);  // cache logs
for (n in 1:N) {
  vector[3] log_component = log_mixing_p;
  for (k in 1:3)
    log_component[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
  target += log_sum_exp(log_component);
}

With some extra work this can be made even more efficient, but I don't want to make the code unreadable beyond caching all the super-expensive log calculations.

Also, there's a more efficient log_mix function built in for two-component mixtures. I had thought we'd extended to vectors, but I don't see it in our doc, which I think is an oversight we need to fix.

katezhangwen commented 2 years ago

Hi @bob-carpenter ,

Thanks for all your suggestions. I am unsure about how the vectorisation is done for log_p_z. I am assuming that on line 29 you meant:

vector[K] log_p_z[I] = log(pi);

since log_p_z is a two-dimensional array.

The model after following your suggestions is:

data {
  int<lower=1> N;               // total number of annotations
  int<lower=1> J;               // number of annotators
  int<lower=1> K;               // number of annotation categories
  int<lower=1> I;               // number of items
  int<lower=1, upper=I> ii[N];  // item index for annotation n
  int<lower=1, upper=J> jj[N];  // annotator for annotation n
  int<lower=0, upper=K> y[N];   // annotation for observation n
  vector<lower=0>[K] alpha;     // prior for pi
  vector<lower=0>[K] beta[K];   // prior for theta
}

parameters {
  simplex[K] pi;
  simplex[K] theta[J, K];
}

transformed parameters {
  vector[I] log_lik;
  vector[K] log_theta[J, K] = log(theta);  //  log only once
  vector[K] log_p_z[I] = log(pi);
  for (n in 1:N) {
    log_p_z[ii[n], ] += log_theta[jj[n], , y[n]];  // vectorized
  }
  for (i in 1:I) {
    log_lik[i] = log_sum_exp(log_p_z[i]);
  }

}

model {
  // prior on pi
  pi ~ dirichlet(alpha);

  for (k in 1:K) {
    theta[ , k] ~ dirichlet(beta[k]);
  }
  target += log_lik;
}

However, I get this error from running the model (using rstan 2.21.5):

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Variable definition base type mismatch, variable declared as base type vector[ ] variable definition has base type vectorVariable definition dimensions mismatch, definition specifies 2, declaration specifies 1 error in 'modelce8380d741e_DawidSkene' at line 31, column 32
  -------------------------------------------------
    29:   vector[I] log_lik;
    30:   vector[K] log_theta[J, K] = log(theta);  //  log only once
    31:   vector[K] log_p_z[I] = log(pi);
                                       ^
    32:   for (n in 1:N) {
  -------------------------------------------------

Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  failed to parse Stan model 'DawidSkene' due to the above error.

Maybe I am missing something about your comment of saving the log_p_z

if you want the log_p_z you need to save them

I am not sure what you mean here.

Thanks in advance!

bob-carpenter commented 2 years ago

The problem is that pi and log_p_z do not have the same shape.

simplex[K] pi;
...
vector[K] log_p_z[I] = log(pi);

I think this is what you want:

vector[K] log_p_z[I] = rep_array(log(pi), I);

so that each entry of log_p_z is initialized to log(pi). But then I'd suggest not saving all the intermediates, so you can put those in a local block. Altogether, that's

transformed parameters {
  vector[I] log_lik;
  {  // this extra block makes log_theta and log_p_z local variables so they're not saved
    vector[K] log_theta[J, K] = log(theta);  //  log only once
    vector[K] log_p_z[I] = rep_array(log(pi), I);
    for (n in 1:N) {
      log_p_z[ii[n], ] += log_theta[jj[n], , y[n]];  // vectorized
    }
    for (i in 1:I) {
      log_lik[i] = log_sum_exp(log_p_z[i]);
    }
  }
}
katezhangwen commented 2 years ago

Thanks Bob @bob-carpenter !

My other question is about the prior for theta. The vectorisation

for (k in 1:K)
  theta[ , k] ~ dirichlet(beta[k]);

is not working due to the error:

 SYNTAX ERROR, MESSAGE(S) FROM PARSER:
No matches for: 

  vector[ ] ~ dirichlet(vector)

Available argument signatures for dirichlet:

  vector ~ dirichlet(vector)

Real return type required for probability function.
 error in 'modelce8311d41cfa_DawidSkene' at line 47, column 36
  -------------------------------------------------
    45: 
    46:   for (k in 1:K) {
    47:     theta[, k] ~ dirichlet(beta[k]);
                                           ^
    48:   }
  -------------------------------------------------

Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  failed to parse Stan model 'DawidSkene' due to the above error.

This is because theta[, k] and dirichlet(beta[k]) have different dimensions. I tried rep_array and it doesn't work here. I will really appreciate any help!

Here's the code altogether in case it helps:

/* Bayesian implementation of Dawid and Skene's noisy categorical rating model.
 * This implementation requires data in a 'long' format in order to allow
 * incomplete designs. This implementation is heavily based on the implementation
 * of this model for complete designs given in the Stan Manual section TODO.
 * as well as publicly avaliable code for fitting MAP estimates via EM
 * for the same model. That code can be veiwed here:
 * This code was written before the author was aware of an implemtation of this
 * model in Paun et. al. 2018.
 */

data {
  int<lower=1> N;               // total number of annotations
  int<lower=1> J;               // number of annotators
  int<lower=1> K;               // number of annotation categories
  int<lower=1> I;               // number of items
  int<lower=1, upper=I> ii[N];  // item index for annotation n
  int<lower=1, upper=J> jj[N];  // annotator for annotation n
  int<lower=0, upper=K> y[N];   // annotation for observation n
  vector<lower=0>[K] alpha;     // prior for pi
  vector<lower=0>[K] beta[K];   // prior for theta
}

parameters {
  simplex[K] pi;
  simplex[K] theta[J, K];
}

transformed parameters {
  vector[I] log_lik;
  {  // this extra block makes log_theta and log_p_z local variables so they're not saved
    vector[K] log_theta[J, K] = log(theta);  //  log only once
    vector[K] log_p_z[I] = rep_array(log(pi), I);
    for (n in 1:N) {
      log_p_z[ii[n], ] += to_vector(log_theta[jj[n], , y[n]]);  // vectorized
    }
    for (i in 1:I) {
      log_lik[i] = log_sum_exp(log_p_z[i]);
    }
  }
}

model {
  // prior on pi
  pi ~ dirichlet(alpha);

  for (k in 1:K) {
    theta[, k] ~ dirichlet(beta[k]);
  }
  target += log_lik;
}
bob-carpenter commented 2 years ago

Ack. Looks like it's not vectorized in the code. I'll open an issue to fix that on the Stan side:

https://github.com/stan-dev/docs/issues/560

You'll have to go for the loop now, which isn't so bad because the vectorized forms won't be any more efficient.

katezhangwen commented 2 years ago

Hi Bob, thanks again for all your helpful suggestions! We have updated the preprint with the improved models. See the new version at: https://arxiv.org/abs/2204.06313.

We have incorporated all your suggestions in this thread apart from vectorising the Dirichlet distribution since the model failed to compile. However, we saw in the Stan issue (https://github.com/stan-dev/docs/issues/560) that there is an instance where the model compiled successfully with the vectorisation. Can you explain why our model doesn't work?

bob-carpenter commented 2 years ago

We have updated the preprint

Thanks, @katezhangwen --- I saw the announcement on Discourse and have the revision queued up to read.

Can you explain why our model doesn't work?

Can you share your model and which version of which Stan interface you're using? I'm guessing it's because you're using an older version of Stan.

katezhangwen commented 2 years ago

Hi Bob,

We are using rstan 2.21.5. See the model below.

/* Bayesian implementation of Dawid and Skene's noisy categorical rating model.
 * This implementation requires data in a 'long' format in order to allow
 * incomplete designs. This implementation is heavily based on the implementation
 * of this model for complete designs given in the Stan Manual section TODO.
 * as well as publicly avaliable code for fitting MAP estimates via EM
 * for the same model. That code can be veiwed here:
 * This code was written before the author was aware of an implemtation of this
 * model in Paun et. al. 2018.
 */

data {
  int<lower=1> N;               // total number of annotations
  int<lower=1> J;               // number of annotators
  int<lower=1> K;               // number of annotation categories
  int<lower=1> I;               // number of items
  int<lower=1, upper=I> ii[N];  // item index for annotation n
  int<lower=1, upper=J> jj[N];  // annotator for annotation n
  int<lower=0, upper=K> y[N];   // annotation for observation n
  vector<lower=0>[K] alpha;     // prior for pi
  vector<lower=0>[K] beta[J,K];   // prior for theta
}

parameters {
  simplex[K] pi;
  simplex[K] theta[J, K];
}

transformed parameters {
  vector[I] log_lik;
  {  // this extra block makes log_theta and log_p_z local variables so they're not saved
    vector[K] log_theta[J, K] = log(theta);  //  log only once
    vector[K] log_p_z[I] = rep_array(log(pi), I);
    for (n in 1:N) {
      log_p_z[ii[n], ] += to_vector(log_theta[jj[n], , y[n]]);  // vectorized
    }
    for (i in 1:I) {
      log_lik[i] = log_sum_exp(log_p_z[i]);
    }
  }
}

model {
  // prior on pi
  pi ~ dirichlet(alpha);

  for (j in 1:J) {
    for (k in 1:K) {
       //prior on theta
       theta[j, k] ~ dirichlet(beta[j, k]);
    }
  }

  target += log_lik;
}
bob-carpenter commented 2 years ago

This is a good example of where there's not a good way to order the types because we need the simplex argument to be the last argument for the typing, but we want the second index k to be the last index to avoid copying. If the data size N is very large, it can be useful to re-order log_theta as a 2D array of vectors. If everything was conformal, the n in 1:N loop could be replaced with:

log_p_z[ii] += log_theta[jj, y];

But there's really no way to do that in Stan with current typing without copying theta or log_theta at least once.

We are using rstan 2.21.5.

That doesn't have vectorized Dirichlet. The current Stan version is 2.30, but RStan's hung up on CRAN, etc. There's a more recent 2.26 version of RStan available through GitHub, which would let you vectorize.

cmdstanr keeps up to date with Stan because it's just a light wrapper around cmdstan.