stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License
2.57k stars 368 forks source link

Inconsistency when using optimizing a model with simplex parameters #406

Closed cscherrer closed 10 years ago

cscherrer commented 10 years ago

I need to find a MAP estimate for a categorical mixture model. First I compile a mode.

stanmod <- stan_model("myModel.stan")

Then this works just fine:

umap <- optimizing(stanmod, data=dat, init=init)

But somehow, the input and the result have different dimensionality. Here's the input:

fit <- sampling(stanmod, data=dat, iter=0) length(unconstrain_pars(fit,init)) [1] 126

And here's the MAP estimate, in the unconstrained space:

length(umap$par) [1] 130

Because of the mismatch, it will of course not allow me to calculate the MAP estimate in the constrained space:

map <- constrain_pars(fit, umap$par) Error: Number of unconstrained parameters does not match that of the model (130 vs 126).

I suspect this different of 4 has to do with the fact that my model has 4 parameters, each of which is a simplex.

The model is below. Thanks!

data {
  int N;
  int K;
  int G;
  int D;
  int<lower=0, upper=1> hh[N];
  int<lower=1, upper=G> gg[N];
  int<lower=1, upper=D> dd[N];
  int<lower=0, upper=1> ss[N];
}

parameters {
  simplex[2] ph[K];
  simplex[D] pd[K];
  simplex[G] pg[K];
  simplex[2] ps[K];
}

model {
  for(n in 1:N) {
    real temp[K];
    for(k in 1:K) {
      # log(P[obs n | obs n comes from group k])
      temp[k] <- -log(K) 
        + categorical_log(hh[n]+1,ph[k])
        + categorical_log(dd[n],pd[k])
        + categorical_log(gg[n],pg[k])
        + categorical_log(ss[n]+1,ps[k]);

    }
    increment_log_prob( log_sum_exp(temp));
  }
}
bob-carpenter commented 10 years ago

On 11/19/13, 7:22 PM, Chad Scherrer wrote:

I need to find a MAP estimate for a categorical mixture model. First I compile a mode.

stanmod <- stan_model("myModel.stan")

Then this works just fine:

umap <- optimizing(stanmod, data=dat, init=init)

But somehow, the input and the result have different dimensionality. Here's the input:

fit <- sampling(stanmod, data=dat, iter=0)
length(unconstrain_pars(fit,init))
[1] 126

And here's the MAP estimate, in the unconstrained space:

length(umap$par)
[1] 130

My guess is that umap$par is already on the constrained space. The parameters retain the same order, so if that's true, the values should all be between 0 and 1 and the first three should add up to 1.

We tried to set all the interfaces up to be in the "natural" space.

You're right about the simplex. For a K-simplex, there are (K-1) underlying parameters for identifiability, so if there are 4 simplexes, there should be 4 more constrained parameters than unconstrained ones.

Some comments on the model below.

Because of the mismatch, it will of course not allow me to calculate the MAP estimate in the constrained space:

map <- constrain_pars(fit, umap$par)
Error: Number of unconstrained parameters does not match that of the model (130 vs 126).

I suspect this different of 4 has to do with the fact that my model has 4 parameters, each of which is a simplex.

The model is below. Thanks!

|data { int N; int K; int G; int D; int hh[N]; int gg[N]; int dd[N]; int ss[N]; }

parameters { simplex[2] ph[K]; simplex[D] pd[K]; simplex[G] pg[K]; simplex[2] ps[K]; }

You can also simplify the model by replacing the 2-simplexes with single real values constrained between 0 and 1 and use a Bernoulli distribution.

model { for(n in 1:N) { real temp[K]; for(k in 1:K) {

log(P[obs n | obs n comes from group k])

   temp[k] <- -log(K)
     + categorical_log(hh[n]+1,ph[k])
     + categorical_log(dd[n],pd[k])
     + categorical_log(gg[n],pg[k])
     + categorical_log(ss[n]+1,ps[k]);
 }
 increment_log_prob( log_sum_exp(temp));

} }

This will be a lot faster if you take the logs of simplexes once, then reuse them instead of using categorical_log, which recomputes the log of ph, pd, pg, and ps on each iteration.

I'd recommend getting it working first, though, before optimizing.

cscherrer commented 10 years ago

Hi Bob,

Yes, it could be that umap is returning something in the constrained space. Given the labeling of the components, it seems this could be the case.

If this is so, shouldn't the result be a list, rather than a vector? Maybe this is a problem in RStan that is removed from Stan per se?

I'd love to get things running faster, but it's not clear to me what you mean by, "take the logs of simplexes once". You're talking about the parameters, right? Since these change on each iteration, how would it be possible to cache them?

EDIT: I think I see what you mean now. Precompute the log-probabilities, and hand-code the log-probability increment instead of calling the categorical_log function. Makes sense.

Thanks, Chad

bob-carpenter commented 10 years ago

It's not a Stan issue. CmdStan (as we're dubbing the command-line interface) definitely returns the unconstrained space. But it doesn't do it as a structured list.

Jiqiang provided convenience returns for the sampler, and we should have the optimizer do the same thing. You can get the array of parameter names, but it then takes R fiddling beyond my level of expertise to convert this to a list.

I'm going to close this issue on Stan and open a new feature request on RStan.

As to optimization, I'm thinking this:

model { for(n in 1:N) { real temp[K]; for(k in 1:K) {

log(P[obs n | obs n comes from group k])

   temp[k] <- -log(K)
     + categorical_log(hh[n]+1,ph[k])
     + categorical_log(dd[n],pd[k])
     + categorical_log(gg[n],pg[k])
     + categorical_log(ss[n]+1,ps[k]);
 }
 increment_log_prob( log_sum_exp(temp));

} }

could be rewritten as:

model { vector[2] log_ph[K]; ... repeat for others ... for (k in 1:K) log_ph[k] <- log(ph[k]); ... repeat for others ...

for (n in 1:N) { real temp[K]; for (k in 1:K) {

log(P[obs n | obs n comes from group k])

   temp[k] <- -log(K)
     + log_ph[k, hh[n] + 1]
     + ... repeat for others;
 }
 increment_log_prob(log_sum_exp(temp));

} }

This reduces N * K * 4 log() calls to (4 + D + G) * K logs. So if N >> (4 + D + G) it'll be worth it computationally.

And while it wouldn't really make the code much, if any faster, you could transpose the log_ph (and related) and vectorize the calculation of temp (assuming you declare it as vector[K] type):

temp <- -log(K) + log_ph[hh[n] +1] + ....;

And then the likelihood's just:

for (n in 1:N) increment_log_prob(log_sum_exp(-log(K) + log_ph[hh[n] +1] + ....);

And to really fine-tune, rather than adding a negated term, use subtraction: (b - a) should be 50% faster than (-a + b) and takes 2/3 the memory, assuming a is a parameter (or local variable).

On 11/19/13, 8:28 PM, Chad Scherrer wrote:

Hi Bob,

Yes, it could be that |umap| is returning something in the constrained space. Given the labeling of the components, it seems this could be the case.

If this is so, shouldn't the result be a list, rather than a vector? Maybe this is a problem in RStan that is removed from Stan per se?

I'd love to get things running faster, but it's not clear to me what you mean by, "take the logs of simplexes once". You're talking about the parameters, right? Since these change on each iteration, how would it be possible to cache them?

Thanks, Chad

— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/406#issuecomment-28856398.

bob-carpenter commented 10 years ago

What you want to use is this:

params_list <- unconstrain_pars(fit, u_params_vector)

where fit is a Stan fit object, u_params_vector is the vector of unconstrained parameters, and params_list is a list indexed by the names of the variables with their structured values.

For example, take the model

parameters {
  real y;
  simplex[4] theta;
}
model {
  y ~ normal(0,1);
}

in the file simplex.stan and fit it in RStan with:

> fit <- stan('simplex.stan')

Then you can use the fit object to map between constrained and unconstrained parameters. For example,

> theta <- constrain_pars(fit,c(13.2,-1,2,3))

> theta
$y
[1] 13.2

$theta
[1] 0.109231773 0.701022162 0.180747193 0.008998873

The exact transform used for simplexes is described in the chapter on variable transforms near the end of the manual. It was set up so that the transfrom of the 0-vector is symmetric, as in:

constrain_pars(fit,c(13.2,0,0,0));
$y
[1] 13.2

$theta
[1] 0.25 0.25 0.25 0.25

Given that it looks like the functionality Chad was requesting is already in place, I'm closing the issue here on the Stan issue tracker.

If anyone has another feature request for RStan, please submit an issue to RStan's issue tracker.