softmax, categorical, multinomial with K-1 parameters for identifiability

stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.

https://mc-stan.org

BSD 3-Clause "New" or "Revised" License

2.58k stars 368 forks source link

softmax, categorical, multinomial with K-1 parameters for identifiability #266

Open bob-carpenter opened 10 years ago

bob-carpenter commented 10 years ago

Add new versions of all these functions for the (K-1)-vector parameterization of softmax.

softmax_dofm1(vector x)  =def=  softmax(c(x,0) )

where c(x,0) is the R function to append 0 to the end of x.

The advantage is that the degrees of freedom matches that of the simplex target, so the model's identified without priors.

The disadvantage is that the priors have to change relative to the K-value parameterization and are no longer symmetric.

[ ] Any suggestions for a better name than dofm? I suppose we could say identified.
[ ] Any preference for c(x,0) or c(0,x), i.e., should the 0 value be the first or the last element?

bob-carpenter commented 10 years ago

The current workaround requires too much index fiddling.

vector[K-1] alpha_dofm1;
...
{
  vector[K] alpha;
  for (k in 1:(K-1))
    alpha[k] <- alpha_dofm1[k];
  alpha[K] <- 0;
  ... use softmax(alpha) as before --- it is a K-simplex ...
}

Note that alpha is declared in a new block to make it a local variable.

ariddell commented 10 years ago

Should we perhaps add this workaround to the manual? Apart from the verbosity is there a reason that those interested in using the K-1 parameterization shouldn't start using this?

IMHO, anyone who can write a Stan model that uses softmax or categorical_logit can handle this without any problems.

ariddell commented 10 years ago

This gets kindof ugly if one has matrix[K-1,P] alpha; instead of just vector[K-1] alpha;.

bob-carpenter commented 10 years ago

This will be much easier with the equivalent of the new append operations (see pull request #790). If K is the number of dimensions for the simplexes, then

parameters {
  matrix[N, K-1] alpha_raw;
}
model {
  matrix[N, K] alpha;
  alpha <- col_append(alpha_raw, rep_vector(0, N));  
  ...
  y[n] ~ categorical_logit(alpha[n]);
}

There shouldn't be any need to actually use the softmax operation directly. Of course, users might want to see the probabilities, at which case, the generated quantities could be used (if alpha is a transformed parameter---otherwise it would need to be redefined).

generated quantities {
   simplex[K] theta[N];
    for (n in 1:N) theta[n] <- softamx(alpha[n]);
}

bob-carpenter commented 10 years ago

This would all be easier if we had

an append operation that took a vector and a scalar
vectorization of categorical_logit to take in an array of vectors --- we don't have the support classes for this, which is why it's not there yet, but it could be added (see how multivariate normals are done and perhaps consolidate the code)

ariddell commented 10 years ago

I am working on a multi-level varying-intercept, varying-slope categorical logit model right now. I tried the K-1 parameterization but it was painful. I take it append_col is slated for 2.4++?

Right now I'm inclined to try identifying the model using redundant parameters. I'd much prefer the K-1 method.

bob-carpenter commented 10 years ago

I think the commit went in after 2.4. So it's in the current dev branch.

On Jul 28, 2014, at 12:56 PM, Allen Riddell notifications@github.com wrote:

I am working on a multi-level varying-intercept, varying-slope categorical logit model right now. I tried the K-1 parameterization but it was painful. I take it append_col 2.4++?

— Reply to this email directly or view it on GitHub.

randommm commented 10 years ago

yep.

On 14-07-28 03:43 PM, Bob Carpenter wrote:

I think the commit went in after 2.4. So it's in the current dev branch.

Bob

ariddell commented 10 years ago

Ok. I'll be a tester for the append_col approach. It looks good. And I certainly prefer raw to dofm1.

bob-carpenter commented 10 years ago

I don't think "raw" is descriptive enough. I agree that "dof" is confusing, though, and don't have a good suggestion about what to call a new function.

On Jul 28, 2014, at 5:36 PM, Allen Riddell notifications@github.com wrote:

Ok. I'll be a tester for the append_col approach. It looks good. And I certainly prefer raw to dofm1.

— Reply to this email directly or view it on GitHub.

ariddell commented 10 years ago

This seems to be working well 2.5pre. Thanks for the help.

model_code = """
data {
  int<lower=1> N;  // number of observations
  int<lower=1> P;  // number of features
  int<lower=1> K;  // number of target categories
  int<lower=1> J;  // number of groups
  int<lower=1, upper=K> target[N];
  int<lower=1, upper=J> group[N];
  vector[P] x[N];
}
parameters {
  vector[K-1] alpha_raw[J];  // group intercepts
  matrix[K-1, P] beta_raw[J];
}
transformed parameters {

  vector[K] alpha[J];
  matrix[K, P] beta[J];  // K-1 handling

  // alpha K-1 handling
  for (j in 1:J) {
    for (k in 1:(K-1))
        alpha[j][k] <- alpha_raw[j][k];
    alpha[j][K] <- 0;
  }

  // beta K-1 handling
  for (j in 1:J) {
    for (k in 1:(K-1))
      beta[j][k] <- beta_raw[j][k];
    beta[j] <- append_row(beta_raw[j], rep_row_vector(0, P));  // append_row is in Stan version 2.5pre
  }
}
model {

  for (j in 1:J)
    alpha_raw[j] ~ normal(0, 5);

  for (j in 1:J)
    for (k in 1:(K-1))
      beta_raw[j][k] ~ normal(0, 5);  // beta_raw[j][k] is a 1xP row vector

  for (i in 1:N)
    target[i] ~ categorical_logit(alpha[group[i]] + beta[group[i]] * x[i]);
}
generated quantities {
   simplex[K] theta[J];
   for (j in 1:J) theta[j] <- softmax(alpha[j]);
}
"""