paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0
1.27k stars · 183 forks

Adding Support for Multinomial-Logistic Normal #338

Closed jsilve24 closed 6 years ago

jsilve24 commented 6 years ago

Hi Paul,

As I have said before, awesome package. Very glad to see this available to the community. I was wondering if it would be possible to implement a compound Multinomial-Logistic Normal distribution (useful in many different models, including correlated topic models and much more).

Specifically, would it be possible to support models of the form:

    y_i ~ Multinomial(n_i, pi_i)
    pi_i = phi^-1(eta_i)
    eta_i = alpha + beta x_i + v_i,   v_i ~ N(0, V)

where phi^-1 is some bijective exponential transform between the D-dimensional simplex and (D-1)-dimensional real space (e.g., an identified softmax such as the following):

  // map a (D-1)-vector to the D-simplex by appending a fixed zero
  // before applying softmax (makes the transform bijective)
  vector softmax_id(vector alpha) {
    vector[num_elements(alpha) + 1] alphac1;
    for (k in 1:num_elements(alpha))
      alphac1[k] = alpha[k];
    alphac1[num_elements(alphac1)] = 0;
    return softmax(alphac1);
  }

I realize this is a little different from typical models as it's not technically a multilevel model. But it is a type of compound distribution, like the zero-inflated distributions already implemented in brms (it is just a compound distribution over an infinite set of integers rather than just over 0/1).

Thanks again! Justin

paul-buerkner commented 6 years ago

This looks pretty similar to what is already implemented with the categorical family in brms. Would you mind taking a look and telling me in which regards it differs?

jsilve24 commented 6 years ago

I believe the big difference is that the model I wrote above has the extra-multinomial (or extra-categorical) variability coming from the v_i term. That is, there is extra multivariate normal/logistic-normal noise in the likelihood beyond the multinomial. There would be priors on the terms α, β, and V.

Am I missing something? Can this already be done with brms?

paul-buerkner commented 6 years ago

Can't this just be implemented via a random intercept, i.e. terms of the form (1|group), where group is some grouping variable over which you want your intercept to vary?

jsilve24 commented 6 years ago

HAHAHAHAHA you're totally right! I feel pretty dense.

Thank you!

paul-buerkner commented 6 years ago

You are welcome :-)

jsilve24 commented 6 years ago

I am actually not sure this is equivalent.

Question: Does the categorical response distribution in brms allow for "multinomial" responses? Typically, when I think categorical, I think of something that can be represented as a factor vector in R. On the other hand, multinomial responses are actually a vector of counts (e.g., Y_ij represents the number of counts for category j seen in sample i).

Does brms allow multinomial responses?

paul-buerkner commented 6 years ago

Well, it is possible indirectly. Just as you can code a single binomial response using multiple rows each with a bernoulli response, you can code a single multinomial response using multiple rows each with a categorical response. Computationally, the latter is of course far less efficient.
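As an illustration, the expansion could be sketched like this (a hypothetical example, not brms code; the data and names are made up):

```r
# Expand one multinomial count vector into one categorical (factor)
# response per observed count.
counts <- c(a = 3, b = 1, c = 2)              # one multinomial observation
y_cat  <- rep(names(counts), times = counts)  # "a" "a" "a" "b" "c" "c"
df_cat <- data.frame(y = factor(y_cat))
# brm(y ~ 1, data = df_cat, family = categorical()) would then fit
# the equivalent (but much slower) model, one row per count.
```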

I would like to add multinomial logit / probit to brms, but unfortunately, there is no built-in for this distribution in Stan yet. rstanarm has something implemented on a branch (see https://github.com/stan-dev/rstanarm/tree/feature/mnp), but I am not sure how well this works or if I can generalize it to meet the flexibility of brms.

paul-buerkner commented 6 years ago

Did this approach work out for you?

jsilve24 commented 6 years ago

I don't think so...

My entire dataset has about 20-40 million counts, so the categorical method will likely be extremely inefficient. The mnp branch seems to be for multinomial probit, which is different from what I was hoping for. For my purposes the logistic component is very important.

I think the best option may be the multinomial-Poisson trick / transform as discussed here: http://www.r-inla.org/faq#TOC-I-have-multinomial-data-but-I-cannot-find-the-multinomial-likelihood-isn-t-it-available-

I have gotten it working in R-INLA, but I still need some of the extra flexibility of brms/Stan. I will likely try this out in brms soon. That said, what would be really great is if this trick could be done covertly behind the scenes by brms, as it's a pain to implement yourself. Thoughts?

Justin
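For reference, the identity behind the multinomial-Poisson trick can be sketched as follows (the notation eta_i and lambda_ij is illustrative, not brms parameter names): if the counts in row i are modeled as independent Poissons,

    Y_ij ~ Poisson(exp(eta_i + lambda_ij)),   j = 1, ..., D

then, conditional on the row total n_i = sum_j Y_ij, the vector Y_i is Multinomial(n_i, pi_i) with

    pi_ij = exp(lambda_ij) / sum_k exp(lambda_ik).

An observation-specific intercept eta_i absorbs the totals, while the lambda_ij carry the multinomial-logit structure.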


jsilve24 commented 6 years ago

Is it possible to set / constrain a group level parameter to zero and / or to impose a sum constraint? Either option would be required to use the Poisson representation as done in R-INLA.

For example, I know In R-INLA you can set a grouping variable level to NA and it just is essentially like constraining the corresponding parameter to zero.

Is this possible in brms?

Justin


paul-buerkner commented 6 years ago

Yeah, 20-40 million counts is definitely too much... To how many rows would that translate in the multinomial case?

I think rstanarm also has an mnl branch which implements multinomial logit. Anyway, it would be good to implement both link functions; the question is just with which I should start. I would prefer the one which is easier to implement, and I guess that's the logit one.

paul-buerkner commented 6 years ago

I am rather tired right now, so apologies if I suggest something stupid, but do we really need something special for the Poisson transformation? I mean, in the standard parameterization of random effects, they are centered around zero anyway. So wouldn't the following work in general?

df <- data.frame(
  x = rep(0:1, each = 50),
  y1 = c(sample(1:10, 50, TRUE), sample(6:15, 50, TRUE)),
  y2 = c(sample(11:20, 50, TRUE), sample(6:15, 50, TRUE)),
  y3 = c(sample(21:30, 50, TRUE), sample(6:15, 50, TRUE)),
  obs = 1:100
)

library(tidyverse)
df_long <- df %>%
  gather("key", "y", y1:y3)

library(nnet)
df$Y <- cbind(df$y1, df$y2, df$y3)
fit1 <- multinom(Y ~ x, df)
summary(fit1)

library(brms)
fit2 <- brm(y ~ key + key:x + (1 | obs), 
            df_long, family = poisson())
summary(fit2)

jsilve24 commented 6 years ago

Nothing stupid about it. You might be right. When you say centered around zero... do you mean that they have a mean-zero prior or that they are deterministically constrained to sum to zero? (Just confirming: the second is what is needed to ensure the Poisson model is identical to the multinomial.)

I will take a closer look at what you just suggested when I am in front of my laptop.


paul-buerkner commented 6 years ago

A mean-zero prior, but everything non-zero will be absorbed into the intercept, so they will stay around zero.

jsilve24 commented 6 years ago

So that doesn't work. The resulting Poisson model is not the same as the multinomial model. In theory there is a chance of getting the right answer, but it's at best approximate and potentially quite different from the intended model.

Is it not possible to, for example, just internally drop one of the parameters, e.g. set beta_0 = 0? This seems so simple in Stan... but I don't understand the inner workings of brms.


paul-buerkner commented 6 years ago

Can you give me more of a reasoning why this doesn't work? I understand it is not identical, but I would expect it to come pretty close, since the only difference is hierarchical centering as opposed to a hard sum-to-zero constraint. I may of course be totally mistaken here.

brms is designed as a high-level interface, not as a complete programming language such as Stan. This is not about the internals of brms, but about its syntax, which currently cannot reflect setting a certain random-effect value to zero. You can always extract the Stan code generated by brms via make_stancode, change it according to your needs, and fit it directly with Stan. The related data passed to Stan can be prepared via make_standata.

jsilve24 commented 6 years ago

No, you're not totally mistaken. Two points. (1) It's not the model you expect, and you will never know how "close" you are; in fact, you can prove that you can never know how close you are on real data in the absence of known ground truth. (2) While you will never know how close you are, you can know what types of datasets are most likely to have problems with these issues, and in my experience it's particularly bad in every single type of multinomial count data I have seen.

Take the case of a two-dimensional multinomial / binomial. Let pi = {pi_1, pi_2} such that pi_1 + pi_2 = 1 be the multinomial/binomial parameters. Importantly, these parameters have only one degree of freedom (pi_1 is completely determined by pi_2 and vice versa). This can be essential: inferences involving these parameters must take such a strong deterministic dependence into account. If you instead model these as you suggest, there is no guarantee that the inferred posterior will correctly capture this extreme dependency.

To drive the point home, let's say that between two groups (case vs. control) pi_1 increases and pi_2 decreases. Your model could very well put significant probability mass on both increasing, or both decreasing, or on some combination that does not exactly cancel out. In fact, it is almost certain that your inferred model will not satisfy the sum constraint, especially if the total number of counts in each multinomial realization varies significantly between samples. Let's say in the case group 1000 total counts were observed between multinomial categories 1 and 2, whereas in the control group only 100 counts were observed. Assuming that pi_1 is approximately of the same order as pi_2, you would expect the counts Y_1 to be higher in the case group than in the control, and the same for Y_2. In this situation, what information does the model really have to tell the difference between a change in the total number of counts and having both pi_1 and pi_2 higher in the case group than in the control group? The model only has the weak zero-mean assumption, which is really saying something completely different.

Now, you could say that in the above example you could just take the posterior estimates and normalize them (e.g., force the sum constraint on posterior samples), but this actually doesn't work. Perhaps it is easier to convince yourself of this if you think about a regression problem rather than simply two point masses (one for each group) in the parameter space. You need the sum constraint to ensure that the inferred regression parameters are identical between the Poisson and multinomial models. Changes in the total number of counts between samples can confound inferences regarding the parameters.
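The underlying identifiability issue can be sketched in one line (writing eta_i for an observation-level intercept and lambda_ij for the category effects; illustrative notation only): for any constant c,

    eta_i + lambda_ij  =  (eta_i - c) + (lambda_ij + c),

so shifting all category effects up by c and all observation intercepts down by c leaves every Poisson mean exp(eta_i + lambda_ij) unchanged. Without a hard constraint (a reference category fixed to zero, or a sum-to-zero restriction), only contrasts among the lambda_ij are identified, and a soft mean-zero prior merely shrinks the ambiguity rather than removing it.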

Hope this explains it a little and was not overly rambling.


jsilve24 commented 6 years ago

One idea would be to include a dummy regressor: in addition to the categorical grouping variable, include an "observed" regressor that is binary, zero or one, and set that regressor to zero for all count observations that correspond to beta_0. The model will try to fit that variable in the background, but it won't affect your inference if you just ignore it... However, I have played around with Stan in situations where I accidentally leave such an unused parameter in the model, and it does not go well: the posterior variation around that parameter usually blows up, which leads to poor curvature and poor mixing with lots of max_treedepth warnings...

paul-buerkner commented 6 years ago

Thanks for all the explanations. Yes, this makes sense. Because I may have misunderstood on which parameter to put the sum-to-zero constraint, I will have to ask again.

I found this very simple explanation on the internet: https://stats.stackexchange.com/questions/105282/logistic-multinomial-regression-as-two-multiple-poisson-regressions

where we have parameters eta, theta_i, alpha_j and beta_j, with i indexing observations and j indexing categories. Given that the explanation is correct, which of those need a sum-to-zero (or fix-one-to-zero) constraint to make sure the model is correctly specified?

paul-buerkner commented 6 years ago

To put it differently, when you run

fit2 <- brm(y ~ key + key:x + (1 | obs), data = df_long, family = poisson())

as in my example above, and then compute for instance

hypothesis(fit2, "keyy2:x - keyy1:x = 0")

I would expect, based on the explanation I linked to, that these are the coefficients of x in the multinomial model, and at least in my toyish example they are. Sorry again if I am still missing an important point.

jsilve24 commented 6 years ago

I have removed my previous comment as I have just thoroughly confused myself and I am wondering if you are correct. I will have a response shortly. Sorry for the trouble.

paul-buerkner commented 6 years ago

Did you come to a conclusion about the usage of the Poisson models?

jsilve24 commented 6 years ago

I apologize, this is still on my radar. Just been working on finishing up a separate manuscript. Will hopefully be able to get back to this shortly (few weeks).

jsilve24 commented 6 years ago

Just spent a while trying to work this out. Here are my thoughts:

On the theory: yes, I think this approach may work with the Poisson model plus manipulation of the posterior samples afterwards. I have not actually gotten to test this out, for the reasons below.

On brms's ability to fit the model I specified above: in my first comment I mentioned that I hoped to model v_i ~ N(0, V) (where v_i is a D-1 dimensional vector). I am not sure, but can brms actually model the covariance between v_ij and v_ik for j != k? I think this might be possible with the multivariate syntax, but manually cbinding all D dimensions (often at least 10) would be very tedious. Am I missing something? What if I was happy saying v_ij ~ N(0, sigma_j) (i.e., saying V was diagonal but without identical elements)?

paul-buerkner commented 6 years ago

To model these varying effects v_i including correlations, you could try

fit3 <- brm(y ~ key + key:x + (0 + key | obs), data = df_long, family = poisson())
# or 
fit4 <- brm(y ~ key + key:x + (1 + key | obs), data = df_long, family = poisson())

which in both cases would produce a D-dimensional vector of varying effects. Why should it be D-1 dimensional? In any case, you can specify that as well by manually defining the contrasts and replacing (0 + key | obs) with (0 + cont1 + cont2 + ... | obs). If you don't want correlations to be modeled, replace | with ||.

jsilve24 commented 6 years ago

I think this should work. I still have lots of thinking to do about how to speed this up, as it seems to be quite slow, but I "THINK" it works... Sorry to be so slow and frustrating! I will close this for the moment.