paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0

Problems using emmeans with brms for Ordinal regression #980

Closed nvanzuy closed 8 months ago

nvanzuy commented 4 years ago

Hi,

We are using brms to run some ordinal regression analyses and want to estimate the marginal means from the model. The model we are running is similar to the one given below.

Would it be possible to get some feedback on this issue? It seems to be related to https://github.com/rvlenth/emmeans/issues/113.

Here is some code which reproduces the problem:

library(tidyverse)
library(emmeans)
library(brms)
library(tidybayes)
library(dplyr)

data(mtcars)

mtcars <- mtcars %>%
  mutate(carb = as.ordered(carb), cyl = as.factor(cyl))

model <- brm(
  carb ~ cyl,
  family = cumulative(link = probit, threshold = "flexible"),
  prior = c(
    set_prior(prior = "normal(0, 1)", class = "Intercept"),
    set_prior(prior = "normal(0, 1)", class = "b")
  ),
  data = mtcars
)

emmeans_model <- emmeans(model, trt.vs.ctrl ~ cyl)

Error in vcov(object)[nm, nm, drop = FALSE] : subscript out of bounds
In addition: Warning message:
In model.matrix.default(trms, m, contrasts.arg = contr) :
  variable 'carb' is absent, its contrast will be ignored

If this requires some work, I'm happy to have a look at it if you can point me to the function that needs to be amended.

Thanks, Natalie

paul-buerkner commented 4 years ago

The problem is likely with the handling of the ordinal thresholds, as they are part of the fixed effects and thus of vcov, but I haven't looked at it in detail yet. Tagging @rvlenth to make him aware of this issue as well.

rvlenth commented 4 years ago

I agree with @paul-buerkner's assessment. Ordinal models have an intercept for each cut point. The way this is handled in the emmeans support for other ordinal models (e.g., ordinal::clm) is to create a pseudo-factor named cut and an expanded model matrix that has a block for each level of cut, with the intercept in the corresponding position.
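A rough sketch of that block structure (illustrative only, not the actual emmeans internals; the column names here are made up), using the mtcars example from above:

# Build the expanded matrix described above: one indicator column per
# level of the pseudo-factor cut, paired with a copy of the slope
# columns of the original design matrix.
X <- model.matrix(~ factor(cyl), data = mtcars)[, -1, drop = FALSE]
k <- nlevels(ordered(mtcars$carb)) - 1   # one intercept per cut point
Xexp <- do.call(rbind, lapply(seq_len(k), function(j) {
  cuts <- matrix(0, nrow(X), k, dimnames = list(NULL, paste0("cut", 1:k)))
  cuts[, j] <- 1
  cbind(cuts, X)
}))
dim(Xexp)   # (k * 32) rows; k intercept columns plus the slopes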

There is also the matter of what one means by "marginal means" in this context. The marginal means of the linear predictor for each cut point? The average of these over cut points (which I support for clm as mode = "latent")? The cumulative probabilities? The probabilities between cut points?

The emmeans support for clm, clmm, and polr models is in https://github.com/rvlenth/emmeans/blob/master/R/ordinal-support.R. It is among the most complex code in the package, because of the many possibilities mentioned above. But I think what is probably needed is a fairly straightforward adaptation of all or part of this code. Some of the supported modes require post-processing via functions named in, e.g., misc$postGridHook.

paul-buerkner commented 4 years ago

Thank you @rvlenth! This is really helpful in ensuring that ordinal brms models will be supported in emmeans at some point (once I find the time to dig into the details of the code you linked to).

nvanzuy commented 4 years ago

Thank you both for your response. I look forward to the updates.

paul-buerkner commented 3 years ago

With the latest brms and emmeans versions, emmeans on an ordinal model now runs without me making any changes in brms. @rvlenth, did you fix something on your side? I think the thresholds are currently simply ignored.

rvlenth commented 3 years ago

Hmmm. I don't think I changed anything that would affect this. I was also able to get through this without an actual error, though it crashed my R session the first try (who knows why). I also got a warning message about ignoring the contrast matrix for carb.

> EMM <- emmeans(model, ~ cyl)
Warning message:
In model.matrix.default(terms, data, ...) :
  variable 'carb' is absent, its contrast will be ignored

> EMM
 cyl emmean lower.HPD upper.HPD
 4     0.00     0.000      0.00
 6     1.02     0.125      1.92
 8     1.25     0.396      2.06

Point estimate displayed: median 
Results are given on the probit (not the response) scale. 
HPD interval probability: 0.95 

I assume that happens because carb is a factor, but it is also the response variable. The way to avoid that would be to strip off the response variable from attr(object, "contrasts") in about the 14th line of emm_basis.brmsfit.
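In code, that fix might look something like this (a sketch; only attr(object, "contrasts") comes from the comment above, the rest is illustrative):

# Drop the response variable's entry ('carb' here) from the stored
# contrasts before the model matrix is built:
contr <- attr(object, "contrasts")
contr <- contr[setdiff(names(contr), "carb")]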

Currently emm_basis.brmsfit does not have a mode for separating out the intercepts for the ordinal model, and my guess is that what we are seeing is somewhat akin to the "latent" mode in the emmeans support for models such as ordinal::clm. On the latent scale, the origin and scale parameter are not uniquely identified, but it still allows one to assess comparisons among factor levels.

> contrast(EMM, "trt.vs.ctrl")
Note: Use 'contrast(regrid(object), ...)' to obtain contrasts of back-transformed estimates
 contrast estimate lower.HPD upper.HPD
 6 - 4        1.02     0.125      1.92
 8 - 4        1.25     0.396      2.06

Note: contrasts are still on the probit scale 
Point estimate displayed: median 
HPD interval probability: 0.95 

The way this is scaled, the Dunnett-style comparisons are the same as the EMMs, because the first level is a hard zero.

paul-buerkner commented 3 years ago

Thanks for your input! I am glad sometimes things appear to solve themselves. :-D

I agree that we are now doing the latent approach. I have relabeled this issue as a feature and will leave it open so that we are reminded to implement the per-threshold option as well at some point.

paul-buerkner commented 8 months ago

Closing this issue for now to reduce the load on the issue tracker. If someone has time and wants to work on this, feel free to write here and I am happy to reopen.

benwhalley commented 6 months ago

I'm aware that the concept of estimating a group 'mean' from an ordinal outcome is somewhat problematic, but I currently have some Likert-style (1-7) data which I'd like to first model with family=cumulative, and then summarise using emmeans.

This currently looks like it works, but I've come across a quirk when using emmeans this way. A reproducible example is below.

library(tidyverse)   # tibble(), %>%, mutate()
library(brms)
library(emmeans)
library(pander)

set.seed(123)
df <- tibble(y = ordered(as.character(sample(1:7, 1000, replace = TRUE))),
             x = sample(letters[1:3], 1000, replace = TRUE)) %>%
  mutate(y_num = as.numeric(y))
df

m <- brm(y~x, data=df, family = cumulative)
emmeans(m, "x", type="response") %>% 
  as.tibble() %>% 
  pander()

df %>% 
  group_by(x) %>% 
  summarise(mean(y_num)) %>% 
  pander()

The output is as follows:

--------------------------------------
 x   response   lower.HPD   upper.HPD 
--- ---------- ----------- -----------
 a     0.5         0.5         0.5    

 b    0.4976     0.4341      0.5646   

 c    0.5196     0.4538      0.5834   
--------------------------------------

-----------------
 x   mean(y_num) 
--- -------------
 a      3.991    

 b      3.981    

 c      4.086    
-----------------

I think, given the way the data are constructed, these two tables should be pretty similar. However, this example matches my experience with my own data, where the estimates are always 1 point higher than the means of as.numeric(y).

rvlenth commented 6 months ago

I can't really comment on this because the actual results from emmeans are masked by filtering them through as.tibble and pander (whatever that is). I am not a "tidy" person, and all this filtering of the results destroys the annotations that were so carefully added by my summary methods before you even get to see them.

All that said, I think what we see here is like apples compared with oranges. By default, the emmeans results for an ordinal model are on the latent scale (though, that said, I'm confused by why they are not all near zero). I suspect you might get more comparable results by adding mode = "mean.class" to the emmeans() call.
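In code, that suggestion would be the following (note: as discussed further down, the ordinal-specific modes are not currently implemented for brmsfit objects, so treat this as intended usage rather than something that works today):

# Suggested call; mode = "mean.class" is documented for ordinal::clm
# and MASS::polr fits, not (yet) for brmsfit models.
emmeans(m, "x", mode = "mean.class")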

PS -- I am unable to fit the model for some reason, even though I reinstalled it immediately beforehand:

> m <- brm(y~x, data=df, family = cumulative)
Compiling Stan program...
Error in `process_initialize(self, private, command, args, stdin, stdout, …`:
! Native call to `processx_exec` failed
Caused by error in `chain_call(c_processx_exec, command, c(command, args), pty, pty_options, …`:
! Command 'mingw32-make.exe' not found @win/processx.c:982 (processx_exec)

benwhalley commented 6 months ago

Yes, by default emmeans results are on the latent scale, with a reference category set to zero. I added type="response" to try to get responses on the original ordered scale. I'm not sure what that actually produced.

FWIW I think it would be perfectly reasonable for emmeans to just refuse to work with type="response", but if it does produce numbers it should be clear what they are. I suspect it might be quite a common case that people have ordered responses on a 1-7 or 1-10 scale and would like to reconstruct those responses for a given set of predictors after modelling, and even take summaries like the mean. I can see that it's not all that well defined, though.

rvlenth commented 6 months ago

but if it does produce numbers it should be clear what they are

Exactly. Which is why it's so sad that so many users throw away that information when they think they are "tidying" up the output. Package developers write print and summary methods for a reason.

In addition, type = "response" does have meaning. It says that if there is a response transformation or link, the results are back-transformed. But the latent scale is not a transformation, so type = "response" has no effect. A user can't just make up features and expect them to work the way they wish; they have to look at the documentation and see what is appropriate and if what they want is provided.
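For contrast, here is a case where type = "response" does have an effect, a Poisson model with a log link (a sketch using a base R dataset and standard emmeans behaviour):

# With a log link, type = "response" back-transforms the EMMs:
fit <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)
emmeans(fit, "tension")                     # on the log (link) scale
emmeans(fit, "tension", type = "response")  # back-transformed rates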

benwhalley commented 6 months ago

I can see you wouldn't want people to lose the messages, but you do see them interactively even when coercing to a tibble. However, they aren't super-informative in this specific case. Running emmeans(m, "x", type="response") produces this:

Warning: brms' emmeans support for ordinal models is experimental and currently ignores the threshold parameters.
Warning: brms' emmeans support for ordinal models is experimental and currently ignores the threshold parameters.
 x response lower.HPD upper.HPD
 a    0.500     0.500     0.500
 b    0.498     0.434     0.565
 c    0.520     0.454     0.583

I get that it's experimental, but I'm not really sure what "currently ignores the threshold parameters" means. You get exactly the same message whether or not you include the type="response" argument, so it's not an effective warning that the results are either on the latent scale (the default) or somewhat undefined.

Also, as an aside, as a package user it's always pretty annoying to me when a tabular result is neither a dataframe nor something that can be treated like one. I don't exclusively use the tidyverse, but the pattern of treating the output of modelling and similar functions as data to be manipulated is great. It's so common to want to aggregate results from different models and print them in a consistent way (e.g. in Rmarkdown). I don't think that pattern is incompatible with printing nice messages to the console. The brms hypothesis function would be a good example here: it prints text messages and information, but returns actual results as values in a df.

e.g. in this case, it would be nice if you could write:

emmeans(m, "x") %>% filter(emmean > 2)

In fact, it would be even nicer if the column names adhered to the tidy standard and didn't change around (even if the print methods did show different names), because that leads to less special-casing in users' code. E.g. emmean could consistently be named estimate in the df return value.

rvlenth commented 6 months ago

Also, as an aside, as a package user it's always pretty annoying to me when a tabular result is neither a dataframe nor something that can be treated like one.

You seem to be making an assumption that this is not true. In fact, summary.emmGrid() produces an annotated object that inherits from data.frame and whose print method displays the data frame with unexaggerated precision along with the annotations. For ordinal models fitted with ordinal, MASS, and some others, there are several post-processing options that include probabilities, cumulative probabilities, reverse cumulative probabilities, and means on the 1, 2, ..., k scale of the ordinal response (see https://cran.r-project.org/web/packages/emmeans/vignettes/sophisticated.html#ordinal). Those post-processing options are not currently incorporated in the brmsfit support.
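For reference, those modes look like this for a supported class (a simplified version of the wine example from the linked vignette, using ordinal::clm):

library(ordinal)
library(emmeans)
wine.clm <- clm(rating ~ temp + contact, data = wine)
emmeans(wine.clm, ~ temp)                          # latent scale (default)
emmeans(wine.clm, ~ rating | temp, mode = "prob")  # per-class probabilities
emmeans(wine.clm, ~ temp, mode = "mean.class")     # mean on the 1..k scale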

It would create utter chaos among my users if I changed the column names to some tidy "standard". I do know that the broom package has methods for emmGrid and emm_summary objects and can be used to rename output to conform to some kind of standard. But I know of no "tidy" standards for including informative annotations, which is something I have worked hard to include in my summaries.
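A sketch of that broom route (assuming broom's tidier for emmGrid objects, as mentioned above):

# tidy() renames the columns to the tidy convention (e.g. 'emmean'
# becomes 'estimate'), at the cost of dropping the annotations:
library(broom)
tidy(emmeans(m, "x"))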

benwhalley commented 6 months ago

Sorry - I really hadn't meant to drag this down into bikeshedding about return formats. I really do appreciate the concise print methods that emmeans has, and it's not so hard to convert to df when aggregating results over multiple runs. Agreed that it would be nice to have a tidy (or any) standard for adding annotations to returned objects. I tend to do this informally in my own code quite a lot.

I think my questions at this point are (i) what does emmeans actually return when applied to brms ordinal models as I did above, and (ii) is it possible to get the cool modes like mode = "prob" and mode = "mean.class" for brms models, as described in that vignette? Perhaps @paul-buerkner is best placed to answer, though?

venpopov commented 6 months ago

Judging by the numbers, if I had to guess, in this case type=response treats the cumulative model like logistic regression with a single threshold at the grand mean, so you end up with values near 0.5? If that's the case, it seems like a bug: if type=response doesn't work with brms cumulative models because it is ignoring the thresholds, it should just give an error, or a warning that the option is ignored and responses are on the latent scale.

venpopov commented 6 months ago

Yes, this is what it's doing. a is your reference level, so it gets a fixed estimate of 0 on the latent scale, and b and c are relative to it. Then emmeans with type=response just applies the inv_logit transformation as if it were logistic regression. You can see this more clearly if you exaggerate the differences in the data:

set.seed(123)
df <- data.frame(
  latent = c(rnorm(1000, -2, 1),
             rnorm(1000, 0, 1),
             rnorm(1000, 2, 1)),
  x = rep(letters[1:3], each=1000)
)
thr <- c(-2.5, -1.5, -0.5, 0.5, 1.5, 2.5)
df$y <- cut(df$latent, breaks = c(-Inf, thr, Inf), labels = 1:7)
df$y <- as.numeric(df$y)

m <- brm(y~x, data=df, family = cumulative)
emmeans(m, "x", type="response")
[screenshot: output of the emmeans call above]

The message is quite explicit actually: results are backtransformed from the logit scale, ignoring threshold parameters.

without type=response:

[screenshot: emmeans(m, "x") output]
brms::inv_logit_scaled(3.52)
[1] 0.9712515
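The exact 0.5 for the reference level in the earlier output is consistent with this, since the reference level is pinned at 0 on the latent scale:

brms::inv_logit_scaled(0)
[1] 0.5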
venpopov commented 6 months ago

btw @benwhalley you are confusing the print output format of emmeans with the object that it produces. The return value of emmeans is a complex S4 object, but when you print it, it internally calls summary(), which does produce a data.frame:

[screenshot: printed emmeans object]

So you can in fact do operations like filter, but you have to do it on summary(emmeans(...)).
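For example (a sketch, using the model and filter from earlier in the thread):

library(dplyr)
summary(emmeans(m, "x")) %>% filter(emmean > 2)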

When you do as.data.frame(emmeans(...)), you get the same result as from summary(): an annotated data.frame with attributes which store additional information, such as the scale:

[screenshot: as.data.frame(emmeans(...)) output with its annotations]

The reason you missed those messages is that as.tibble() strips the summary_emm class, so the print method does not know it should show the extra info:

[screenshot: as.tibble() output, without the annotations]

So your note that you "see the warnings" interactively was only partially correct: you miss other important information that @rvlenth pointed out.

rvlenth commented 6 months ago

The term "bikeshedding" means arguing over trivial matters. But this shows it wasn't trivial, and the user would have gotten a clear answer to the question by just looking at the unfiltered results from emmeans.

More and more, I am recalling that there really is no ordinal-model-specific emmeans support for brmsfit models. What we see here is just what comes out in the wash when we treat it as a glm, and that turns out to be an interpretation of the model coefficients pretty much on the latent scale, but with the reference class constrained to zero, rather than the grand mean set to zero as in emmeans's methods for ordinal::clm and MASS::polr models. However, be careful, because cumulative logit models often have their signs reversed, so that the class with the highest latent mean will actually have the lowest estimate. The sign reversal is related to modeling the cumulative probabilities via $\mathrm{logit}\,P(Y \le j) = \theta_j - \eta$, with $\eta$ being the linear predictor.

If summarized with type = "response", the estimates are simply back-transformed via the inverse link function. The special ordinal-support code uses a mode (not type) argument to delineate these options.
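As a quick numeric illustration of that parameterization (a sketch in base R):

# Cumulative logit: P(Y <= j) = plogis(theta_j - eta). A larger eta
# lowers every cumulative probability, shifting mass toward higher
# classes, which is why naively read coefficients can look sign-reversed.
theta <- c(-1, 0, 1)                 # three thresholds, four classes
eta   <- 2                           # linear predictor for one group
cum   <- plogis(theta - eta)
diff(c(0, cum, 1))                   # per-class probabilities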