vincentarelbundock / marginaleffects

R package to compute and plot predictions, slopes, marginal means, and comparisons (contrasts, risk ratios, odds, etc.) for over 100 classes of statistical and ML models. Conduct linear and non-linear hypothesis tests, or equivalence tests. Calculate uncertainty estimates using the delta method, bootstrapping, or simulation-based inference
https://marginaleffects.com

Trying to bridge the emmeans gap #1232

Open mattansb opened 1 week ago

mattansb commented 1 week ago

I'm once again trying to switch to {marginaleffects}, but have found two cases that are buggy or still very hard to execute (sorry for mixing a bug report with a "discussion"):

library(afex)
library(emmeans)
library(marginaleffects)

data(md_12.1)
A <- aov_ez("id", "rt", md_12.1, 
            within = c("angle", "noise"))

1. Simple effects (conditional joint tests)

# Trying to re-create this:
joint_tests(A, by = "noise") 
#> noise = absent:
#>  model term df1 df2 F.ratio p.value
#>  angle        2   9   8.143  0.0096
#> 
#> noise = present:
#>  model term df1 df2 F.ratio p.value
#>  angle        2   9  50.706  <.0001

avg_predictions(A, variables = "angle") |> 
  hypotheses(hypothesis = difference ~ pairwise | noise, joint = TRUE)
#> Error: Assertion failed. One of the following must apply:
#>  * checkmate::check_number(hypothesis): Must be of type 'number', not
#>  * 'formula'
#>  * checkmate::check_numeric(hypothesis): Must be of type 'numeric', not
#>  * 'formula'
#>  * checkmate::check_null(hypothesis): Must be NULL

avg_predictions(A, variables = "angle", newdata = datagrid(noise = "absent")) |> 
  hypotheses(hypothesis = "pairwise", joint = TRUE)
#> Error: Assertion failed. One of the following must apply:
#>  * checkmate::check_number(hypothesis): Must be of type 'number', not
#>  * 'character'
#>  * checkmate::check_numeric(hypothesis): Must be of type 'numeric', not
#>  * 'character'
#>  * checkmate::check_null(hypothesis): Must be NULL

# This also fails:
avg_predictions(A, variables = "angle") |> 
  hypotheses(hypothesis = difference ~ pairwise | noise)
#> Error in sanitize_hypothesis_formula(hypothesis): Assertion on 'Right-hand side of `hypothesis` formula' failed: Must be element of set {'reference','sequential','meandev'}, but is 'pairwise'.

2. Interaction contrasts

ems <- emmeans(A, ~ angle + noise)

w <- data.frame(
  "None vs Some" = c(-2, 1, 1) / 2,
  "High vs Med" = c(0, 1, -1)
)

# Trying to re-create this:
contrast(ems, interaction = list(angle = w, noise = "pairwise"))
#>  angle_custom noise_pairwise   estimate   SE df t.ratio p.value
#>  None.vs.Some absent - present     -162 16.2  9  -9.970  <.0001
#>  High.vs.Med  absent - present       84 24.0  9   3.500  0.0067

# Is there an easier way to do this?

avg_predictions(A, variables = c("angle", "noise"),
                hypothesis = \(x) {
                  x |> 
                    dplyr::group_by(noise) |> 
                    dplyr::reframe(
                      term = colnames(w),
                      estimate = t(estimate %*% as.matrix(w))
                    ) |> 
                    dplyr::group_by(angle = term) |> 
                    dplyr::summarize(
                      term = paste0(term[1], "-diff"),
                      estimate = estimate[1] - estimate[2]
                    )
                })
#> 
#>               Term        angle Estimate Std. Error     z Pr(>|z|)    S 2.5 %
#>  High.vs.Med-diff  High.vs.Med        84       24.0  3.50   <0.001 11.1    37
#>  None.vs.Some-diff None.vs.Some     -162       16.2 -9.97   <0.001 75.4  -194
#>  97.5 %
#>     131
#>    -130
#> 
#> Type:  response 
#> Columns: term, angle, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high

Created on 2024-10-14 with reprex v2.1.1

3. Automatic dfs?

Currently all tests are, by default, Wald z-tests, which can be counter-conservative (especially in small samples). Are there any plans for automatic df extraction from the model?

vincentarelbundock commented 1 week ago

Are there any plans for automatic df extraction from the model?

Isn't this sufficient?

avg_predictions(A, df = insight::get_df(A))
vincentarelbundock commented 1 week ago

Unfortunately, the joint test features are currently very limited. I will look into this in the future, and probably delegate some of the computation to the multcomp package (glht()), which already works well with marginaleffects.

One tip: Whenever possible, avoid the hypotheses() function and use the hypothesis argument directly in the main function call.
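For instance (a rough sketch; the exact shortcut names accepted by the hypothesis argument can vary across versions, and this does not by itself perform the joint test), the pairwise contrasts of the angle predictions can be requested in the same call:

# pairwise contrasts computed directly in the main call, no hypotheses() pipe
avg_predictions(A, variables = "angle", hypothesis = "pairwise")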

For your interaction, I would say that this requires adopting a radically different point of view.

IMHO, the only people who can reasonably be expected to make sense of your code are extremely advanced emmeans experts.

In contrast, the hyp function I show doesn’t require specifying a weird w contrast matrix, understanding the specific concept of “interaction” at play, or what “pairwise” means in this context.

The code below is obviously more verbose than emmeans, but I would argue that it is also more transparent. Moreover, specifying contrasts by reference to the actual levels feels safer than building a contrast matrix that assumes a certain order in the data.

I guess what I’m saying is that emmeans experts like you probably won’t see much value here. And honestly, they should probably keep using emmeans if it works for them. But for some new users who want to make complex comparisons, I think it may be easier to specify contrasts using a language they already know: simple dplyr or base R transformations.

Of course, YMMV!

library(afex)
library(emmeans)
library(marginaleffects)

data(md_12.1)
A <- aov_ez("id", "rt", md_12.1, within = c("angle", "noise"))

# emmeans
ems <- emmeans(A, ~ angle + noise)
w <- data.frame(
  "None vs Some" = c(-2, 1, 1) / 2,
  "High vs Med" = c(0, 1, -1)
)
contrast(ems, interaction = list(angle = w, noise = "pairwise"))
>  angle_custom noise_pairwise   estimate   SE df t.ratio p.value
>  None.vs.Some absent - present     -162 16.2  9  -9.970  <.0001
>  High.vs.Med  absent - present       84 24.0  9   3.500  0.0067

# marginaleffects
hyp <- function(x) {
  dplyr::reframe(x,
    term = c("None vs Some", "High vs Med"), 
    estimate = c(estimate[angle == "X0"] - mean(estimate[angle != "X0"]),
                 estimate[angle == "X8"] - estimate[angle == "X4"]),
    .by = "noise")[, 2:3] |>
  dplyr::summarize(estimate = diff(estimate), .by = "term")
}
predictions(A, 
    newdata = datagrid(angle = unique, noise = unique),
    hypothesis = hyp)
> 
>          Term Estimate Std. Error     z Pr(>|z|)    S 2.5 % 97.5 %
>  None vs Some     -162       16.2 -9.97   <0.001 75.4  -194   -130
>  High vs Med        84       24.0  3.50   <0.001 11.1    37    131
> 
> Type:  response 
> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
mattansb commented 6 days ago

Isn't this sufficient?

avg_predictions(A, df = insight::get_df(A))

Yes, for fixed-effects models. rmANOVA or mixed-effects models require something other than the residual df.


Whenever possible, avoid the hypotheses() function and use the hypothesis argument directly in the main function call.

I didn't see an option for joint tests outside of hypotheses().


Re:

extremely advanced emmeans experts

Thanks 😊

Also thanks for the example. I think I might be able to build a function-generating function that can bridge the emmeans way and the marginaleffects way to some extent. Would that be something you would consider adding?

vincentarelbundock commented 6 days ago

I think I might be able to build a function-generating function that can bridge the emmeans way and the marginaleffects way to some extent. Would that be something you would consider adding?

That's really intriguing!

Yes, in principle, that's something I'd consider. But fair warning: I'm annoyingly picky about user interfaces. One thing I don't want to do is replicate the emmeans user interface. Another thing I do want is to keep the "spirit" of the user interface consistent.

What does "spirit" mean? Not sure...

I guess what I'm saying is that it would be important to discuss the ultimate user interface before writing any code, because there's a real chance I won't merge it if we can't find an interface that makes sense and is transparent.

mattansb commented 3 days ago

I'll start super ambitious and we can work our way back.

Broadly, I'd define a contrast as any comparison between groups of estimates, possibly conditional on another variable. So we would need a function that can define the groups to be compared, how to compare them, and which other variables to condition on.

library(marginaleffects)

data("mtcars")
mtcars$cyl <- factor(mtcars$cyl)

mod <- lm(mpg ~ cyl * am, data = mtcars)

contr <- function(..., comparison = "difference", by = NULL, transform = identity, aggregate = mean) {
  # magic goes here
}

# generates a function that can be fed to `hypothesis=`
hyp <- contr(
  c(4, 8) %vs% 6,
  4 %vs% 8, 

  comparison = "ratio",
  by = "am"
)

avg_predictions(mod, variables = c("cyl", "am"),
                hypothesis = hyp)

The %vs% operator is shorthand that could be implemented so that, instead of list(a, b), users can write a %vs% b.

This definition excludes other types of contrasts, such as polynomial contrasts. These can be defined by passing a vector/matrix of contrast weights (in which case the comparison argument is ignored).
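Just to illustrate the idea (a hypothetical sketch, not working package code), %vs% could simply bundle its two sides into a named list that contr() would later turn into contrast weights:

# hypothetical %vs% operator: label each side by the levels it groups together
`%vs%` <- function(a, b) {
  setNames(list(a, b), c(paste(a, collapse = "+"), paste(b, collapse = "+")))
}
c(4, 8) %vs% 6
# a named list with two elements: $`4+8` = c(4, 8) and $`6` = 6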

vincentarelbundock commented 3 days ago

Interesting.

I think it’s useful to spell something out explicitly. In your examples, there are two very different operations:

  1. Aggregating (or “collapsing”) the categories
  2. Contrasting different estimates

Philosophically, marginaleffects wants to keep these two things separate:

  1. Aggregation is done in the by argument
  2. Contrasting is done in the hypothesis argument

Consider a simple example. First, we fit a model and make predictions for each row of the original data (no aggregation). Since mtcars has 32 rows, we get 32 predictions:

library(marginaleffects)
library(tibble)

mod <- lm(mpg ~ factor(cyl) * am, data = mtcars)
predictions(mod) |> nrow()
> [1] 32

Let’s aggregate/average unit-level predictions for each cyl value:

avg_predictions(mod, by = "cyl")
> 
>  cyl Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
>    6     19.7      1.146 17.2   <0.001 218.5  17.5   22.0
>    4     26.7      0.914 29.2   <0.001 618.7  24.9   28.5
>    8     15.1      0.810 18.6   <0.001 255.0  13.5   16.7
> 
> Type:  response 
> Columns: cyl, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high

Now, say you want to “collapse” the 4/6 and 8 cyl groups into two buckets. Fundamentally, this is just another aggregation/averaging operation, which we can do with the by argument:

bydf = tribble(
    ~ cyl, ~ by,
    4, "small",
    6, "small",
    8, "large"
)

avg_predictions(mod, by = bydf)
> 
>     By Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
>  small     24.0      0.715 33.5   <0.001 816.9  22.6   25.4
>  large     15.1      0.810 18.6   <0.001 255.0  13.5   16.7
> 
> Type:  response 
> Columns: estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, by

As noted above, the philosophy is to do contrasts using the hypothesis argument. So one idiomatic way to do what you wanted is:

avg_predictions(mod, by = bydf, hypothesis = ~ sequential)
> 
>         Hypothesis Estimate Std. Error     z Pr(>|z|)    S 2.5 % 97.5 %
>  (large) - (small)    -8.87       1.08 -8.21   <0.001 52.0   -11  -6.75
> 
> Type:  response 
> Columns: hypothesis, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high

Finally, it is true that hypothesis can do everything at once. It’s a neat part of the design that this argument accepts arbitrary functions.
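To make that concrete, here is a minimal sketch (not shown in the thread, names are my own) of a single custom function that collapses the cyl groups and contrasts them in one step. It should reproduce the (large) - (small) estimate above:

collapse_and_contrast <- function(x) {
  # x is the data frame of unit-level predictions, which includes the cyl column
  data.frame(
    term = "(large) - (small)",
    estimate = mean(x$estimate[x$cyl == 8]) - mean(x$estimate[x$cyl %in% c(4, 6)])
  )
}
predictions(mod, hypothesis = collapse_and_contrast)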

But I think it’s useful to think about the operations separately, because any implementation of a new contrasting interface that goes solely through hypothesis will still have to combine the two steps: “aggregation” and “contrasting”. And I think that one of the big strengths of marginaleffects is that it clearly separates these conceptually distinct operations. This makes the user interface very composable and flexible.

What I want to avoid is having two completely parallel ways to approach the same two-step problem.

mattansb commented 3 days ago

This is interesting - for a single contrast that really gets us 90% there - we would just need a way to add interaction contrasts and conditional contrasts.

But contrasts often appear as families of contrasts (e.g., pairwise, consecutive, orthogonal...), which the by=bydf method (as far as I can tell) does not allow, right?

vincentarelbundock commented 3 days ago

Which the by=bydf method (as far as I can tell) does not allow, right?

This feature would be trivial to add. For example, this one-liner PR passes almost all existing tests, and allows this:

https://github.com/vincentarelbundock/marginaleffects/pull/1235

> mod <- lm(mpg ~ factor(cyl) * am, data = mtcars)
> bydf = tribble(
+     ~ cyl, ~ by,
+     4, "a",
+     6, "a",
+     8, "b",
+     4, "x",
+     6, "y",
+     8, "y"
+ )
> avg_predictions(mod, by = bydf)

 By Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
  a     24.0      0.715 33.5   <0.001 816.9  22.6   25.4
  x     26.7      0.914 29.2   <0.001 618.7  24.9   28.5
  y     16.6      0.662 25.2   <0.001 461.6  15.4   17.9
  b     15.1      0.810 18.6   <0.001 255.0  13.5   16.7

Type:  response
Columns: estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, by
mattansb commented 3 days ago

Yes, but that would require typing out that bydf for each contrast in the family. And it would be much more complex for interaction (crossed) contrasts, which involve multiple family-by-family weightings.

The docs do say that, for more complex aggregation:

...you can use the FUN argument of the hypotheses() function.

So I'm not sure that allowing for this overlap is inconsistent.

vincentarelbundock commented 3 days ago

Yes, but that would require typing out that bydf for each contrast in the family. And would be much more complex for interaction (crossed) contrasts that have multiple family-by-family weighting.

I'd certainly be interested in more ergonomic ways to specify the by argument.
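For instance, a hypothetical helper (just a sketch, not part of marginaleffects; the name is made up) could build the by data frame from a named list of level groupings:

collapse_by <- function(variable, groups) {
  # one row per level, labelled by the name of the bucket it belongs to
  rows <- lapply(names(groups), function(g) data.frame(value = groups[[g]], by = g))
  out <- do.call(rbind, rows)
  names(out)[1] <- variable
  out
}

# equivalent to the hand-written bydf above
avg_predictions(mod, by = collapse_by("cyl", list(small = c(4, 6), large = 8)))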

vincentarelbundock commented 3 days ago

BTW, I’m not sure if you were aware that we can use matrices in the hypothesis argument directly.

I could easily imagine someone creating a utility function to leverage that. Of course, one may need access to the data frame itself to know the correct indices, so perhaps you’re right and that has to happen inside the custom hypothesis function.

library(afex)
library(emmeans)
library(marginaleffects)

data(md_12.1)
A <- aov_ez("id", "rt", md_12.1, within = c("angle", "noise"))

h = matrix(c(
    -1,1,.5,-.5,.5,-.5,
    -1,2,.1,-.5,.8,-.5),
    ncol = 2
)
colnames(h) <- c("None vs. Some", "Something else")
predictions(A,
    hyp = h,
    newdata = datagrid(angle = unique, noise = unique)
)
> 
>            Term Estimate Std. Error     z Pr(>|z|)    S 2.5 % 97.5 %
>  None vs. Some      -162       16.2 -9.97   <0.001 75.4  -194   -130
>  Something else      284       38.1  7.47   <0.001 43.5   210    359
> 
> Type:  response 
> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
mattansb commented 2 days ago

Here are some more thoughts:

  1. Can we support a function / matrix passed in comparisons(variables = list(term = <thing>))? This would essentially solve both the conditional contrasts (by setting by=) and the interaction contrasts (by setting cross=TRUE).

Really, there is some overlap between the comparisons() function and some of the functionality offered by hypothesis=; maybe aligning those two could resolve this to some extent?


  2. An alternative would be to convert marginaleffects objects to emmGrid objects - this is surprisingly easy:
library(marginaleffects)
library(emmeans)
#> Welcome to emmeans.
#> Caution: You lose important information if you filter this package's results.
#> See '? untidy'

mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am)

mod <- lm(mpg ~ cyl * am, mtcars)

(p <- avg_predictions(mod, variables = c("cyl", "am")))
#> 
#>  am cyl Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
#>   1   6     20.6      1.751 11.75   <0.001 103.4  17.1   24.0
#>   1   4     28.1      1.072 26.19   <0.001 499.7  26.0   30.2
#>   1   8     15.4      2.144  7.18   <0.001  40.4  11.2   19.6
#>   0   6     19.1      1.516 12.61   <0.001 118.8  16.2   22.1
#>   0   4     22.9      1.751 13.08   <0.001 127.5  19.5   26.3
#>   0   8     15.1      0.875 17.19   <0.001 217.7  13.3   16.8
#> 
#> Type:  response 
#> Columns: cyl, am, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high

emmobj(coef(p), vcov(p), lapply(insight::get_datagrid(p), unique))
#>  cyl am estimate    SE df asymp.LCL asymp.UCL
#>  6   1      20.6 1.751 NA      17.1      24.0
#>  4   1      28.1 1.072 NA      26.0      30.2
#>  8   1      15.4 2.144 NA      11.2      19.6
#>  6   0      19.1 1.516 NA      16.2      22.1
#>  4   0      22.9 1.751 NA      19.5      26.3
#>  8   0      15.1 0.875 NA      13.3      16.8
#> 
#> Confidence level used: 0.95

(This can be expanded to also support estimates created from Bayesian models.)

This could be a feature for power users. You would then have a few tiers of contrast complexity:

  1. Standard contrasts / conditional / crossed contrasts with comparisons()
  2. Custom simple contrasts with hypothesis=<matrix>
    • Also allows for contrasting slopes
  3. Custom contrasts / conditional / crossed contrasts with as.emmGrid() but limited to differences contrasts
    • Also allows for contrasting slopes
  4. Custom contrasts / conditional / crossed contrasts with hypothesis=<function>, which really can do whatever: any type of contrast (ratio, lift, ...) with whatever transformation. Really, this is the most advanced option, for super power users, as it allows anything arbitrary.
vincentarelbundock commented 2 days ago

Can we support a function / matrix passed in comparisons(variables = list(term = <thing>))?

That’s an interesting idea, but I don't think it is possible to implement without fundamental (and enormous) changes to the current code architecture. My guess is that your mental model for comparisons() is something like:

  1. Use predictions()
  2. Make contrasts between the results in 1.

But behind the scenes, that's not at all what is going on. What comparisons() does is create two entirely different data frames, with different values of the focal variable. Then, it calls predict() on both data frames, and compares the two:

library(marginaleffects)

mod <- lm(mpg ~ hp + am, mtcars)

d1 <- data.frame(hp = 100, am = 0)
d2 <- data.frame(hp = 101, am = 1)
predict(mod, d2) - predict(mod, d1)
>        1 
> 5.218198

comparisons(mod,
    variables = c("hp", "am"),
    cross = TRUE,
    newdata = datagrid(hp = 100, am = 0))
> 
>  C: am C: hp  hp am Estimate Std. Error    z Pr(>|z|)    S 2.5 % 97.5 %
>  1 - 0    +1 100  0     5.22       1.08 4.83   <0.001 19.4   3.1   7.34
> 
> Type:  response 
> Columns: rowid, term, contrast_am, contrast_hp, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, hp, am, predicted_lo, predicted_hi, predicted, mpg

We are always doing a “counterfactual comparison” between two specific values for each focal predictor. Two specific values of am: 0 and 1. Two specific values of hp: 100 and 101.

We have to fill actual data frames with those specific values and feed them to predict() to get the output. We cannot insert both 100 and 101 into the same cell, because predict() won't give us a prediction.

comparisons() is designed for G-computation, that is, to estimate the effect of changes between observed levels of focal predictors. It cannot do weighted contrasts between marginal means. For that, you really need to go through predictions() and the hypothesis argument.

An alternative would be to convert marginaleffects objects to emmgrids - this is surprisingly easy

This is super cool! I had no idea.

But what is the value proposition here? If someone is deep enough into contrasts to know those more advanced emmeans commands, why wouldn’t they use emmeans from the start? What are the new features that they couldn’t get before?

mattansb commented 2 days ago

My mental model for comparisons() is exactly what you wrote - I still think it could work, though I'm not sure it would be worth the effort?

mtcars$cyl <- factor(mtcars$cyl)

mod <- glm(am ~ cyl + hp,
           family = binomial(),
           data = mtcars)

d1 <- data.frame(hp = 100:101, cyl = "4")
d2 <- data.frame(hp = 100:101, cyl = "6")
d3 <- data.frame(hp = 100:101, cyl = "8")

w <- c(-2, 1, 1)/2

cbind(predict(mod, d1, type = "response"), 
      predict(mod, d2, type = "response"), 
      predict(mod, d3, type = "response")) %*% w
#>         [,1]
#> 1 -0.7293757
#> 2 -0.7304380

But what is the value proposition here?

Well, {marginaleffects} supports more models (afaik), and produces the estimates (adjusted predictions, AMEs...) using a completely different approach.

vincentarelbundock commented 2 days ago

Aaah, I see, thanks for clarifying!

To summarize, here are the proposals I've seen in this thread, and my current assessment:

comparisons() accepts 3 values and a function in `variables`

The idea of allowing more than two values in the counterfactual comparisons is interesting, but I'm afraid it is not very realistic. The two-counterfactuals assumption is deeply embedded in the architecture design, and changing that would be a massive endeavour. Think: 100+ hours of work.

Unfortunately, I don't have the bandwidth for that, and I don't think anyone else currently knows the code base well enough to jump into it.

(I'd love to get more contributors! But starting on smaller projects makes more sense.)

by= ergonomics

Make it more ergonomic to collapse categories, without having to write data frames by hand.

I like this idea a lot and would love to see it. I don't currently have an idea for how to make this easier, but I'm happy to consider alternatives to data frames. Lists of lists feel even more complicated, but maybe there's a solution somewhere.

I think that's a very common use-case, so it would be worth it to invest some time in finding a solution here.

hypothesis= helper function

That's your original proposal. hypothesis accepts arbitrary functions, which is powerful but cumbersome. We could implement a helper function to make this easier.

Once upon a time, I tried to do exactly that. See here.

At first, this felt like a good idea, but I quickly hit a wall. There are so many challenges: (1) people's needs are infinite, so the helper needs to be super flexible; (2) automatically labelling contrasts is challenging; (3) finding a simple but ultra-flexible syntax is hard.

In the end, all the helper functions I wrote had tons of weird arguments. Learning all those arguments and the domain-specific syntax felt more complicated than simply defining a custom contrast with dplyr or base R functions.

So I gave up.

I'm not closed to the idea. But your contr() already has 4 arguments, and nothing to handle labels, weights, or cross-contrasts. We'll need to at least double the number of arguments, or really constrain the possibilities.

Maybe we can make that work, but I worry that we'll end up with either (a) a helper with so many arguments that it isn't more convenient than just writing a custom function, or (b) some magic-looking function call that can't be understood by anyone who reads the code, unless they've spent hours reading our documentation first.

To be clear, I'm not saying this is impossible. I was just unable to overcome the challenges when I tried.

Convert marginaleffects objects to emmeans

Frankly, I'm not yet convinced that there's that much added value here, because emmeans heroes should probably just keep using it (I don't want to convince anyone to shift!).

But if it's that easy, then maybe we can document this in a vignette. There's not much point in exporting a function that is just a one-liner, but showing this solution to users in documentation might be helpful.

Automatic df

The currently documented solution for this is to use the df argument:

library(marginaleffects)
mod <- lm(mpg ~ cyl * am, mtcars)
avg_predictions(mod, df = df.residual(mod))

I actually kind of like this. It makes clear exactly where the degrees of freedom come from. It does not require users to type a lot of characters.

Also, my own personal bias is that if the degrees of freedom correction makes a big difference for you, you have much bigger problems.