vincentarelbundock / marginaleffects

R package to compute and plot predictions, slopes, marginal means, and comparisons (contrasts, risk ratios, odds, etc.) for over 100 classes of statistical and ML models. Conduct linear and non-linear hypothesis tests, or equivalence tests. Calculate uncertainty estimates using the delta method, bootstrapping, or simulation-based inference
https://marginaleffects.com

Confidence intervals for simultaneous inference #1169

Open drs81 opened 1 month ago

drs81 commented 1 month ago

It would be great if support could be added for computing simultaneous confidence intervals, similar to the routines available in emmeans, rms, etc.

vincentarelbundock commented 1 month ago

Thanks for the suggestion. I agree that this would be an interesting addition.

For transparency, though, I must admit that I have never read very seriously in that field, so I am unlikely to implement this myself. If someone would like to work on this feature, I would be happy to review and advise on implementation.

inpowell commented 1 month ago

I would be interested in having a go at this over the next few weeks, particularly using the multcomp package. (I have seen at least two SO questions looking for this behaviour, and it's something I have wanted in my work as well.)

As a workaround in the meantime, I am reasonably sure that multcomp::glht(&lt;prediction-object&gt;) "just works", assuming there is no back-transformation (e.g. with invlink(link) or transform options). See the example below.

However, it would be really nice to have this functionality built into marginaleffects so that hypotheses with back-transformations also just work.

Heuristics

In a simple case, the signature of the glht function is

glht(model, linfct, 
    alternative = c("two.sided", "less", "greater"), 
    rhs = 0, ...)

Here, model can be any object with coef() and vcov() methods, linfct is an optional matrix specifying the linear hypothesis, alternative identifies the alternative hypothesis, and rhs identifies the right-hand side of the null hypothesis linfct %*% coef(model) == rhs. If linfct is not given, it defaults to the identity matrix and the null hypothesis becomes coef(model) == rhs, which is the case I am paying most attention to here.

In the case of the marginaleffects objects (predictions, comparisons, etc.) the coef() method returns the output of some (possibly non-linear) function (I'll call it $\mathbf{c} : \mathbb{R}^p \to \mathbb{R}^k$) of the underlying model parameters $\hat{\boldsymbol{\theta}}$, and vcov() returns the estimated covariance matrix for those estimates (i.e. J %*% V %*% t(J)), where J is the Jacobian of $\mathbf{c}$ and V is a sensible estimator of the covariance matrix of $\hat{\boldsymbol{\theta}}$ -- marginaleffects optionally lets this be a sandwich estimator with a single argument, which is really cool.
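That delta-method relationship can be checked numerically. This is a hedged sketch: it assumes the predictions object exposes its Jacobian as a "jacobian" attribute, which is not documented API and may vary across versions:

```r
library(marginaleffects)

mod <- lm(mpg ~ hp + wt, data = mtcars)
p <- predictions(mod, newdata = datagrid(hp = c(100, 200)))

# J is the Jacobian of the transformation c(theta); V is the covariance
# matrix of the underlying model parameters. The "jacobian" attribute
# name is an assumption, not documented API.
J <- attr(p, "jacobian")
V <- vcov(mod)
all.equal(unname(J %*% V %*% t(J)), unname(vcov(p)))
```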

The resulting glht object (using the default hypothesis) can be used to return global $F$- or $\chi^2$-tests (from summary()), adjusted or non-adjusted p-values for individual hypotheses (also from summary()), or adjusted or non-adjusted confidence intervals for each element of coef(model) (using confint()).
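For a concrete sense of that workflow, here is a minimal sketch with a plain lm() fit (assuming multcomp is installed; the identity linfct is spelled out explicitly rather than relying on the default):

```r
library(multcomp)

# Fit a simple linear model and test every coefficient simultaneously.
mod <- lm(mpg ~ hp + wt, data = mtcars)

K <- diag(length(coef(mod)))   # identity matrix: one hypothesis per coefficient
rownames(K) <- names(coef(mod))

g <- glht(mod, linfct = K)     # null hypothesis: coef(mod) == 0
summary(g, test = Ftest())     # global F-test of all hypotheses jointly
summary(g)                     # single-step adjusted p-values (the default)
confint(g)                     # simultaneous confidence intervals
```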

Example comparing manual multcomp::glht and marginaleffects::predictions + multcomp::glht

See the example below, taken from Hothorn et al (2024) Simultaneous Inference in General Parametric Models, multcomp package vignette section 6.3 https://cran.r-project.org/package=multcomp/vignettes/generalsiminf.pdf

library(multcomp)
library(marginaleffects)

data("alzheimer", package = "coin")
y <- factor(alzheimer$disease == "Alzheimer",
            labels = c("other", "Alzheimer"))

gmod <- glm(y ~ smoking * gender, data = alzheimer,
            family = binomial())
# summary(gmod)

Using multcomp alone:

# marginaleffects is helpful for getting our linfct matrix K here
fitdata <- datagrid(gender = unique, smoking = unique, model = gmod)

K <- model.matrix(delete.response(terms(gmod)), data = fitdata)
rownames(K) <- with(fitdata, interaction(smoking, gender, sep = ':'))

gmod_ci <- confint(glht(gmod, linfct = K))
gmod_ci$confint <- apply(gmod_ci$confint, 2, binomial()$linkinv)
plot(gmod_ci, xlab = "Probability of Developing Alzheimer (multcomp)",
     xlim = c(0, 1))

Using marginaleffects + multcomp:

pred <- predictions(
  gmod, datagrid(gender = unique, smoking = unique),
  type = 'link' # We will need to back-transform later
)
pred
#> 
#>  gender smoking Estimate Std. Error      z Pr(>|z|)    S  2.5 % 97.5 %
#>  Female   None    -0.394      0.136 -2.908  0.00364  8.1 -0.660 -0.129
#>  Female   <10     -0.357      0.493 -0.724  0.46921  1.1 -1.323  0.609
#>  Female   10-20   -1.006      0.302 -3.332  < 0.001 10.2 -1.597 -0.414
#>  Female   >20      0.154      0.321  0.480  0.63129  0.7 -0.475  0.784
#>  Male     None    -0.316      0.222 -1.421  0.15531  2.7 -0.751  0.120
#>  Male     <10      0.981      0.677  1.449  0.14740  2.8 -0.346  2.308
#>  Male     10-20   -0.956      0.304 -3.145  0.00166  9.2 -1.551 -0.360
#>  Male     >20     -2.037      0.434 -4.693  < 0.001 18.5 -2.888 -1.186
#> 
#> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, gender, smoking 
#> Type:  link

pred_to_glht <- glht(pred)
pred_ci <- confint(pred_to_glht)
pred_ci$confint <- apply(pred_ci$confint, 2, binomial()$linkinv)
plot(pred_ci, xlab = "Probability of Developing Alzheimer (marginaleffects + multcomp)",
     xlim = c(0, 1))


# Check -------------------------------------------------------------------

all.equal(gmod_ci$confint, pred_ci$confint, check.attributes = FALSE)
#> [1] TRUE

Created on 2024-08-02 with reprex v2.1.1

Implementation?

From a perusal of the source, it looks like the p-values and confidence intervals are generally calculated in get_ci() -- though it appears that some CIs and p-values might be calculated earlier (and I haven't worked out where yet).

My vague thought bubble of an idea for implementation would be to update the get_ci() function to get estimates and the covariance matrix (standard errors alone are not enough) from x, and change (say) the p_adjust argument to equivalent test and calpha arguments from multcomp::summary.glht and multcomp::confint.glht. Is there anything else I'd need to be aware of before treading that path?

vincentarelbundock commented 1 month ago

Thanks for the comments and proposal. I really appreciate that you took so much time to look at the code and understand the package.

I agree that this would be a cool feature to add.

I'm off to bed and haven't looked at this seriously, but at first glance what you write makes a lot of sense, and the things you describe about the package seem correct.

I believe that some of the conditions that would flip the p_overwrite and ci_overwrite flags to TRUE may never actually occur in reality. They may be leftover from previous infrastructure. I'll try to clean those up eventually.

A couple of situations where they may still be important are: (1) when using p_adjust, because we don't want to display unadjusted CIs alongside adjusted p-values, and (2) bootstrapping with inferences().

In terms of user interface, I'd like us to think about the pros and cons of different options before committing.

One strong preference I have is to limit the number of arguments in the core functions predictions(), comparisons(), and slopes(). I imagine that multcomp::glht accepts quite a few arguments to tweak behavior, such as the alpha level, the multiple comparison correction method, one- vs. two-sided tests, etc. I would not want to add all of those arguments to the core functions.

I'm not sure what the full set of alternatives is, but two options:

1) Use (a potentially renamed) p_adjust to adjust both p values and CIs. This would not be very flexible, but probably simple to implement and understand for users.

2) Move p_adjust out of the core functions, and into hypotheses(). As you can see in the docs, that function already has a couple arguments like joint for joint hypothesis tests. But frankly, the implementation for that is a bit sloppy, and the interface is terrible. There's lots of room for improvement here, and hypotheses() could be a nice single point of entry for these kinds of tests, with all the goodies users would expect.

If we go with option 2, this issue could be renamed: "Make hypotheses() powerful"

https://marginaleffects.com/man/hypotheses.html

inpowell commented 1 month ago

I have to admit, I'm not very familiar with hypotheses(). I just had a play around (following on from my examples above) and it seems to mostly align with glht for a $\chi^2$ test. Interestingly, glht didn't calculate an F statistic with this example -- I think it has something against GLMs for that.

hyp <- hypotheses(pred, joint = TRUE, joint_test = 'chisq')
hyp$p.value
#> [1] 2.416523e-09

glht_test <- summary(pred_to_glht, test = Chisqtest())
glht_test$test$pvalue
#>              [,1]
#> [1,] 2.416523e-09

all.equal(hyp$p.value, drop(glht_test$test$pvalue))
#> [1] TRUE

Aside: it didn't feel immediately clear to me from the documentation that I could just pass a predictions object into hypotheses, and what the default hypothesis = NULL means.

That being said, hypotheses() does seem like a sensible place to give a lot more flexibility for confidence intervals and p-values. So option 2 isn't bad, and as you point out it makes an excellent pun for the issue tracker. Regarding my use cases though, I would tend to prefer some "simple" functionality in predictions() and comparisons(), and for that I would prefer option (1).

Bearing in mind the desire to keep the additional argument count to core functions low or zero, I think the main options I would want in p_adjust are:

  1. Univariate/unadjusted p-values and CIs. This corresponds to p_adjust = NULL in marginaleffects, test = univariate() in multcomp::summary.glht, and calpha = univariate_calpha() in multcomp::confint.glht().

  2. Single-step adjusted p-values and CIs. This is the requested feature for marginaleffects, but corresponds to test = adjusted(type = 'single-step') and calpha = adjusted_calpha() in multcomp's summary and confint, respectively. This could maybe be selected with p_adjust = 'single-step'.

multcomp has some other p-adjustment methods, which I frankly don't know enough about (... and therefore don't use).

I can easily imagine how to implement a Bonferroni correction for both p-values and confidence intervals -- adjust the level to $1 - \alpha/k$ for both, and then use the univariate methods. On the other hand, the Holm method breaks straight away because it requires ordering the p-values first. A collection of "Holm confidence intervals" would have different (univariate) levels based on the p-value of each estimate (so they would also depend on the RHS of the hypothesis), which seems either circular or arbitrary to me. I'm not familiar enough with the Hochberg, Hommel, BH or BY adjustments to comment on them.

So, maybe we allow a third option for the new p_adjust:

  3. p_adjust = 'bonferroni', which reports p-values multiplied by the number of hypotheses $k$, and univariate confidence intervals at the $(1 - \alpha/k)$ level.
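A base-R sketch of what that option could compute (the function name and its exact output columns are hypothetical, not existing marginaleffects API):

```r
# Hypothetical helper: Bonferroni-adjusted p-values together with
# matching univariate CIs at the 1 - alpha/k level, as described above.
bonferroni_adjust <- function(estimate, std.error, p.value, alpha = 0.05) {
  k <- length(estimate)
  z <- qnorm(1 - alpha / (2 * k))     # two-sided critical value at level 1 - alpha/k
  data.frame(
    estimate  = estimate,
    p.adj     = pmin(p.value * k, 1), # Bonferroni: multiply by k, cap at 1
    conf.low  = estimate - z * std.error,
    conf.high = estimate + z * std.error
  )
}
```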

Regarding the naming of p_adjust: it might be good to softly/gradually rename to something like power_adjust to indicate it works on both confidence intervals and p-values?

vincentarelbundock commented 1 month ago

Would it make sense to onboard multcomp as an optional dependency? That library is battle-tested, and it would save us from re-implementing everything from scratch. Making the dependency optional means people don't have to install it if they don't care about multiple comparisons.

In that vision, the hypotheses() function becomes a front-end to multcomp, with some sugar to make common tests easier to execute, and returning objects of marginaleffects class (with pretty print, etc).

If that's a good option, we could proceed in two steps: (1) improve hypotheses() to support all the options we want, (2) decide if some of these options should live in the core functions.

It feels hard to make a call on 2 when we haven't done an exhaustive tour of 1. It might be that a piping workflow works fine and that we don't need any additional argument in the core functions. Or it might be that we do want some arguments there, in which case it is trivial to call hypotheses() at the very end of the function definition for predictions().

inpowell commented 1 month ago

Strengthening hypotheses() in line with option (1) seems like a good first step to me. I also like the idea of keeping multcomp as an optional dependency -- though that would mean we don't get to piggyback off it for univariate CIs and p-values (which, admittedly, there is already code for in this package).

I will have a look through hypotheses() to try and get a sense of how it works, and might start playing around with some code.

vincentarelbundock commented 1 month ago

Sounds great, thanks!

FYI, I'm taking a week off coding and open source, so I might not respond very quickly in the near future.

But I'll be back and active soon.

vincentarelbundock commented 1 week ago

OK, I played with this a bit tonight. I had never actually worked with the multcomp package. Unsurprisingly (given the authors and the history), it is awesome.

I’m almost thinking that we should deprecate the p_adjust argument in marginaleffects, and simply document how to use multcomp for this purpose. It is so seamless and easy that I’m not sure what building support directly into marginaleffects really brings to the table.

Consider:

library(multcomp)
library(marginaleffects)

mod <- lm(mpg ~ factor(carb) - 1, data = mtcars)
p <- avg_predictions(mod, by = "carb")

# simultaneous intervals
p |>
    glht() |>
    confint()
# 
#    Simultaneous Confidence Intervals
# 
# Fit: NULL
# 
# Quantile = 2.6313
# 95% family-wise confidence level
#  
# 
# Linear Hypotheses:
#         Estimate lwr     upr    
# b1 == 0 15.7900  11.7088 19.8712
# b2 == 0 25.3429  20.4649 30.2208
# b3 == 0 22.4000  18.3188 26.4812
# b4 == 0 16.3000   8.8488 23.7512
# b5 == 0 19.7000   6.7941 32.6059
# b6 == 0 15.0000   2.0941 27.9059

# Bonferroni correction for p values
p |>
    glht() |>
    summary(test = adjusted("bonferroni"))
# 
#    Simultaneous Tests for General Linear Hypotheses
# 
# Linear Hypotheses:
#         Estimate Std. Error z value Pr(>|z|)    
# b1 == 0   15.790      1.551  10.180  < 2e-16 ***
# b2 == 0   25.343      1.854  13.670  < 2e-16 ***
# b3 == 0   22.400      1.551  14.442  < 2e-16 ***
# b4 == 0   16.300      2.832   5.756 5.17e-08 ***
# b5 == 0   19.700      4.905   4.016 0.000354 ***
# b6 == 0   15.000      4.905   3.058 0.013359 *  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# (Adjusted p values reported -- bonferroni method)

# Holm correction for p values
p |>
    glht() |>
    summary(test = adjusted("holm"))
# 
#    Simultaneous Tests for General Linear Hypotheses
# 
# Linear Hypotheses:
#         Estimate Std. Error z value Pr(>|z|)    
# b1 == 0   15.790      1.551  10.180  < 2e-16 ***
# b2 == 0   25.343      1.854  13.670  < 2e-16 ***
# b3 == 0   22.400      1.551  14.442  < 2e-16 ***
# b4 == 0   16.300      2.832   5.756 2.58e-08 ***
# b5 == 0   19.700      4.905   4.016 0.000118 ***
# b6 == 0   15.000      4.905   3.058 0.002227 ** 
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# (Adjusted p values reported -- holm method)

inpowell commented 1 week ago

multcomp really is a very nice package!

The two features I would like integrated with marginaleffects as a user are:

  1. Support for back-transforming the simultaneous confidence intervals, e.g. when using invlink(link). This is a bit of a manual process with multcomp.
  2. The table style for marginaleffects is great - I like being able to see the values for the grid variables next to the estimates, while b1 == 0 is a little opaque.

Deprecating p_adjust seems sensible to me, and documenting the multcomp workflow would be a good next step while working out whether either of the above is worth implementing in e.g. hypotheses().

PS on my phone and can’t write code right now I’m afraid - preparing to move country and the computer is in pieces.