Question or maybe a bug regarding 'by'

vincentarelbundock / marginaleffects

R package to compute and plot predictions, slopes, marginal means, and comparisons (contrasts, risk ratios, odds, etc.) for over 100 classes of statistical and ML models. Conduct linear and non-linear hypothesis tests, or equivalence tests. Calculate uncertainty estimates using the delta method, bootstrapping, or simulation-based inference

https://marginaleffects.com

Other

454 stars 47 forks source link

Question or maybe a bug regarding 'by' #558

Closed andymilne closed 1 year ago

andymilne commented 1 year ago

Apologies if this is part question partly a possible bug report. This link https://vincentarelbundock.github.io/marginaleffects/articles/predictions.html#average-adjusted-predictions-by-group implies that the two methods shown (using the by argument or using dplyr's summarise) should produce the same means. In my example, it doesn't and I don't understand why. If you need replicable code, I can provide but maybe the example below is sufficient -- the model is a brms model with the beta_binomial family:

> prd_CERAD_0_1 <- 
+   predictions(
+     mdl_CERAD_0_1,
+     by = "BscClss",
+     newdata = datagridcf(BscClss = unique))
> summary(prd_CERAD_0_1)
  BscClss Predicted CI low CI high
1      QR     56.00  47.41   58.78
2      TK     56.73  49.40   59.03
3      NG     56.37  48.42   58.91

Model type:  brmsfit 
Prediction type:  response 
> 
> prd_CERAD_0_1 <-
+   predictions(
+     mdl_CERAD_0_1, 
+     variables = list(BscClss = unique))
> 
> prd_CERAD_0_1 |>
+   group_by(BscClss) |>
+   summarize(AAP = mean(predicted))
# A tibble: 3 × 2
  BscClss   AAP
  <fct>   <dbl>
1 QR       56.1
2 TK       56.8
3 NG       56.5

vincentarelbundock commented 1 year ago

My guess is that this has to do with the order of operations. Note that we take the mean within each draw, but then report the median of the posterior. Here's a couple simple equivalences for you to play with:

library(brms)
library(dplyr)
library(marginaleffects)

mod <- brm(mpg ~ hp + mo(cyl), data = mtcars)

# simplest case 
predictions(mod)$predicted

pre |>
    posteriordraws() |>
    group_by(rowid) |>
    summarize(predicted = median(draw)) |>
    pull(predicted)

# `by` case
predictions(mod, by = "cyl")$predicted

mod |>
    predictions() |>
    posteriordraws() |>
    group_by(drawid, cyl) |>
    summarize(predicted = mean(draw)) |>
    group_by(cyl) |>
    summarize(predicted = median(predicted)) |>
    pull(predicted)

andymilne commented 1 year ago

Ah yes, that seems to explain it. Thank you. Is there any way to force the summary() function to output the mean of the predictions/point estimates rather than their median? It seems unintuitive to me, at least, to have the mean applied to the draws but the median applied to the "predictions" (the means of the draws).

vincentarelbundock commented 1 year ago

To me, the logic is:

by always gives you the mean of something, unless you specify the byfun argument.
For bayesian analyses, we always report the median of the posterior distribution. My read is that this is most standard (and usually most appropriate) in the field.

So the defaults make sense to me. But of course, I understand that other people may have different needs, which is precisely why there is a posteriordraws() function. You can do stuff like:

pred <- predictions(mod)
draw <- posteriordraws(pred, "PxD")
rowMeans(collapse::fmean(draw, g = mtcars$cyl))

andymilne commented 1 year ago

I didn't know (1) byfun – that's useful. I don't agree with (2); for example, to my understanding, brms usually reports the mean of the posterior unless the robust = TRUE argument is used. Can I request an option be made available for either mean or median? – part of the reason I like to use marginaleffects is precisely to avoid having to deal with the potential errors I might (will) make by manually coding things like rowMeans(collapse::fmean(draw, g = mtcars$cyl)).

vincentarelbundock commented 1 year ago

Yeah, that sounds like a useful feature request. Thanks for pushing a bit.

I added a global option to change the methods to summarize posteriors: center and intervals. It is now documented in a separate section in all the function documentations, and also in the brms vignette. Using the Github dev version you can now do:

library(brms)
library(dplyr)
library(marginaleffects)
options(marginaleffects_posterior_center = mean)

dat <- mutate(mtcars, cyl = factor(cyl))
mod <- brm(mpg ~ hp + cyl, data = dat)

predictions(mod, newdata = datagridcf(cyl = unique), by = "cyl") |>
    tidy() |>
    pull(estimate)

    [1] 25.09969 19.12088 16.56856

predictions(mod, newdata = datagridcf(cyl = unique)) |> 
    posteriordraws() |>
    group_by(cyl) |>
    summarize(predicted = mean(draw)) |>
    pull(predicted)

    [1] 25.09969 19.12088 16.56856

vincentarelbundock commented 1 year ago

FYI, the documentation website is now updated. Two relevant links:

https://vincentarelbundock.github.io/marginaleffects/reference/predictions.html#bayesian-posterior-summaries

https://vincentarelbundock.github.io/marginaleffects/articles/brms.html#bayesian-estimates-and-credible-intervals

andymilne commented 1 year ago

This is excellent – many thanks for implementing this and it's great to have the ETI/HDI option for the credible intervals too.