Closed bwiernik closed 2 years ago
Glad to hear that!
This is already possible using the hypothesis
argument. See vignette here: https://vincentarelbundock.github.io/marginaleffects/articles/hypothesis.html
In the examples below, note that:
hypothesis
argument does not return intervals for bayesian models. I opened an issue to fix this: #446newdata="mean"
. This means that all variables are held at their means or modes. You can of course change that argument to suit your needs.Is this more or less what you had in mind?
Maybe we can think of a better interface to the hypothesis
argument for this common task?
library(MASS)
library(marginaleffects)
tmp <- tempfile()
download.file("https://github.com/vincentarelbundock/modelarchive/raw/main/data/brms_cumulative_random.rds", tmp)
mod <- readRDS(tmp)
# predictions for each level
predictions(mod, newdata = "mean")
#> rowid type group predicted conf.low conf.high treat period
#> 1 1 response 1 0.788658815 0.3890123064 0.97150120 0 0
#> 2 1 response 2 0.199259206 0.0269274956 0.54827775 0 0
#> 3 1 response 3 0.008050615 0.0007565728 0.05011209 0 0
#> 4 1 response 4 0.003144665 0.0002470459 0.02245343 0 0
#> subject
#> 1 1
#> 2 1
#> 3 1
#> 4 1
# mean predicted outcome for levels 2 and 4
predictions(
mod,
newdata = "mean",
hypothesis = c(0, .5, 0, .5))
#> rowid type term predicted treat period subject
#> 1 1 response custom 0.1012019 0 0 1
# same as above but with string formula
predictions(
mod,
newdata = "mean",
hypothesis = "(b2 + b4) / 2 = 0")
#> rowid type term predicted treat period subject
#> 1 1 response (b2+b4)/2=0 0.1012019 0 0 1
# Is the average of predictions in groups 1 and 2 different from the average of
# predictions in groups 3 and 4?
predictions(
mod,
newdata = "mean",
hypothesis = "(b1 + b2) / 2 = (b3 + b4) / 2")
#> rowid type term predicted treat period subject
#> 1 1 response (b1+b2)/2=(b3+b4)/2 0.4883614 0 0 1
I'm not sure what to do with hypothesis
to compute "p(4 or 5) in treatment group - p(4 or 5) in control group"
First we compute predicted outcome for each level, in each treatment group, just to know in which rows the interesting values are located:
predictions(mod, newdata = datagrid(treat = 0:1))
#> rowid type group predicted conf.low conf.high period
#> 1 1 response 1 0.788658815 3.890123e-01 0.971501197 0
#> 2 2 response 1 0.930939856 6.813507e-01 0.992919563 0
#> 3 1 response 2 0.199259206 2.692750e-02 0.548277747 0
#> 4 2 response 2 0.065748443 6.900328e-03 0.299353050 0
#> 5 1 response 3 0.008050615 7.565728e-04 0.050112092 0
#> 6 2 response 3 0.002287907 1.911668e-04 0.015888530 0
#> 7 1 response 4 0.003144665 2.470459e-04 0.022453426 0
#> 8 2 response 4 0.000883685 6.149795e-05 0.006967036 0
#> subject treat
#> 1 1 0
#> 2 1 1
#> 3 1 0
#> 4 1 1
#> 5 1 0
#> 6 1 1
#> 7 1 0
#> 8 1 1
Then, we add a hypothesis
:
predictions(
mod,
newdata = datagrid(treat = 0:1),
hypothesis = "mean(c(b6 + b8)) - mean(c(b5, b7)) = 0")
#> type term predicted
#> 1 response mean(c(b6+b8))-mean(c(b5,b7))=0 -0.002426048
The b*
shortcuts refer to row names in the data frame produced by predictions()
, when we don’t use hypothesis
.
I’ll think about a better interface and am open to suggestions (ideally expanding hypothesis
, but could introduce a new argument if absolutely necessary).
Oh, okay, so that's what the b*
parameters are.
I am frequently working with response variables with varying numbers of levels (2, 3, 5, 6, 7, etc.) and wanting to compute contrasts of 3–30 treatment groups against a control group. So, I guess I could compute a hypothesis matrix programattically with something like:
variable_levels <- 1:5
levels_to_keep <- 4:5
ntreat <- 15
hyp <- do.call(
"rbind",
lapply(variable_levels, \(i) if (i %in% keep_levels) rbind(-1, diag(ntreat)) else matrix(0, nrow = ntreat + 1, ncol = ntreat))
)
marginaleffects::predictions(model = mod, newdata = newdat, hypothesis = hyp)
Being able to specify that just by providing the levels_to_keep
information would be really helpful.
What do you think is the timeline on adding intervals for Bayesian models with hypothesis
?
Another thing that would be great is if the term
column in the resulting predictions data.frame would contain the column names from the hypothesis
matrix.
A cleaner alternative might be to combine the by
argument with hypothesis
. The by
argument allows us to compute the average prediction in subgroups of the data. For example:
predictions(mod, by = "treat")
#> type treat predicted
#> 1 response 0.5 0.2497124
#> 2 response -0.5 0.2478118
Then, we can simply do:
predictions(mod, by = "treat", hypothesis = "b1 = b2")
#> type term predicted
#> 1 response b1=b2 0.001900596
predictions(mod, by = "treat", hypothesis = "pairwise")
#> type term predicted
#> 1 response Row 2 - Row 1 -0.001900596
If we enrich the by
argument to allow different groupings over which we take the average, then the rest becomes trivial.
Maybe something like (pseudo-code):
predictions(
mod,
by = list(groupA = 1:10, groupB = 11:13),
hypothesis = "pairwise")
I’m not sure exactly how hard it will be to get the intervals, but if it’s not so bad I expect it to be a question of days rather than weeks.
One thing I am seeing currently is that the get_predict.brmsfit()
method does the summarization (median) in the first step (out <- apply(draws, c(2, 3), stats::median)
). That should be reserved until the final step of the computations.
Yes, but the draws are saved in an attribute called "posterior_draws", and that is what we almost always use to do the actual computations for bayesian models. The median is just used for one column of the predictions()
function.
Currently, in get_hypothesis()
, the computations are done directly on the predicted
column and the posterior_draws
attribute isn't used.
So it looks like something like this should work to get draws, median, and intervals for Bayesian models:
draws <- attr(x, "posterior_draws") |>
apply(2, \(x) x %*% hypothesis)
out <- draws |> apply(1, ggplot2::median_hilow) |> data.table::rbindlist()
Sounds good. Not sure if/how this should be integrated with get_ci()
, which allows us to use HDI our ETI
https://github.com/vincentarelbundock/marginaleffects/blob/main/R/get_ci.R#L60
yeah definitely, whatever approach you use to get the median and CI from draws. I just grabbed my go-to function when I was prototyping
draws <- attr(x, "posterior_draws") |>
apply(2, \(x) x %*% hypothesis)
out <- draws |> apply(1, ggplot2::median_hilow) |> data.table::rbindlist()
hypothesis names <- colnames(hypothesis)
if (is.null(hypothesis_names)) hypothesis_names <- "custom"
out <- get_ci_draws(x = data.frame(term = hypothesis_names), conf_level = .95, draw = draws, estimate = "predicted")
With current github main, we now get credible intervals and we can add column names to the hypothesis
matrix to automatically rename the tests.
library(marginaleffects)
tmp <- tempfile()
download.file("https://github.com/vincentarelbundock/modelarchive/raw/main/data/brms_cumulative_random.rds", tmp)
mod <- readRDS(tmp)
The next command gives medians of the posterior distribution of predictions – for each level of the response – when all covariates are held at their means, except the treatment variable treat
:
p <- predictions(
mod,
variables = list(treat = 0:1),
newdata = "mean")
p
#> rowid rowidcf type group predicted conf.low conf.high period subject treat
#> 1 1 1 response 1 0.788658815 3.890123e-01 0.971501197 0 1 0
#> 2 2 1 response 1 0.930939856 6.813507e-01 0.992919563 0 1 1
#> 3 1 1 response 2 0.199259206 2.692750e-02 0.548277747 0 1 0
#> 4 2 1 response 2 0.065748443 6.900328e-03 0.299353050 0 1 1
#> 5 1 1 response 3 0.008050615 7.565728e-04 0.050112092 0 1 0
#> 6 2 1 response 3 0.002287907 1.911668e-04 0.015888530 0 1 1
#> 7 1 1 response 4 0.003144665 2.470459e-04 0.022453426 0 1 0
#> 8 2 1 response 4 0.000883685 6.149795e-05 0.006967036 0 1 1
apply(attr(p, "posterior_draws"), 1, median)
#> [1] 0.788658815 0.930939856 0.199259206 0.065748443 0.008050615 0.002287907 0.003144665 0.000883685
The next command computes contrasts between the predictions above:
lincom <- matrix(
c(-1, 1, 0, 0, rep(0, 4),
0, 0, -1, 1, rep (0, 4)),
ncol = 2,
dimnames = list(NULL, c("Row 2 minus Row 1", "Row 4 minus Row 3")))
predictions(
mod,
hypothesis = lincom,
variables = list(treat = 0:1),
newdata = "mean")
#> rowid rowidcf type term predicted conf.low conf.high period subject treat
#> 1 1 1 response Row 2 minus Row 1 0.1376634 0.02029289 0.31812249 0 1 0
#> 2 2 1 response Row 4 minus Row 3 -0.1285724 -0.27812930 -0.01937696 0 1 1
Note that taking the median of the posterior of differences is very slightly different than taking the difference between the medians of posterior predictions. The order of operations matters a bit:
diff(apply(attr(p, "posterior_draws"), 1, median))[c(1, 3)]
#> [1] 0.1422810 -0.1335108
Also note that you will probably want to normalize the +1s and +1s in your hypothesis matrix if you are collapsing several estimates, since the hypothesis
weights are used to compute a simple linear combination of predictions, and:
1:4 %*% c(-1, -1, 1, 1)
#> [,1]
#> [1,] 4
1:4 %*% c(-0.5, -0.5, .5, .5)
#> [,1]
#> [1,] 2
Finally, collapsing different treatment levels is now super easy, but I have not yet found an element way to collapse reponse levels. I’ll keep thinking about it.
Awesome!
Maybe a groups
argument that takes a hypothesis matrix for the response levels? So I can do c(0, 0, 0, 1, 1)
so that I can do "add up groups 4 and 5"?
Or this could even be accepted by the existing transformation
argument?
The by
argument can already compute the average of predicted outcomes by subgroups of predictors. For example:
library(MASS)
library(marginaleffects)
mod <- polr(factor(gear) ~ hp + factor(cyl), mtcars)
predictions(mod, by = "cyl", type = "probs")
#>
#> Re-fitting to get Hessian
#> type cyl predicted std.error
#> 1 probs 6 0.3333333 3.810461e-12
#> 2 probs 4 0.3333333 1.056994e-12
#> 3 probs 8 0.3333333 1.251927e-12
So this is easy to do for predictors. For response levels, one intriguing possibility would be to pass a data.frame with a special column name to the by
argument. Then we could do (pseudo-code):
by <- data.frame(
group = c(4, 6, 8),
by = c("Group A", "Group A", "Group B"))
predictions(
mod,
by = by,
hypothesis = "reference")
I don't want the average of subgroups of predictors, I want the sum of two levels of the response. group
is the name you used for the column indicating the category in an ordinal or categorical response model.
My goal is to compute the combined percentage of responding eg either Agree or Strongly Agree on a Likert scale
Yes, I understand. In my last pseudo-code example, you would collapse the response levels 4 and 6 (group A), and compare them to response level 8 (group B).
predictions(
mod,
by = data.frame(group = rep(4:5, 11), by = rep(0:11, each = 2))
hypothesis = "reference")
That would compare 4+5 for each group 1-10 against control (0). Works for me
This should now work for both frequentist and bayesian models. Let me know if you run into issues when you try it.
library(nnet)
library(marginaleffects)
mod <- multinom(factor(gear) ~ mpg + am * vs, data = mtcars, trace = FALSE)
predictions(mod, type = "probs") |> head()
#> rowid type group predicted std.error gear mpg am vs
#> 1 1 probs 3 3.623918e-05 2.002490e-03 4 21.0 1 0
#> 2 2 probs 3 3.623918e-05 2.002490e-03 4 21.0 1 0
#> 3 3 probs 3 9.347603e-08 6.911938e-06 4 22.8 1 1
#> 4 4 probs 3 4.044657e-01 1.965452e-01 3 21.4 0 1
#> 5 5 probs 3 9.999714e-01 1.246217e-03 3 18.7 0 0
#> 6 6 probs 3 5.183336e-01 2.898025e-01 3 18.1 0 1
predictions(mod, type = "probs", by = "group")
#> type group predicted std.error
#> 1 probs 3 0.4687522 0.04042542
#> 2 probs 4 0.3750010 0.06141754
#> 3 probs 5 0.1562469 0.04624265
by <- data.frame(
group = c("3", "4", "5"),
by = c("3,4", "3,4", "5"))
predictions(mod, type = "probs", by = by)
#> type predicted std.error by
#> 1 probs 0.4218766 0.02312133 3,4
#> 2 probs 0.1562469 0.04624265 5
predictions(mod, type = "probs", by = by, hypothesis = "sequential")
#> type term predicted std.error
#> 1 probs 5 - 3,4 -0.2656297 0.06936398
@vincentarelbundock You've got a browser()
left in average_draws() on master.
ooops, sorry about that. should be fixed now. I added a test to catch this.
@vincentarelbundock Sorry if I'm being thick, but I'm not sure I understand the by
syntax. Could you show me an example with this model for differences in "proportion of gear = 4/5
" for cyl = 6
vs cyl = 4
and cyl = 8
vs cyl = 4
?
library(nnet)
library(marginaleffects)
mod <- multinom(factor(gear) ~ mpg + factor(cyl), data = mtcars, trace = FALSE)
@bwiernik I’m not sure I understand exactly what you mean. But here are a few options using the very latest version from Github:
library(nnet)
library(marginaleffects)
nom <- multinom(factor(gear) ~ mpg + factor(cyl), data = mtcars, trace = FALSE)
predictions(nom, by = c("group", "cyl"), hypothesis = "sequential")
#> type term predicted std.error
#> 1 probs 3,4 - 3,6 -0.19430915 0.1890545
#> 2 probs 3,8 - 3,4 0.76527848 0.1248376
#> 3 probs 4,6 - 3,8 -0.28557415 0.2075463
#> 4 probs 4,4 - 4,6 0.15566583 0.2292554
#> 5 probs 4,8 - 4,4 -0.72669195 0.1338590
#> 6 probs 5,6 - 4,8 0.14263407 0.1321941
#> 7 probs 5,4 - 5,6 0.03864332 0.1757888
#> 8 probs 5,8 - 5,4 -0.03858654 0.1478195
by <- expand.grid(
group = 3:5,
cyl = c(4, 6, 8),
stringsAsFactors = FALSE) |>
# define labels
transform(by = ifelse(
group %in% 3:4,
sprintf("3/4 Gears & %s Cylinders", cyl),
sprintf("5 Gears & %s Cylinders", cyl)))
predictions(nom, by = by)
#> type predicted std.error by
#> 1 probs 0.4285648 0.06606515 3/4 Gears & 6 Cylinders
#> 2 probs 0.4092432 0.05797517 3/4 Gears & 4 Cylinders
#> 3 probs 0.4285364 0.04584405 3/4 Gears & 8 Cylinders
#> 4 probs 0.1428704 0.13213030 5 Gears & 6 Cylinders
#> 5 probs 0.1815137 0.11595035 5 Gears & 4 Cylinders
#> 6 probs 0.1429271 0.09168810 5 Gears & 8 Cylinders
predictions(nom, by = by, hypothesis = "pairwise")
#> type term predicted
#> 1 probs 3/4 Gears & 4 Cylinders - 3/4 Gears & 6 Cylinders -1.932166e-02
#> 2 probs 3/4 Gears & 8 Cylinders - 3/4 Gears & 6 Cylinders -2.839251e-05
#> 3 probs 5 Gears & 6 Cylinders - 3/4 Gears & 6 Cylinders -2.856945e-01
#> 4 probs 5 Gears & 4 Cylinders - 3/4 Gears & 6 Cylinders -2.470511e-01
#> 5 probs 5 Gears & 8 Cylinders - 3/4 Gears & 6 Cylinders -2.856377e-01
#> 6 probs 3/4 Gears & 8 Cylinders - 3/4 Gears & 4 Cylinders 1.929327e-02
#> 7 probs 5 Gears & 6 Cylinders - 3/4 Gears & 4 Cylinders -2.663728e-01
#> 8 probs 5 Gears & 4 Cylinders - 3/4 Gears & 4 Cylinders -2.277295e-01
#> 9 probs 5 Gears & 8 Cylinders - 3/4 Gears & 4 Cylinders -2.663160e-01
#> 10 probs 5 Gears & 6 Cylinders - 3/4 Gears & 8 Cylinders -2.856661e-01
#> 11 probs 5 Gears & 4 Cylinders - 3/4 Gears & 8 Cylinders -2.470227e-01
#> 12 probs 5 Gears & 8 Cylinders - 3/4 Gears & 8 Cylinders -2.856093e-01
#> 13 probs 5 Gears & 4 Cylinders - 5 Gears & 6 Cylinders 3.864332e-02
#> 14 probs 5 Gears & 8 Cylinders - 5 Gears & 6 Cylinders 5.678502e-05
Ah, thanks, I see what the by
column is for now!
One thing I notice is that if there are rows in the predictions that don't belong to any row of by
, they are returned collapsed into a by = NA
row. That should probably be filtered out.
library(nnet)
library(marginaleffects)
nom <- multinom(factor(gear) ~ mpg + factor(cyl), data = mtcars, trace = FALSE)
predictions(nom, by = c("group", "cyl"), hypothesis = "sequential")
#> type term predicted std.error statistic p.value conf.low
#> 1 probs 3,4 - 3,6 -0.19430915 0.1890545 -1.0277945 3.040465e-01 -0.5648491
#> 2 probs 3,8 - 3,4 0.76527848 0.1248376 6.1301914 8.777340e-10 0.5206013
#> 3 probs 4,6 - 3,8 -0.28557415 0.2075463 -1.3759540 1.688359e-01 -0.6923574
#> 4 probs 4,4 - 4,6 0.15566583 0.2292554 0.6790061 4.971340e-01 -0.2936666
#> 5 probs 4,8 - 4,4 -0.72669195 0.1338590 -5.4287848 5.673903e-08 -0.9890509
#> 6 probs 5,6 - 4,8 0.14263407 0.1321941 1.0789742 2.805992e-01 -0.1164617
#> 7 probs 5,4 - 5,6 0.03864332 0.1757888 0.2198282 8.260050e-01 -0.3058963
#> 8 probs 5,8 - 5,4 -0.03858654 0.1478195 -0.2610381 7.940631e-01 -0.3283075
#> conf.high
#> 1 0.1762308
#> 2 1.0099557
#> 3 0.1212091
#> 4 0.6049982
#> 5 -0.4643330
#> 6 0.4017298
#> 7 0.3831830
#> 8 0.2511344
by <- expand.grid(
group = 3:5,
cyl = c(4, 6, 8),
stringsAsFactors = FALSE) |>
# define labels
transform(by = ifelse(
group %in% 3:4,
sprintf("3/4 Gears & %s Cylinders", cyl),
sprintf("5 Gears & %s Cylinders", cyl))) |>
subset(group %in% 3:4)
predictions(nom, by = by)
#> type predicted std.error statistic p.value conf.low conf.high
#> 1 probs 0.4285648 0.06606515 6.487003 8.756078e-11 0.2990795 0.5580501
#> 2 probs 0.4092432 0.05797517 7.058938 1.677800e-12 0.2956139 0.5228724
#> 3 probs 0.4285364 0.04584405 9.347700 8.957431e-21 0.3386837 0.5183891
#> 4 probs 0.1561788 0.06350874 2.459171 1.392583e-02 0.0317040 0.2806537
#> by
#> 1 3/4 Gears & 6 Cylinders
#> 2 3/4 Gears & 4 Cylinders
#> 3 3/4 Gears & 8 Cylinders
#> 4 <NA>
predictions(nom, by = by, hypothesis = "pairwise")
#> type term predicted
#> 1 probs 3/4 Gears & 6 Cylinders - 3/4 Gears & 4 Cylinders 1.932166e-02
#> 2 probs 3/4 Gears & 6 Cylinders - 3/4 Gears & 8 Cylinders 2.839251e-05
#> 3 probs 3/4 Gears & 6 Cylinders - NA 2.723860e-01
#> 4 probs 3/4 Gears & 4 Cylinders - 3/4 Gears & 8 Cylinders -1.929327e-02
#> 5 probs 3/4 Gears & 4 Cylinders - NA 2.530643e-01
#> 6 probs 3/4 Gears & 8 Cylinders - NA 2.723576e-01
#> std.error statistic p.value conf.low conf.high
#> 1 0.08789439 0.2198281561 0.826004991 -0.15294817 0.1915915
#> 2 0.08041288 0.0003530841 0.999718280 -0.15757795 0.1576347
#> 3 0.11053165 2.4643255884 0.013727135 0.05574792 0.4890240
#> 4 0.07390977 -0.2610381284 0.794063104 -0.16415376 0.1255672
#> 5 0.10961879 2.3085850516 0.020966618 0.03821542 0.4679132
#> 6 0.09906106 2.7493910457 0.005970611 0.07820148 0.4665137
Created on 2022-09-07 by the reprex package (v2.0.1)
Hmm, let me think about this. My gut reaction is that it's easy enough for the user to filter out, and that there may be some diagnostic value in leaving this in: in some cases, users may not realize that their by
data frame is incomplete.
That's fair enough, just wanted to be sure it was an intentional choice
I didn't say it was intentional ;)
One place where post-filtering becomes difficult is when hypothesis
is given. Then na.omit()
or friends won't work. Perhaps an argument to remove by = NA
from the predictions before the hypothesis test?
Opened a separate issue to think about this specific problem: https://github.com/vincentarelbundock/marginaleffects/issues/481
Hold on, by
still isn't doing what I am looking for.
I want to sum across the levels of group
, not average across them. Above, for each value of cyl
, the predictions for group = 3
and group = 4
are being averaged. I want them to be summed.
So the expected results should have predicted values:
# A tibble: 3 × 2
cyl predicted
<fct> <dbl>
1 4 0.867
2 6 0.852
3 8 0.660
The context here is that I want predicted
to mean prob(y = Agree OR y = StronglyAgree)
, so predicted = prob(y = Agree) + prob(y = StronglyAgree)
.
I'm starting to use your package a lot and loving it!
One thing I would find extremely useful would be to be able to specify specific categories/groups in categorical/ordinal/multinomial models, as well as to be able to collapse the selected categories.
For example, let's say I've got a cumulative ordinal model where the response is numbered 1:5 for Strongly Disagree to Strongly Agree. I very often was predictions or contrasts for this response for "Answered 4 or 5" (Agree or Strongly Agree). Currently, I compute these manually by:
It would be really awesome if I could specify something like
comparisons(..., categories = c(4, 5), collapse_categories = TRUE)
to have the filtering and collapsing done before the contrasts are computed