vincentarelbundock commented 3 years ago

One of the challenges in statistics (for me) is that different fields often use different words for the same things, or the same words for different things. I'd like to get comfortable with the ggeffects package, and figure out specifically what "marginal effect" means in this context.

The README points to this definition: https://stats.stackexchange.com/tags/marginal-effect/info

This link suggests that the marginal effect is the partial derivative of the conditional expectation function. That description is pretty clear, and it matches what I typically understand to be a "marginal effect." It's consistent with the methodological literatures that I am more familiar with (econ and political science).

However, I still can't figure out if this is actually the quantity that ggeffects estimates. My understanding is that if you want a general mechanism to estimate the partial, there is little choice but to use automatic differentiation. This is what the margins package does, as explained in the technical vignette:

https://cran.r-project.org/web/packages/margins/vignettes/TechnicalDetails.pdf

But I see no automatic differentiation mechanism in the ggeffects repo. Instead, it seems like the package calculates something like "marginal means" or "adjusted predictions", rather than actual marginal effects.

In Stata, for example, you get the adjusted predictions by calling the margins method. But to obtain marginal effects, you must call margin with the dydx(regressor_name) argument. See docs:

https://www.stata.com/features/overview/marginal-analysis/

Am I right to think that ggeffects does not actually compute "marginal effects" as I understand the term?

strengejacke commented 3 years ago

Am I right to think that ggeffects does not actually compute "marginal effects" as I understand the term?

Yes, and you can see this from the readme and in other places in the documentation:

ggeffects is a light-weight package that aims at easily calculating marginal effects (or: estimated marginal means) at the mean or at representative values

I started with using the term "marginal effects", but due to the many different meanings across different fields, I started to be more precise and use "marginal means" etc.

Now we have a new problem, namely one that was also the reason for brms to rename marginal_effects() to conditional_effects(). While ggemmeans() and ggeffect() calculate marginal means / predictions (marginalized over the levels of the population / sample), ggpredict() uses predict() and calculates conditional means / predictions (hold constant at specific levels, not marginalized over all levels).

Thus, I still didn't decide on a final wording, but I have started the "transition" away from "marginal effects" towards... whatever. "marginal means" also has the shortcoming that for many models, you don't predict "means", but also probabilities etc.

I'm happy for suggestions regarding the wording! Btw, I never get used to Stata's "margin" command, and I find this kind of "(average) marginal effect" rather unintuitive, in particular when it comes to interaction terms. But that's more a personal thing here ;-)

vincentarelbundock commented 3 years ago

Cool cool, got it.

Yeah, it feels like a move away from the "marginal effects" language might be warranted, then. And unless I'm mixing things up, the Stats Exchange link highlights the wrong quantity, since it refers to the partial derivative, which is interpreted as a change in the function, not as the level of the conditional mean or adjusted prediction.

BTW, I've been hanging on to some rough code to produce ggplot2 representations of the output of the margins package, which calculates "true" marginal effects using auto-diff. This was designed as a drop-in replacement for the cplot function in that package, but my PR sat in the maintainer's repo for >1year, so I just spun it as a separate package.

Would require a fair amount of work, but it might be a nice addition:

https://github.com/vincentarelbundock/marginsplot

strengejacke commented 3 years ago

I will continue to adopt proper wording where necessary.

but my PR sat in the maintainer's repo for >1year, so I just spun it as a separate package.

Yeah, I think Thomas wanted to stop / reduce maintaining packages and some time ago asked for people who would like to take over the development of his packages.

Would require a fair amount of work, but it might be a nice addition:

I'll take a closer look at it. Regarding one of the plots (https://github.com/vincentarelbundock/marginsplot/blob/master/README_files/figure-gfm/cplot1-1.png), I think that plot has the same "error" like in Stata / margins, namely: https://strengejacke.github.io/ggeffects/articles/technical_stata.html

strengejacke commented 3 years ago

I have drafted a vignette, if you like, you may take a look at it and give some feedback here (https://strengejacke.github.io/ggeffects/articles/introduction_marginal_effects.html).

vincentarelbundock commented 3 years ago

The vignette looks fantastic! Awesome work!

I read it while eating some maple sugar, and I just have a few minor comments.

Title: Missing "R" in "Intoduction" and "to" instead of "into"

I'm not sure this sentence is correct:

"In essence, what ggpredict() returns, are not average marginal effects, but rather the marginal effects at different values of x."

I think of "marginal effect" as a difference between predicted values for (infinitesimal changes) in a predictor. In other words, the predicted values themselves are not marginal effects, unless we subtract them from one another. ggpredict seems to produce a table of predictions, not a table of differences.

ggpredict helps answer the question: what do I expect the outcome to be if X=1?

marginal effect helps answer the question: what is the effect of a 1 unit change in X on Y?

FWIW, I personally find it useful to define "average marginal effect" in more algorithmic/procedural terms.

1) Because of the non-linear link function, the marginal effect is different for each combination of predictor values (i.e., it varies from individual to individual. 2) Calculate the marginal effect of X on Y for each of the observed individuals of the dataset. 3) The mean of those individual marginal effects over the whole dataset is the "average" marginal effect.

It is useful to contrast this quantity to a related one: The Marginal Effect at the Mean. For this one we:

1) Calculate the mean of all the predictors in our dataset. 2) Calculate the marginal effect when all the predictors are held at their mean.

The two quantities can sometimes produce different results.

strengejacke commented 3 years ago

Thanks a lot for your feedback. I tried to include it into the vignette and revised several parts of it (currently pkgdown action is working, so should be up in a few minutes...).

I'm thinking about replacing "marginal effects" by a different term throughout the package. I think one lead distinctive feature is that "marginal effects" or "average marginal effects" are actually just one value, a kind of "adjusted coefficient". While ggeffects usually returns different (i.e. more than one) predictions (predicted averages of the outcome for different values of the predictor). In this sense, it would not be appropriate to talk about marginal effects at all, and I'm not sure if - what brms did recently when renaming marginal_effects() into conditional_effects() - using conditional effects would solve this issue (unless you say: marginal means marginalizing over the sample, resulting in one coefficient, while conditional means conditioning an different value of x). Maybe adjusted predictions is indeed the most general term here.

strengejacke commented 3 years ago

Thanks for your input here, this really helped me understand the meaning of "marginal effects". But still, I think I will not get used to the concept. Is the average marginal effects for a predictor a kind of "adjusted" coefficient? And if so, what would you prefer to show / interpret in logistic regression models: odds ratios or average marginal effects (and why)? E.g. following example, marginal_coefficients() from GLMMadaptive differs from what margins::margins() returns, probably these are not the same...

library(GLMMadaptive)
library(lme4)
library(margins)
data(Salamanders, package = "glmmTMB")

m <- mixed_model(mined ~ DOP + cover * spp ,
                 random = ~ 1|site,
                 family = binomial(),
                 data = Salamanders,
                 control = list(iter_EM = 0, nAGQ = 1))

m2 <- glmer(mined ~ DOP + cover * spp + (1|site), family = binomial(),
            data = Salamanders, control = glmerControl(nAGQ = 1))
dydx <- margins(m2, type = "link")
me <- sapply(dydx[grepl("^dydx_", colnames(dydx))], mean)

me <- data.frame(term = gsub("^dydx_(.*)", "\\1", names(me)), 
                 marg_eff = unname(me))

d <- data.frame(term = names(fixef(m)), 
                raw = fixef(m), 
                marg_coef = unlist(marginal_coefs(m)))

# coefficients and marginal effects on logit-scale
out <- merge(d, me, all = TRUE)
out
#>              term         raw  marg_coef     marg_eff
#> 1     (Intercept) 0.388314671 0.19281720           NA
#> 2           cover 3.389725658 1.49057395 19.441577541
#> 3  cover:sppDES-L 0.029829942 0.01497951           NA
#> 4     cover:sppDF 0.029829942 0.01497951           NA
#> 5     cover:sppDM 0.029829942 0.01497951           NA
#> 6   cover:sppEC-A 0.029829942 0.01497951           NA
#> 7   cover:sppEC-L 0.029829942 0.01497951           NA
#> 8     cover:sppPR 0.029829942 0.01497951           NA
#> 9             DOP 0.166579203 0.06782527  0.036891660
#> 10       sppDES-L 0.007208649 0.00371222 -0.008552426
#> 11          sppDF 0.007208649 0.00371222  0.074791953
#> 12          sppDM 0.007208649 0.00371222  0.085239315
#> 13        sppEC-A 0.007208649 0.00371222  0.060541670
#> 14        sppEC-L 0.007208649 0.00371222  0.015621926
#> 15          sppPR 0.007208649 0.00371222 -0.008672388

# coefficients and marginal effects on OR-scale
out[2:4] <- lapply(out[2:4], exp)
out
#>              term       raw marg_coef     marg_eff
#> 1     (Intercept)  1.474494  1.212661           NA
#> 2           cover 29.657815  4.439643 2.775683e+08
#> 3  cover:sppDES-L  1.030279  1.015092           NA
#> 4     cover:sppDF  1.030279  1.015092           NA
#> 5     cover:sppDM  1.030279  1.015092           NA
#> 6   cover:sppEC-A  1.030279  1.015092           NA
#> 7   cover:sppEC-L  1.030279  1.015092           NA
#> 8     cover:sppPR  1.030279  1.015092           NA
#> 9             DOP  1.181257  1.070178 1.037581e+00
#> 10       sppDES-L  1.007235  1.003719 9.914840e-01
#> 11          sppDF  1.007235  1.003719 1.077660e+00
#> 12          sppDM  1.007235  1.003719 1.088978e+00
#> 13        sppEC-A  1.007235  1.003719 1.062412e+00
#> 14        sppEC-L  1.007235  1.003719 1.015745e+00
#> 15          sppPR  1.007235  1.003719 9.913651e-01

^{Created on 2021-04-12 by the reprex package (v2.0.0)}

vincentarelbundock commented 3 years ago

Cool cool, I think our positions are converging. I just published an open access textbook (in French), and in it I use a simple example to distinguish the three quantities.

We model the probability of survival of passengers aboard the Titanic using logistic regression and 3 predictors: gender, age, and class of the travel cabin. I will use the margins and prediction packages, because those are the ones I'm used to working with. My substantive goal is to estimate the effect of buying a first class ticket on my chances of survival. Do I have better chances to live if I pay for a 1st class ticket than if I stay in 3rd?

Let's download data and create a factor variable to ensure that 3rd class is the reference category:

library(margins)
library(prediction)

dat = read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/carData/TitanicSurvival.csv")
dat$survived = ifelse(dat$survived == "yes", 1, 0)
dat$class = factor(dat$passengerClass, c("3rd", "2nd", "1st"))
mod = glm(survived ~ sex + age + class, data = dat, family = binomial)

Adjusted predictions

prediction(mod, at = list(age = 25, sex = c("male", "female"), class = c("3rd", "1st")))
#> Data frame with 4184 predictions from
#>  glm(formula = survived ~ sex + age + class, family = binomial, 
#>     data = dat)
#> with average predictions:
#>  age    sex class      x
#>   25   male   3rd 0.1067
#>   25 female   3rd 0.5921
#>   25   male   1st 0.5410
#>   25 female   1st 0.9348

These "adjusted predictions" tell us that, on average, the probability of survival for a 25 y.o. man in 3rd class is 10.67%, the probability of survival for a 25 y.o. man in 1st is 54.10%, etc.

Marginal effects

margins(mod, variables = "class",
        at = list(age = 25, sex = "male"))
#> Average marginal effects at specified values
#> glm(formula = survived ~ sex + age + class, family = binomial,     data = dat)
#>  at(age) at(sex) class2nd class1st
#>       25    male   0.1401   0.4343

This "marginal effect" tells us that, on average, a 25 y.o. man who travels in 3rd class (the reference category) would have had a probability of survival 43.43 percentage points higher if he had bought a 1st class ticket instead.

Note that this marginal effect is exactly equal to the difference in probabilities we estimated above:

0.5410 - 0.1067
#> [1] 0.4343

Also note that if the variable of interest had been continuous, we wouldn't take a difference between levels like this. Instead, the estimated marginal effect would correspond to the partial derivative. But the logic is the same: marginal effects measure the association between a change in the predictors and a change in the outcome. It is an effect, not a prediction. It is a change, not a level.

In my own research, marginal effects are usually the main quantity of interest, and I find them very easy to reason about. I am interested in the Effect of buying a 1st class ticket. I am interested in the Effect of a political candidate's gender on the number of votes she receives. Those quantities are all marginal effects.

Odds ratios

Odds ratios are well-known.

exp(coef(mod))
#> (Intercept)     sexmale         age    class2nd    class1st 
#>  3.42949649  0.08226211  0.96619149  2.74310590  9.87158626

The odds of survival in 1st class are 10 times greater than the odds of survival in 3rd. The odds ratio answers a very similar scientific query as the marginal effects: is my outcome greater in 1st or in 3rd class? In that sense, I would put odds ratios and marginal effects on one side, and adjusted predictions on the other. However, I personally find odds ratios much harder to reason about than marginal effects.

In this case, the probabilities are very natural and easy to think about: buying a 1st class ticket increases my probability of surival by 43 percentage points. If I care about the "effect" of ticket class on survival, that's a straightforward answer. That's a straightforward probability.

In contrast, an "odd" is a transformation of the probability. And the odds ratio is a ratio of two transformed probability. So, for me, the intuition is much less clear. (The advantage of OR, of course, is it doesn't depend on the values of covariates like marginal effects.)

strengejacke commented 3 years ago

Cool cool, I think our positions are converging. I just published an open access textbook (in French), and in it I use a simple example to distinguish the three quantities.

Tagging @DominiqueMakowski - please translate the French book into English ;-)

DominiqueMakowski commented 3 years ago

I just published an open access textbook (in French)

It's awesome that the book is open-access ☺️ (also interesting thread thanks for tagging me in 😁)

strengejacke commented 3 years ago

@vincentarelbundock That was also a nice example to demonstrate the differences. Yet, I personally still think I don't like "(average) marginal effects" over "adjusted predictions".

for linear models, the AME = coefficient (no benefit)
for logistic regression with categorical predictor, the ME (also AME?) is the difference between two levels (you get the same with adjusted predictions, just need to do the math for subtraction)
for continuous predictors, esp. non-linear relationships, e.g. quadratic terms, the AME is the average slope across all values of that quadratic term - but what is that actually? Here I'd say that adjusted predictions (in particular graphing them) are much more intuitive.

Of course, my opinion might be biased because I usually associate AME with econometrics, and econometrics also believes that you should use FE regression instead of mixed models for longitudinal data or when group effects and fixed predictors correlate - although mixed models are actually much better. ;-)

vincentarelbundock commented 3 years ago

Ha! So we're down to "guilt by association"! 🤣

for continuous predictors, esp. non-linear relationships, e.g. quadratic terms, the AME is the average slope across all values of that quadratic term - but what is that actually? Here I'd say that adjusted predictions (in particular graphing them) are much more intuitive.

In that case, I think that standard practice is to plot a curve with "Marginal effect of X" on the y-axis and "X" on the x-axis. That curve tells us how the effect of X on Y changes for different values of X. Does X have diminishing or increasing "returns"?

Thanks for the link. I like the explanation a lot; very clear!

[edited out rest of my post, because I think I figured it out]

strengejacke commented 3 years ago

Ah, I just saw that my recent change (other point-geoms) actually didn't improve the plots at the end of that vignette, but rather made them almost disappear. I reverted the change, so the points should be visible again now. When the pkgdown page is rendered again (~10 minutes?), you should see the plots and the "between" effect should become clearer.

vincentarelbundock commented 3 years ago

Ah yeah, I see now. Much better.

In Model 2, it looks like you have random slopes in addition to just fixed effects (intercept shifts). Is that intended?

So what I take from these examples and discussion is that if you have a data-generating model like this:

y = bx + ku + e, where b is the effect of x on y and k is the strength of the unit effects u. And if you estimate a Mundlak-style model like this:

y ~ I(x - mean(x)) + I(mean(x))

Then the coefficient on the first term is equal to b, and the coefficient on the second term is equal to some combination of k and b. The former is interpreted as "within" and the latter as "between". Is that correct?

strengejacke commented 3 years ago

Model 2 is a simple linear model (the FE model), or which one did you mean?

m2 <- lm(y ~ 0 + x_within + grp, data = d)
model_parameters(m2)[1, ]
#> Parameter | Coefficient |   SE |       95% CI | t(99) |      p
#> --------------------------------------------------------------
#> x_within  |        1.20 | 0.07 | [1.06, 1.35] | 16.08 | < .001

vincentarelbundock commented 3 years ago

I meant that groups seem to have distinct intercepts and slopes in this graph. In the simple fixed effects model, I would have expected to see distinct intercepts, but identical slopes.

Screenshot 2021-04-13 084953

strengejacke commented 3 years ago

Then the coefficient on the first term is equal to b, and the coefficient on the second term is equal to some combination of k and b. The former is interpreted as "within" and the latter as "between". Is that correct?

Actually, the "within-between" model is different from the original Mundlak-specification. Mundlak modeled "contextual" effects, which are not the same as "between" effects. This is a newer development, based on what Mundlak found out, see the bullet-point in the vignette:

When time-varying predictors are “decomposed” into their time-varying and time-invariant components (demeaning), then mixed models can model both within- and between-subject effects (Bell, Fairbrother, and Jones 2019) - this approach is essentially a further development of a long-known recommendation by Mundlak (Mundlak 1978).

vincentarelbundock commented 3 years ago

oh cool. I'll read Bell et al.

strengejacke commented 3 years ago

I meant that groups seem to have distinct intercepts and slopes in this graph. In the simple fixed effects model, I would have expected to see distinct intercepts, but identical slopes.

Ah, ok. That's just the visuals. The model estimates fixed slopes for all groups. See also the addition:

This returns the coefficient of the “within”-effect, which is 1.2, with a standard error of 0.07. Note that the FE-model does not take the variation between subjects into account, thus resulting in (possibly) biased estimates, and biased standard errors.

strengejacke commented 3 years ago

I have attached some slides from my multilevel modeling course, maybe these are a bit clearer (I can provide the whole slides if anybody likes...) Multilevel_Modelling_in_R_Repeated_Measurements.pdf

strengejacke commented 3 years ago

(Reminder to respond to https://github.com/strengejacke/ggeffects/issues/66#issuecomment-786989241, which may fit into this discussion and which else gets lost in the closed issue)

strengejacke commented 1 year ago

Closing this, I think the vignettes for ggeffects are clearer now regarding terminology, and marginaleffects has plenty of helpful vignettes, too!

strengejacke / ggeffects

Definition of "marginal effect" #203

Adjusted predictions

Marginal effects

Odds ratios