strengejacke / ggeffects

Estimated Marginal Means and Marginal Effects from Regression Models for ggplot2
https://strengejacke.github.io/ggeffects
Other
558 stars 35 forks source link

ggpredict for GLM reports the SE from the predicted logit and not from the predicted response. #151

Open cschwarz-stat-sfu-ca opened 4 years ago

cschwarz-stat-sfu-ca commented 4 years ago

I'm using ggpredict() to estimate the response from the glm() but the output from ggpredict reports the se for the prediction on the logit scale and not the [0.1] scale. A MWE is attached.

ggeffects.bug.pdf

ggeffects.bug.r.txt

library(emmeans) library(ggeffects)

create some dummy data for a glm

mydata <- expand.grid(x1=seq(0,10,1), x2=seq(0,10,1)) mydata$logit.prob <- 0.5 + .1mydata$x1 - .2mydata$x2 mydata$prob <- 1/(1+exp(-mydata$logit.prob)) mydata$y <- rbinom(nrow(mydata), size=1, mydata$prob)

fit a glm

fit <- glm(y ~ x1 + x2, data=mydata, family=binomial(link=logit))

estimate marginal effects of x1 using emmeans

emmeans::ref_grid(fit, at=list(x1=c(4,5,6))) summary(.Last.ref_grid, infer=c(TRUE,FALSE)) summary(.Last.ref_grid, type="response",infer=c(TRUE,FALSE))

ggeffects

ggpredict(fit, terms = "x1 [4,5,6]", ci.lvl = 0.95)

notice that ggeffects estimates the probability at each level of x1 but the se is from the logit(probability) estimator.

the ci appears to be correct.

strengejacke commented 4 years ago

Thanks, usually a message is printed indicating that SE are on the link-scale, however, since some time, there seems to be a bug. Should be fixed so that now at least the message is printed again.

The reason why SEs are not transformed to the response scale is: 1) I'm not sure if this is done in the same way for all applicable models, i.e. exp(estimate) * SE; 2) I'm not sure if the SE on the response scale is always "meaningful", also in the sense of intuitive to interpret.

cschwarz-stat-sfu-ca commented 4 years ago

Thanks.... the SE needs to be back transformed according to the type of link function, i.e. the back transformation for a logit() link is different than the se for a log() link. Not that the reverse se for a logit() link (as in the glm) is NOT exp(estimate)*SE - this back transformation is only true for a log() link.

Why not just use the emmeans package function to do the heavy lifting for you as it "recognizes" all the major transformations automatically.

THere is nothing wrong with the SE on any scale... the only advantage is that central limit theorem that says that the sampling distribution of the estimate is approximately normal and so that computing the ci as estimate +/- multiplier x se performs better with smaller sample sizes on the link scale rather than the response scale. However, as sample sizes increase, everything works properly.

Carl Schwarz

On Sun, Aug 9, 2020 at 11:21 PM Daniel notifications@github.com wrote:

Thanks, usually a message is printed indicating that SE are on the link-scale, however, since some time, there seems to be a bug. Should be fixed so that now at least the message is printed again.

The reason why SEs are not transformed to the response scale is: 1) I'm not sure if this is done in the same way for all applicable models, i.e. exp(estimate) * SE; 2) I'm not sure if the SE on the response scale is always "meaningful", also in the sense of intuitive to interpret.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/strengejacke/ggeffects/issues/151#issuecomment-671182736, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABXIXRSKTIUR7OQUEEJI4LTR76GU3ANCNFSM4PZQBHEA .

strengejacke commented 4 years ago

Ok, I have now changed the handling of standard errors, so these are now always back-transformed via the link-inverse function. Not sure, though, why emmeans returns the same SE on the link-scale, but back-transformed SE differ?

library(ggeffects)
mydata <- expand.grid(x1=seq(0,10,1), x2=seq(0,10,1))
mydata$logit.prob <- 0.5 + .1*mydata$x1 - .2*mydata$x2
mydata$prob       <- 1/(1+exp(-mydata$logit.prob))
mydata$y <- rbinom(nrow(mydata), size=1, mydata$prob)
# fit a glm
fit <- glm(y ~ x1 + x2, data=mydata, family=binomial(link=logit))

# estimate marginal effects of x1 using emmeans
emmeans::ref_grid(fit, at = list(x1 = c(4, 5, 6)))
#> 'emmGrid' object with variables:
#>     x1 = 4, 5, 6
#>     x2 = 5
#> Transformation: "logit"
summary(.Last.ref_grid,
        type = "response",
        infer = c(TRUE, FALSE))
#>  x1 x2  prob     SE  df asymp.LCL asymp.UCL
#>   4  5 0.547 0.0507 Inf     0.447     0.643
#>   5  5 0.571 0.0483 Inf     0.475     0.662
#>   6  5 0.595 0.0504 Inf     0.494     0.689
#> 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the logit scale

ggpredict(fit, terms = "x1 [4,5,6]", ci.lvl = 0.95)
#> 
#> # Predicted probabilities of y
#> # x = x1
#> 
#> x | Predicted |   SE |       95% CI
#> -----------------------------------
#> 4 |      0.55 | 0.55 | [0.45, 0.64]
#> 5 |      0.57 | 0.55 | [0.48, 0.66]
#> 6 |      0.60 | 0.55 | [0.49, 0.69]
#> 
#> Adjusted for:
#> * x2 = 5.00

ggemmeans(fit, terms = "x1 [4,5,6]", ci.lvl = 0.95)
#> 
#> # Predicted probabilities of y
#> # x = x1
#> 
#> x | Predicted |   SE |       95% CI
#> -----------------------------------
#> 4 |      0.55 | 0.55 | [0.45, 0.64]
#> 5 |      0.57 | 0.55 | [0.48, 0.66]
#> 6 |      0.60 | 0.55 | [0.49, 0.69]
#> 
#> Adjusted for:
#> * x2 = 5.00

Created on 2020-08-11 by the reprex package (v0.3.0)

cschwarz-stat-sfu-ca commented 4 years ago

Unfortunately, you have NOT computed the back-transformed SE properly. The anti-logit transformation for the estimate and the confidence limits is computed as p=1/(1+exp(-theta)) where theta is the estimate or upper/lower ci, but the SE on the back-transformed scale is NOT done this way. According to the delta method, the back-transformation for the se will be SEp(1-p) where p is the back transformed estimate.

For example, the estimate on the logit scale at x1=4 and x2=5 is -0.112 (SE .197). The anti-logit of the estimate is p=1/(1+exp(-(-.112))=.472. The SE on the anti-logit scale is found as SE (on logit scale) p (1-p) = .197

Attached is the updated code and output for emmeans and you updated (but incorrect) latest version of ggeffects from the github site.

I'm not sure what you mean by the CI differing? They appears to match after rounding?

On Tue, Aug 11, 2020 at 2:53 AM Daniel notifications@github.com wrote:

Ok, I have now changed the handling of standard errors, so these are now always back-transformed via the link-inverse function. Not sure, though, why emmeans returns the same SE on the link-scale, but back-transformed CI differ?

library(ggeffects)mydata <- expand.grid(x1=seq(0,10,1), x2=seq(0,10,1))mydata$logit.prob <- 0.5 + .1mydata$x1 - .2mydata$x2mydata$prob <- 1/(1+exp(-mydata$logit.prob))mydata$y <- rbinom(nrow(mydata), size=1, mydata$prob)# fit a glmfit <- glm(y ~ x1 + x2, data=mydata, family=binomial(link=logit))

estimate marginal effects of x1 using emmeansemmeans::ref_grid(fit, at = list(x1 = c(4, 5, 6)))#> 'emmGrid' object with variables:#> x1 = 4, 5, 6#> x2 = 5#> Transformation: "logit"

summary(.Last.ref_grid, type = "response", infer = c(TRUE, FALSE))#> x1 x2 prob SE df asymp.LCL asymp.UCL#> 4 5 0.547 0.0507 Inf 0.447 0.643#> 5 5 0.571 0.0483 Inf 0.475 0.662#> 6 5 0.595 0.0504 Inf 0.494 0.689#> #> Confidence level used: 0.95 #> Intervals are back-transformed from the logit scale

ggpredict(fit, terms = "x1 [4,5,6]", ci.lvl = 0.95)#> #> # Predicted probabilities of y#> # x = x1#> #> x | Predicted | SE | 95% CI#> -----------------------------------#> 4 | 0.55 | 0.55 | [0.45, 0.64]#> 5 | 0.57 | 0.55 | [0.48, 0.66]#> 6 | 0.60 | 0.55 | [0.49, 0.69]#> #> Adjusted for:#> * x2 = 5.00

ggemmeans(fit, terms = "x1 [4,5,6]", ci.lvl = 0.95)#> #> # Predicted probabilities of y#> # x = x1#> #> x | Predicted | SE | 95% CI#> -----------------------------------#> 4 | 0.55 | 0.55 | [0.45, 0.64]#> 5 | 0.57 | 0.55 | [0.48, 0.66]#> 6 | 0.60 | 0.55 | [0.49, 0.69]#> #> Adjusted for:#> * x2 = 5.00

Created on 2020-08-11 by the reprex package https://reprex.tidyverse.org (v0.3.0)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/strengejacke/ggeffects/issues/151#issuecomment-671848980, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABXIXRREXBVWXWDC5PRSB33SAEIIXANCNFSM4PZQBHEA .

cschwarz-stat-sfu-ca commented 4 years ago

ggeffects.bug.pdf ggeffects.bug.txt

strengejacke commented 4 years ago

Damn, I thought the link-inverse would do the trick, but I got you wrong then. CI was a typo, I meant SE. I'll revert the changes and stick to the message that SEs are on the link-scale, until I figure out which transformation to use for which kind of model.

cschwarz-stat-sfu-ca commented 4 years ago

I think it is dangerous to leave SE for the untransformed data in the table. The note at the bottom of the table will invariably get lost/overlooked. Users will ask how come the 95% ci is not approximately estimate +/- 2 se etc It would be safer to "blank" out the SE column after the back-transformation.

Alternatives... (a) let emmeans do the heavy lifting for you. It does pretty much what you are attempting to do in the ggeffects package and you just need to repackage the results. You likely were unaware of how to set the reference grid in emmeans, but see my example code. (b) use the msm package to find the se after the transformation. You feed in the estimate/se, the function, and it computes the back-transformed estimate and the proper se for you.

Carl

On Tue, Aug 11, 2020 at 1:00 PM Daniel notifications@github.com wrote:

Damn, I thought the link-inverse would do the trick, but I got you wrong then. CI was a typo, I meant SE. I'll revert the changes and stick to the message that SEs are on the link-scale, until I figure out which transformation to use for which kind of model.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/strengejacke/ggeffects/issues/151#issuecomment-672248213, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABXIXRWXDRHMVD4HPFUN64LSAGPPHANCNFSM4PZQBHEA .

strengejacke commented 4 years ago

a) does not completely solve the issue, because ggpredict() and ggemmeans() may return different results (see here). I'm aware of the ref-grid option, as I'm using it in ggemmeans().

b) I'll look into that package. Maybe I re-implement it in an own solution to keep dependencies small.

cschwarz-stat-sfu-ca commented 4 years ago

The reason for the difference is that emmeans gives MARGINAL means while ggpredict is giving predictions for a SINGLE level of a categorical variable.. What is a marginal mean -- as you note, that if you don't specify a level for a categorical factor, emmeans will average over the responses for each level. ggpredict holds the categorical variables fixed at the reference level. Both are "correct" but have different interpretations and cannot be directly compared to each other and so the answers will differ.

For example, suppose you fit a model lm( bloodpressure ~ sex + minutes.of.exercise.per.week....) where sex is categorical with 2 levels (M/F) and minutes of exercise per week is continuous.

If you ask for a prediction of mean blood pressure at 22 minutes of exercise/week, emmeans will average over the two sexes assuming a 50:50 sex ratio (the weighting can be changed); ggpredict will give the predicition for females only. Both are correct, but are not directly comparable.

You can also get emmeans to make predictions at the particular level if you are so interested.

Carl.

On Tue, Aug 11, 2020 at 11:27 PM Daniel notifications@github.com wrote:

a) does not completely solve the issue, because ggpredict() and ggemmeans() may return different results (see here https://strengejacke.github.io/ggeffects/articles/technical_differencepredictemmeans.html). I'm aware of the ref-grid option, as I'm using it in ggemmeans().

b) I'll look into that package. Maybe I re-implement it in an own solution to keep dependencies small.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/strengejacke/ggeffects/issues/151#issuecomment-672636167, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABXIXRVIBLEPDCTEWAE2TITSAIY55ANCNFSM4PZQBHEA .

strengejacke commented 4 years ago

Yes, my point was that due to the fact that predict and emmeans may return different results, I cannot use emmeans for the SEs from the ggpredict() function, which relies on predict(). Thus, my intention is to find a more generic way or own implementation of transforming SEs.

msm's deltamethod() seems to require a formula, based on the model family / link function, but that is actually the missing information for me here: what formula for transforming SEs is required for which model? I'll leave this issue open, maybe I can find a solution somewhen...