strengejacke / ggeffects

Estimated Marginal Means and Marginal Effects from Regression Models for ggplot2
https://strengejacke.github.io/ggeffects

Question about sources of uncertainty when estimating confidence intervals #406

Closed xfim closed 11 months ago

xfim commented 12 months ago

Dear all,

This may be a more general question about the meaning of "marginal effects", or even about the meaning of "expected values", than about the package per se, so pardon me if you consider that this is not the right venue. But the issue arose while working with the package.

I have observed that when using ggpredict() for a simple linear regression model with an interaction (y ~ a * b), the outcome of ggpredict() is a point-estimate prediction plus a confidence interval that is later used for plotting the uncertainty bands. While the value of the point-estimate prediction is quite clear to me (predicted value = a.coef * a.to.predict + b.coef * b.to.predict + ab.interaction.coef * a.to.predict * b.to.predict), the calculation of the confidence interval is less so.

I would expect the confidence interval of the prediction to include the uncertainty (SE) not only of a.coef, but also of b.coef and ab.interaction.coef. So something like: predicted.with.ci = (a.coef ± a.SE) * a.to.predict + (b.coef ± b.SE) * b.to.predict + (ab.interaction.coef ± ab.interaction.SE) * a.to.predict * b.to.predict

But from what I can see in the code, and as far as I understand it, this does not seem to be the case: the uncertainty only considers one of the coefficients, so something like: predicted.with.ci = (a.coef ± a.SE) * a.to.predict + b.coef * b.to.predict + ab.interaction.coef * a.to.predict * b.to.predict

Is that the case? If so, shouldn't the confidence intervals associated with expected/predicted values include the uncertainty of all the coefficients involved?

Thank you.

strengejacke commented 12 months ago

I think you are referring to the difference between confidence and prediction intervals. The latter also take the residual variation of the model (sigma) into account, to reflect the uncertainty in predicting new observations that are not in the data.

library(ggeffects)
data(efc)
efc <- datawizard::to_factor(efc, "e42dep")
fit <- lm(barthtot ~ e42dep, data = efc)

ggpredict(fit, "e42dep")
#> # Predicted values of Total score BARTHEL INDEX
#> 
#> e42dep               | Predicted |         95% CI
#> -------------------------------------------------
#> independent          |     94.77 | [90.40, 99.14]
#> slightly dependent   |     86.19 | [83.83, 88.56]
#> moderately dependent |     73.33 | [71.30, 75.37]
#> severely dependent   |     32.63 | [30.58, 34.68]

ggpredict(fit, "e42dep", interval = "prediction")
#> # Predicted values of Total score BARTHEL INDEX
#> 
#> e42dep               | Predicted |          95% CI
#> --------------------------------------------------
#> independent          |     94.77 | [59.30, 130.24]
#> slightly dependent   |     86.19 | [50.91, 121.47]
#> moderately dependent |     73.33 | [38.07, 108.59]
#> severely dependent   |     32.63 | [-2.63,  67.89]
#> 
#> Intervals are prediction intervals. Use `interval = "confidence"` to
#>   return regular confidence intervals.

Created on 2023-11-21 with reprex v2.0.2
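
To make the difference between the two interval types concrete, here is a minimal base R sketch. The data set (mtcars), the model and the names new_data, se_pred, conf_int and pred_int are assumptions chosen for illustration; they do not come from the example above. It follows the definitions used by predict.lm(); whether ggeffects uses exactly these steps internally for every model class is not something this sketch claims.

```r
# Illustrative model and data (assumptions, not taken from the thread)
fit <- lm(mpg ~ wt, data = mtcars)
new_data <- data.frame(wt = c(2.5, 3.5))

pr <- predict(fit, newdata = new_data, se.fit = TRUE)
t_crit <- qt(0.975, df = fit$df.residual)

# Confidence interval: uncertainty of the estimated mean only
conf_int <- cbind(lwr = pr$fit - t_crit * pr$se.fit,
                  upr = pr$fit + t_crit * pr$se.fit)

# Prediction interval: the residual variance (sigma^2) is added on top
se_pred <- sqrt(pr$se.fit^2 + sigma(fit)^2)
pred_int <- cbind(lwr = pr$fit - t_crit * se_pred,
                  upr = pr$fit + t_crit * se_pred)

# Reference values computed by predict.lm() itself
conf_ref <- predict(fit, newdata = new_data, interval = "confidence")
pred_ref <- predict(fit, newdata = new_data, interval = "prediction")

all.equal(unname(conf_int), unname(conf_ref[, c("lwr", "upr")]))
all.equal(unname(pred_int), unname(pred_ref[, c("lwr", "upr")]))
```

Both intervals share the same point prediction; the prediction interval is wider only because sigma(fit)^2 enters the standard error.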

The standard errors themselves are provided by predict(). However, if no standard errors are available, ggpredict() tries to calculate them manually. You can find some details here: https://strengejacke.github.io/ggeffects/articles/ggeffects.html#short-technical-note

The manual calculation is actually the same as the one behind the standard errors that are directly returned by predict().
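
As a hedged illustration of what such a manual calculation can look like for a linear model, the sketch below assumes the standard formula sqrt(diag(X V X')), where X is the model matrix for the values to predict and V is the variance-covariance matrix of the coefficients. The data set (mtcars), the model and the names new_data, se_manual and se_wt_only are assumptions for illustration; the linked technical note remains the reference for what ggeffects actually does.

```r
# Illustrative model with an interaction (assumption, not from the thread)
fit <- lm(mpg ~ wt * hp, data = mtcars)
new_data <- expand.grid(wt = c(2.5, 3.5), hp = c(100, 150))

# Model matrix for the new data: columns (Intercept), wt, hp, wt:hp
X <- model.matrix(delete.response(terms(fit)), data = new_data)

# Variance-covariance matrix of ALL coefficients, including covariances
V <- vcov(fit)

# Standard error of each expected value
se_manual <- sqrt(diag(X %*% V %*% t(X)))

# Standard errors returned by predict()
se_predict <- predict(fit, newdata = new_data, se.fit = TRUE)$se.fit

all.equal(unname(se_manual), unname(se_predict))

# For contrast: using only the variance of the wt coefficient
# gives a different (incorrect) standard error
se_wt_only <- abs(X[, "wt"]) * sqrt(V["wt", "wt"])
cbind(se_manual, se_wt_only)
```

Because V contains the variances and covariances of every coefficient, the resulting interval reflects the uncertainty of the intercept, both main effects and the interaction jointly, not of a single coefficient.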

Maybe these two resources are helpful: https://www.indeed.com/career-advice/career-development/prediction-interval-vs-confidence-interval https://stats.stackexchange.com/q/16493/54740

xfim commented 12 months ago

Thank you, @strengejacke.

I was indeed referring to the expected values (confidence intervals in your terminology, the ones that only take into account the uncertainty in the estimators), as opposed to the predicted values (here we agree: the ones that also take into account what we don't know about the model and the data).

So my question referred exclusively to the former. Even without taking the uncertainty of the model into account, when we calculate an expected value with its confidence interval, it seems that only the uncertainty of the variable at hand is taken into consideration.

I am asking because, under purely Bayesian inference, one would expect the uncertainty of the expected values to reflect the credible intervals not only of a.coef, but also of b.coef and ab.interaction.coef. However, I have checked stan_glm() and it also seems to take only a.coef into account. When I estimate the model using JAGS and want an expected value, I need to take all three into account. That was the origin of my question: the differences between how ggeffects behaves (even in a Bayesian setting) and what I do manually.
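
The manual, draw-based calculation described here can be sketched in base R as follows. This is only an approximation under stated assumptions: instead of real posterior draws from stan_glm() or JAGS, it draws coefficient vectors from a multivariate normal centred on the least-squares estimates, and the data set (mtcars), the model and the names draws, mu_draws and n_draws are illustrative.

```r
set.seed(123)

# Illustrative model with an interaction (assumption, not from the thread)
fit <- lm(mpg ~ wt * hp, data = mtcars)
new_data <- data.frame(wt = 3, hp = 120)
X <- model.matrix(delete.response(terms(fit)), data = new_data)

beta_hat <- coef(fit)
V <- vcov(fit)
n_draws <- 20000

# Joint draws of ALL coefficients from N(beta_hat, V);
# chol(V) is upper triangular, so z %*% chol(V) has covariance V
z <- matrix(rnorm(n_draws * length(beta_hat)), nrow = n_draws)
draws <- sweep(z %*% chol(V), 2, beta_hat, "+")

# One expected value per draw, using every coefficient of that draw
mu_draws <- as.vector(draws %*% t(X))

# Interval from the draws and its spread
quantile(mu_draws, c(0.025, 0.975))
sd(mu_draws)

# Analytic standard error from predict() for comparison
se_analytic <- predict(fit, newdata = new_data, se.fit = TRUE)$se.fit
se_analytic
```

Up to simulation error, the spread of the draws agrees with the standard error from predict(), which suggests that both routes account for the uncertainty of all coefficients jointly.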

Is it a bit clearer? And, again, if this is not the right venue, I will ask in a more appropriate one.

Thank you.