strengejacke / ggeffects

Estimated Marginal Means and Marginal Effects from Regression Models for ggplot2
https://strengejacke.github.io/ggeffects
Other
548 stars 35 forks source link

Incorrect estimates, SE, and CI with nlme::lme() #295

Open DrJerryTAO opened 1 year ago

DrJerryTAO commented 1 year ago

Hi @strengejacke

There are a few problems with ggeffects() of a lme() model

  1. Given the latest {insight}, ggeffects sometimes produce errors with lme() if its weights = varFunc() is specified.
  2. ggpredict() holds the cluster index at a specific level even when type = "fixed" for a population estimate. It gives incorrect estimates and SE when type = "random" and cluster index is specified in condition = ...: cluster specific estimates and SEs do not vary by cluster.
  3. ggpredict() and all other functions do not account for population variance changing by some predictors. SE calculation does not include heteroscedasticity consideration when weights = varFunc() is specified.
  4. ggemmeans() gives the same point estimates but different SE than ggpredict(). Like ggpredict(), cluster-specific estimates and SEs are incorrect and do not vary by cluster; weights = varFunc() in lme() is not correctly used.
  5. ggeffect() incompatible with insight::find_weights() when weights = varFunc() is specified in lme(), making a warning SOMETIMES, probably when weights = varFunc() includes more than one predictors for heteroscedasticity.
  6. ggeffect() does not make cluster-specific estimates and SEs are incorrect just like ggpredict() and ggemmeans(). Not sure if it is intended, but average marginal effects could also been assessed within each cluster separately.

Some similar issues were reported in {marginaleffects} at https://github.com/vincentarelbundock/marginaleffects/issues/99#issuecomment-1465794353

Model set up. {insight} find and get seem to work well.

library(nlme)
library(insight)
library(ggeffects)
data("Orthodont", package = "nlme")
ggplot( # spaghetti plot with LOESS group mean
  data = Orthodont, 
  aes(x = age, y = distance, group = Subject, color = Sex, fill = Sex)) + 
  geom_smooth(aes(
    group = Sex), alpha = 0.6, method = "loess", span = 1, show.legend = F) + 
  geom_line(size = 0.2, alpha = 0.6) +
  scale_color_manual(
    breaks = c("Female", "Male"), 
    values = c("#D14343", "#2197DF")) +
  scale_fill_manual(
    breaks = c("Female", "Male"), 
    values = c("#D14343", "#2197DF")) +
  scale_x_continuous(breaks = c(8, 10, 12, 14), minor_breaks = NULL) + 
  labs(x = "Age (year)", y = "Distance") + 
  theme_minimal() + 
  theme(
    strip.placement = "outside", 
    strip.text = element_text(size = rel(1)), 
    legend.title = element_blank(), 
    legend.position = c(0.1, 0.7), # "bottom"
    legend.direction = "vertical", legend.box = "vertical", 
    legend.margin = margin(0, 0, 0, 0), 
    legend.spacing = unit(0, "line"), 
    legend.background = element_blank(),
    legend.text = element_text(size = rel(0.8)),
    legend.key = element_blank(), 
    legend.key.size = unit(0.8, "line"),
    legend.key.width = unit(1.2, "line"),
    panel.grid.major.y = element_line(colour = "grey90", size = 0.2), 
    panel.grid.minor.y = element_line(colour = "grey90", size = 0.2), 
    panel.grid.major.x = element_line(colour = "grey90", size = 0.2), 
    panel.grid.minor.x = element_line(colour = "grey90", size = 0.2)
  )
summary(Model <- lme( # using all formula components
  distance ~ age, data = Orthodont, # fixed effect
  random = list(~ 1 + age | Subject), # correlated random int + age
  correlation = corAR1( # error correlation over time within Subject
    form = ~ as.integer(as.factor(age)) | Subject), # 1:4 as time index
  weights = varIdent(form = ~ 1 | Sex))) # heterosc: M/F has diff var
"       AIC      BIC    logLik
  436.6792 457.9867 -210.3396

Random effects:
 Formula: ~1 + age | Subject
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 2.3152914 (Intr)
age         0.2402225 -0.619
Residual    1.5037380       

Correlation Structure: AR(1)
 Formula: ~as.integer(as.factor(age)) | Subject 
 Parameter estimate(s):
       Phi 
-0.2325323 

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | Sex 
 Parameter estimates:
    Male   Female 
1.000000 0.401668 

Fixed effects:  distance ~ age 
                Value Std.Error DF t-value p-value
(Intercept) 17.115492 0.6384924 80 26.8061       0
age          0.624408 0.0617834 80 10.1064       0"
vcov(Model) # only among fixed effect coefficients
"            (Intercept)          age
(Intercept)   0.4599040 -0.024615698
age          -0.0246157  0.002237791"
find_predictors(Model)
'$conditional
[1] "age"
$correlation
[1] "age"     "Subject"'
find_predictors(Model, effects = "all", component = "all")
'$conditional
[1] "age"
$random
[1] "Subject"
$correlation
[1] "age"     "Subject"'
find_variables(Model)
'$response
[1] "distance"
$conditional
[1] "age"
$random
[1] "Subject"
$correlation
[1] "age"     "Subject"'
find_weights(Model) # "varIdent(form = ~1 | Sex)" before fix
'"Sex"'
class(find_weights(Model)) # "character"
names(get_data(Model)) # Sex left out before fix
'"distance" "age"      "Subject"  "Sex"'

predict.lme() as a reference. It has a level = c() argument to include or not random effects by clustering level.

# Prediction using nlme
Orthodont[Orthodont$Subject %in% c("M01", "F01"), ]
"Grouped Data: distance ~ age | Subject
   distance age Subject    Sex
1      26.0   8     M01   Male
2      25.0  10     M01   Male
3      29.0  12     M01   Male
4      31.0  14     M01   Male
65     21.0   8     F01 Female
66     20.0  10     F01 Female
67     21.5  12     F01 Female
68     23.0  14     F01 Female"
predict(Model, newdata = expand.grid(
  age = c(8, 10, 12, 14), Sex = c("Male", "Female")), 
  level = 0) # mean of population. No SE or CI option.
# Male has the same prediction as Female: Sex is only in weights
"[1] 22.11075 23.35957 24.60839 25.85720 22.11075 23.35957 24.60839
[8] 25.85720"
predict(Model, newdata = Orthodont[
  Orthodont$Subject %in% c("M01", "F01"), ], 
  level = 0) # same result as above
predict(Model, newdata = expand.grid(
  age = c(8, 10, 12, 14), Sex = c("Male", "Female")), 
  level = 1) # must supply cluster variables
"Error in predict.lme: 
cannot evaluate groups for desired levels on 'newdata'"
predict(Model, newdata = expand.grid( # level = max() by default
  age = c(8, 10, 12, 14), Sex = c("Male", "Female"))) # same error
predict(Model, newdata = Orthodont[
  Orthodont$Subject %in% c("M01", "F01"), ], 
  level = 1) # different when including random effects
"     M01      M01      M01      M01      F01      F01      F01 
24.77898 26.53948 28.29998 30.06047 20.10894 20.92936 21.74977 
     F01 
22.57018"
predict(Model, newdata = Orthodont[
  Orthodont$Subject %in% c("M01", "F01"), ], 
  level = 0:1) # level can be a vector, 0 for population
"   Subject predict.fixed predict.Subject
1      M01      22.11075        24.77898
2      M01      23.35957        26.53948
3      M01      24.60839        28.29998
4      M01      25.85720        30.06047
65     F01      22.11075        20.10894
66     F01      23.35957        20.92936
67     F01      24.60839        21.74977
68     F01      25.85720        22.57018"

ggeffects results. See comments inside

# Marginal effects using {ggeffects}
ggpredict( # Error probably due to weights = varFunc() not a numeric vector
  Model) # but a function of a factor Sex to be evaluated with var coef
"Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,  : 
  ‘sum’ not meaningful for factors
In addition: Warning message:
In Ops.factor(x, w) : ‘*’ not meaningful for factors"
ggpredict(Model, terms = c("Sex")) # same error
"`Sex` was not found in model terms. Maybe misspelled?
Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,  : 
  ‘sum’ not meaningful for factors
In addition: Warning message:
In Ops.factor(x, w) : ‘*’ not meaningful for factors"
ggpredict(Model, terms = c("age")) # Worked
"age | Predicted |         95% CI
--------------------------------
  8 |     22.11 | [21.33, 22.89]
 10 |     23.36 | [22.54, 24.18]
 12 |     24.61 | [23.69, 25.52]
 14 |     25.86 | [24.79, 26.92]
Adjusted for:
* Subject = M16"
ggpredict( # same; fe by default, but why hold Subject = M16?? Should be
  Model, terms = c("age"), type = "fixed") # population; misleading notes
ggpredict(Model, terms = c("age"), type = "random") # larger SE wider CI
"age | Predicted |         95% CI
--------------------------------
  8 |     22.11 | [19.02, 25.20]
 10 |     23.36 | [20.26, 26.46]
 12 |     24.61 | [21.48, 27.74]
 14 |     25.86 | [22.68, 29.03]
Adjusted for:
* Subject = M16"
ggpredict(Model, terms = c("age", "Sex")) # SE/CI NA on some rows
# Warning `Sex` not found, but female got wider CI? Incorrect SE calculation
"`Sex` was not found in model terms. Maybe misspelled?
# Sex = Male
age | Predicted |         95% CI
--------------------------------
  8 |     22.11 | [21.33, 22.89]
 10 |     23.36 | [22.44, 24.28]
 12 |     24.61 |               
 14 |     25.86 |               
# Sex = Female
age | Predicted |         95% CI
--------------------------------
  8 |     22.11 | [21.30, 22.93]
 10 |     23.36 | [22.30, 24.42]
 12 |     24.61 |               
 14 |     25.86 |               
Adjusted for:
* Subject = M16"
ggpredict(Model, terms = c("age"), condition = c(Sex = "Male")) # worked
"age | Predicted |         95% CI
--------------------------------
  8 |     22.11 | [21.33, 22.89]
 10 |     23.36 | [22.54, 24.18]
 12 |     24.61 | [23.69, 25.52]
 14 |     25.86 | [24.79, 26.92]
Adjusted for:
* Subject = M16"
ggpredict( # same SE/CI as Male, but should be smaller due to weights=~
  Model, terms = c("age"), condition = c(Sex = "Female")) 
ggpredict(Model, terms = c("age"), type = "fixed", condition = c(
  Subject = "M01")) # population mean despite subject ID. Correct
"age | Predicted |         95% CI
--------------------------------
  8 |     22.11 | [21.33, 22.89]
 10 |     23.36 | [22.54, 24.18]
 12 |     24.61 | [23.69, 25.52]
 14 |     25.86 | [24.79, 26.92]"
ggpredict(Model, terms = c("age"), type = "random", condition = c(
  Subject = "M01")) # individual mean same as population mean? Incorrect!!!
"age | Predicted |         95% CI
--------------------------------
  8 |     22.11 | [19.02, 25.20]
 10 |     23.36 | [20.26, 26.46]
 12 |     24.61 | [21.48, 27.74]
 14 |     25.86 | [22.68, 29.03]
Intervals are prediction intervals."
ggpredict(Model, terms = c("age"), type = "random", condition = c(
  Subject = "M02")) # indivudal mean/CI does not vary by ID. Incorrect!!!
"age | Predicted |         95% CI
--------------------------------
  8 |     22.11 | [19.02, 25.20]
 10 |     23.36 | [20.26, 26.46]
 12 |     24.61 | [21.48, 27.74]
 14 |     25.86 | [22.68, 29.03]"

ggemmeans(Model, terms = c("Subject [all]")) 
"Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,  : 
  ‘sum’ not meaningful for factors
In addition: Warning message:
In Ops.factor(x, w) : ‘*’ not meaningful for factors"
ggemmeans( # CI reflects a larger SE than what ggpredict() used 
  Model, terms = c("age")) # Unclear if larger Male SE; or the average by Sex
"age | Predicted |         95% CI
--------------------------------
  8 |     22.11 | [21.30, 22.92]
 10 |     23.36 | [22.52, 24.20]
 12 |     24.61 | [23.66, 25.55]
 14 |     25.86 | [24.76, 26.96]"
ggemmeans( # Also wider re CI than ggpredict; unclear SE calculation!!!
  Model, terms = c("age"), type = "random") 
"age | Predicted |         95% CI
--------------------------------
  8 |     22.11 | [18.31, 25.91]
 10 |     23.36 | [19.52, 27.19]
 12 |     24.61 | [20.67, 28.55]
 14 |     25.86 | [21.77, 29.95]"
ggemmeans( # same results as above. Incorrect pred and SE!!!
  Model, terms = c("age"), type = "random", condition = c(Subject = "M01")) 
ggemmeans( # same results as above. Incorrect pred and SE!!!
  Model, terms = c("age"), type = "random", condition = c(Subject = "F01"))
ggemmeans(Model, terms = c("age", "Sex")) # Error on Sex
"`Sex` was not found in model terms. Maybe misspelled?
Can't compute estimated marginal means, 'emmeans()' returned error.
Reason: No variable named Sex in the reference grid
You may try 'ggpredict()' or 'ggeffect()'."
ggemmeans( # no change in SE CI. Incorrect SE wrt varFunc
  Model, terms = c("age"), condition = c(Sex = "Female"))
ggemmeans(Model, terms = c("age"), type = "simulate") # error

ggeffect(Model, terms = c("age")) # same as ggpredict()
"age | Predicted |         95% CI
--------------------------------
  8 |     22.11 | [21.33, 22.89]
 10 |     23.36 | [22.55, 24.17]
 12 |     24.61 | [23.70, 25.52]
 14 |     25.86 | [24.80, 26.92]"
# Based on a different model, possible incompatibility between ggeffect()  
# and insight::find_weights when weights=~ used in lme()
"Warning message:
In !is.null(weighting_var) && !weighting_var %in% colnames(mf) :
  'length(x) = 3 > 1' in coercion to 'logical(1)'"
ggeffect(Model, terms = c("age"), type = "re") # same CI as fe? Incorrect!!!
"age | Predicted |         95% CI
--------------------------------
  8 |     22.11 | [21.33, 22.89]
 10 |     23.36 | [22.55, 24.17]
 12 |     24.61 | [23.70, 25.52]
 14 |     25.86 | [24.80, 26.92]"
ggeffect(Model, terms = c("age", "Sex")) # Error: missing model term
"`Sex` was not found in model terms. Maybe misspelled?
Can't compute marginal effects, 'effects::Effect()' returned an error.
Reason: the following predictor is not in the model: Sex
You may try 'ggpredict()' or 'ggemmeans()'."
DrJerryTAO commented 1 year ago

For Problem 1, it happens whenever weights = ~ involves multiple variables, because weighting_var is expected to have length 1. See a simple case below.

Without mean structure predictor

data("Orthodont", package = "nlme")
summary(Model <- lme( # a model of variance only
  distance ~ 1, data = Orthodont, # grand mean
  weights = varConstPower(form = ~ age | Sex))) # var = b0 + |age|^b1
"       AIC      BIC    logLik
  514.5723 533.2821 -250.2861

Random effects:
 Formula: ~1 | Subject
        (Intercept) Residual
StdDev:     1.84895 0.109692

Variance function:
 Structure: Constant plus power of variance covariate, different strata
 Formula: ~age | Sex 
 Parameter estimates:
              Male     Female
const 0.0001144912 13.0138472
power 1.3237960288 -0.4748515

Fixed effects:  distance ~ 1 
               Value Std.Error DF  t-value p-value
(Intercept) 23.42394 0.4046777 81 57.88296       0"
find_predictors(Model) # NULL
find_predictors(Model, effects = "all", component = "all") # NULL
find_variables(Model)
'$response
[1] "distance"'
find_weights(Model) # correct
'"age" "Sex"'
formula(Model$modelStruct$varStruct) # ~age | Sex
names(get_data(Model)) # okay

ggpredict(Model) # empty list!!! Should give a grand mean
'list()
attr(,"class")
[1] "ggalleffects" "list"        
attr(,"model.name")
[1] "Model"'
ggemmeans(Model) # error, but might be as expected
'"Error: `terms` needs to be a character vector with at least one
  predictor name: one term used for the x-axis, more optional
  terms as grouping factors.
In addition: Warning message:
In !is.null(weighting_var) && !weighting_var %in% colnames(mf) :
  length(x) = 2 > 1 in coercion to logical(1)'
ggeffect(Model) # NULL, but might be as expected

With mean structure predictor

summary(Model <- lme( # a model of variance of multiple terms
  distance ~ age, data = Orthodont, # mean change with age
  weights = varConstPower(form = ~ age | Sex))) # var = b0 + |age|^b1
"       AIC      BIC    logLik
  439.3828 466.0172 -209.6914

Random effects:
 Formula: ~age | Subject
 Structure: General positive-definite
            StdDev     Corr  
(Intercept)  1.6329350 (Intr)
age          0.1974838 -0.387
Residual    10.8517576       

Variance function:
 Structure: Constant plus power of variance covariate, different strata
 Formula: ~age | Sex 
 Parameter estimates:
            Male        Female
const  0.1168654  9.637857e-05
power -1.4312273 -1.194815e+00

Fixed effects:  distance ~ age 
                Value Std.Error DF  t-value p-value
(Intercept) 17.268816 0.6113480 80 28.24711       0
age          0.610175 0.0589014 80 10.35926       0"
find_predictors(Model)
'$conditional
[1] "age"'
find_predictors(Model, effects = "all", component = "all") # same
find_variables(Model)
'$response
[1] "distance"
$conditional
[1] "age"'
find_weights(Model) # correct
'"age" "Sex"'
names(get_data(Model)) # okay

ggpredict(Model) # no CI!!!
'Error: Confidence intervals could not be computed.
$age
# Predicted values of distance
age | Predicted
---------------
  8 |     22.15
 10 |     23.37
 12 |     24.59
 14 |     25.81
attr(,"class")
[1] "ggalleffects" "list"        
attr(,"model.name")
[1] "Model"
Warning messages:
1: In !is.null(weighting_var) && !weighting_var %in% colnames(mf) :
  length(x) = 2 > 1 in coercion to logical(1)
2: In colnames(datlist)[ncol(datlist)] <- w :
  number of items to replace is not a multiple of replacement length'
ggemmeans(Model) # same error as previous model, but might be as expected
"Error: `terms` needs to be a character vector with at least one
  predictor name: one term used for the x-axis, more optional
  terms as grouping factors.
In addition: Warning message:
In !is.null(weighting_var) && !weighting_var %in% colnames(mf) :
  'length(x) = 2 > 1' in coercion to 'logical(1)'"
ggeffect(Model) # has CI but warning
'age | Predicted |         95% CI
--------------------------------
  8 |     22.15 | [21.37, 22.93]
 10 |     23.37 | [22.55, 24.19]
 12 |     24.59 | [23.68, 25.50]
 14 |     25.81 | [24.76, 26.87]

attr(,"class")
[1] "ggalleffects" "list"        
attr(,"model.name")
[1] "Model"
Warning message:
In !is.null(weighting_var) && !weighting_var %in% colnames(mf) :
  length(x) = 2 > 1 in coercion to logical(1)'

SE and CI in the last ggeffect(Model) look incorrect: weight = varFunc() coefficient estimates say that population SE and CI width should reduce with age in a power function, whereas ggeffect() CI widens with age.

DrJerryTAO commented 1 year ago

Further note: in a different example, ggemmeans(lme(), type = "random") calculated SE much much larger than ggpredict() had (before recent updates, not ggpredict does not give any SE).