make emmeans more robust to different parameter names

mattansb commented 2 years ago

This PR is meant to allow emmeans support to be robust to the term order in an interaction term in lavaan.

For example, now emmeans will give the same results for all 4 interaction regressions:

y ~ a + b + a:b
y ~ a + b + b:a
y ~ b + a + a:b
y ~ b + a + b:a

mattansb commented 2 years ago

Hey, I'm still figuring out if this is the best route to take here (see discussion here https://github.com/rvlenth/emmeans/issues/342). I have a holiday coming up, so I'll probably only get back to this later this month / early May.

Either way, I will add something more "actionable" in the warning (what ever it may be).

Thanks @TDJorgensen !

TDJorgensen commented 2 years ago

Okay, now I've caught up on the conversation with rvlenth. I agree with his last point that it is purely a practical issue for the automatic values emmeans chooses for factors (all observed levels) and covariates (the mean). In that case, perhaps the simplest solution is to extract lavInspect(fit, "data"), extract the columns that are relevant (all included in terms and the lavaan.DV=), then apply a function to each column that replaces any NA with the mean(..., na.rm=TRUE).

I forget whether the only factor variable recognized in lavaan2emmeans is the group= variable in a multigroup model. Is there a way to specify that a set of dummy codes or other contrasts are really representing a single grouping variable? If not, then I don't see any complications that could arise, except that lavaan can also be fitted to summary statistics rather than raw data (in which case, at least missing data is not an issue, and perhaps simulating a single data set with MASS::mvrnorm() would suffice).

TDJorgensen commented 2 years ago

Upon further inspection, I think your .emlav_recover_data() function could be drastically simplified using new lavPredict() arguments that automatically append original data to factor scores and combine multiple groups into a single data.frame. See this example:

library(lavaan)
HS <- HolzingerSwineford1939
HS$x1[1:5] <- NA
foo <- sem('x1 ~ x2 + x3', data = HS, group = "school",
           meanstructure = TRUE, missing = "FIML")
dat <- lavPredict(foo, assemble = TRUE, append.data = TRUE)
head(dat)

Note that it even works with (multigroup) path models without latent factors, because nothing needs to be appended to the fitted data.

I think the change below would solve the missing-data problem, too. Please let me know whenever you have a chance to get back to this.

.emlav_recover_data <- function(object){
  dat <- lavaan::lavPredict(object, type = "lv",
                            assemble = TRUE,
                            append.data = TRUE)
  ## mean-impute any NAs
  for (i in seq_along(dat)) {
    ## ignore non-numeric variables
    if (!is.numeric(dat[,i])) next
    ## any NAs?
    idx.na <- which(is.na(dat[,i]))
    if (length(idx.na)) {
      dat[idx.na, i] <- mean(dat[,i], na.rm = TRUE)
    }
  }
  ## convert to data.frame, if necessary (single group)
  as.data.frame(dat)
}

TDJorgensen commented 2 years ago

Hi @mattansb, just checking in.

While working on a teaching example, I came across an odd error with multigroup models. Messing around narrowed it down to the case when sample sizes were equal. Adapting your help-page example to have 100 from each school reproduces the problem:

HS <- rbind(head(HolzingerSwineford1939, 100), tail(HolzingerSwineford1939, 100))
model <- 'x1 ~ c(int1, int2)*1 + c(b1, b2)*ageyr
  diff_11 := (int2 + b2*11) - (int1 + b1*11)
  diff_13 := (int2 + b2*13) - (int1 + b1*13)
  diff_15 := (int2 + b2*15) - (int1 + b1*15)
'
semFit <- sem(model, group = "school", data = HS)
emmeans(semFit, pairwise ~ school | ageyr, lavaan.DV = "x1",
        at = list(ageyr = c(11, 13, 15)), nesting = NULL)[[2]]

The problem is in .emlav_recover_data(), when using unlist(group_labels). With equal sample sizes, the sapply() call that creates group_labels will return a matrix rather than a list, unless you add the argument simplify = FALSE. That's a quick fix I can make to the main branch, but if you are going to try to close this issue soon anyway, perhaps you could just add it to your pull request. No pressure, just let me know.

Thanks!

mattansb commented 2 years ago

Hi @TDJorgensen

Happy to hear you're using this emmeans functionality! I think it would be best for you to make that fix on main - I'm currently busy writing my dissertation, so this PR has been put on the back burner (along with about 1000 more things).

(I hope to get back to this and my other R-dev activities sometime next month.)

TDJorgensen commented 2 years ago

Okay, I copied your changes from the first commit on this PR, to make it robust to different ordering of names. Then I just used my suggested .emlav_recover_data() change, which avoids the missing-data problems with reproducing the data, as well as the issue I posted about the other day.
d7d8b27 So I think this pull request can be closed, if you don't end up finding the same issues you had before.

simsem / semTools

make emmeans more robust to different parameter names #109