Parallelization of emmeans() for gls models is not supported

FehrAaron commented 1 year ago

Describe the bug

Parallelization of the emmeans function is not possible, if the input model is a gls model made with the nmle package. The code below works when run not in parallel (%do% instead of %dopar%) and can be parallelized if a normal linear model (lm(value ~ timepoint * factor, data = lm_in)) is used instead.

To reproduce


set.seed(123)

#Generate mock data
df = rbind(data.frame(
    factor = factor(rep(1:2,10), labels = c("F1","F2")),
    time = factor(rep(1:5, each = 4)),
    value = rnorm(20, 100, 10),
    Sample = 1),
    data.frame(
    factor = factor(rep(1:2,10), labels = c("F1","F2")),
    time = factor(rep(1:5, each = 4)),
    value = rnorm(20, 150, 11),
    Sample = 2))

#Setup cores
num_cores = 2
cl = makeCluster(num_cores)
registerDoParallel(cl)

#Calculate estimated marginal means
result = foreach(i = c(1,2), .combine = rbind, .packages = c("dplyr","emmeans","nlme")) %dopar% {

  lm_in = df %>%
    dplyr::filter(Sample == i)

  model <- gls(value ~ time * factor, data = lm_in, weights = varIdent(form = ~ 1 | factor))

  est_marginal_means = emmeans(model, specs = ~ time*factor)

  as.data.frame(est_marginal_means)

}

stopCluster(cl)

Expected behavior

The result dataframe should be formed containting the estimated marginal means for all samples.

Instead, an error occurs:

Error in { : task 1 failed - "Perhaps a 'data' or 'params' argument is needed"

stop(simpleError(msg, call = expr))
e$fun(obj, substitute(ex), parent.frame(), e$data)
foreach(i = c(1, 2), .combine = rbind, .packages = c("dplyr", "emmeans", "nlme")) %dopar% { lm_in = df %>% dplyr::filter(Sample == i) model <- gls(value ~ time * factor, data = lm_in, weights = varIdent(form = ~1 | ...

Additional context

Vectorization of the code using lapply also did not work and lead to the same error.

FehrAaron commented 1 year ago

Ok, I guess I found the solution, by just adding the data to emmeans it works. But I have no idea why it is required only for the gls model and only when running in parallel. The change in the code above would be:

est_marginal_means = emmeans(model, specs = ~ time*factor)

rvlenth commented 1 year ago

Hmmm. I'm glad you found a workaround. The basic issue is that emmeans() (actually ref_grid()) requires two steps -- one to reconstruct the dataset, and the other to put together the reference grid. The data-recovery step is defined by a recover_data() method which is different for each model class. That method for gls is a bit dicey because of complications with weight functions and such (see the code for emmeans:::recover_data.gls.) I think that is why it hits a snag.

FehrAaron commented 1 year ago

Thanks, I will have a look!

rvlenth / emmeans