simsem / semTools

Useful tools for structural equation modeling

fix missing data issue #119

Open mattansb opened 1 year ago

mattansb commented 1 year ago

This is a WIP

mattansb commented 1 year ago

Hi @rvlenth, Thanks for your help on the issue of dealing with missing data. I have taken your advice and solved this by allowing the user to pass a data= argument with non-missing data, which I deal with in recover_data.lavaan().

For ref_grid() and emmeans() this seems to work fine.

However, for emtrends() I am getting estimation problems. With some debugging, I've found that when emtrends() is called, it calls recover_data() twice, but only passes the user's data= argument the first time. I'm assuming this is not intentional?

Thanks!

# remotes::install_github("mattansb/semTools") # install this PR
library(semTools)
library(emmeans)

data("mtcars")
raw_mtcars <- mtcars
mtcars$hp[1] <- NA

model <- " mpg ~ hp + drat + hp:drat "

fit <- sem(model, mtcars, missing = "fiml.x")

(rg <- ref_grid(fit, 
               lavaan.DV = "mpg",
               data = raw_mtcars))
#> 'emmGrid' object with variables:
#>     hp = 146.69
#>     drat = 3.5966

rg@linfct
#>   (Intercept)       hp     drat  hp:drat
#> 1           1 146.6875 3.596563 527.5708

(emM <- emmeans(fit, ~ drat, var = "hp",
                lavaan.DV = "mpg",
                data = raw_mtcars))
#>  drat emmean    SE  df asymp.LCL asymp.UCL
#>   3.6     20 0.614 Inf      18.8      21.2
#> 
#> Confidence level used: 0.95

emM@linfct
#>      (Intercept)       hp     drat  hp:drat
#> [1,]           1 146.6875 3.596563 527.5708

(emT <- emtrends(fit, ~ drat, var = "hp",
                 lavaan.DV = "mpg",
                 data = raw_mtcars))
#>  drat hp.trend SE df asymp.LCL asymp.UCL
#>   3.6   nonEst NA NA        NA        NA
#> 
#> Confidence level used: 0.95

emT@linfct
#>      (Intercept) hp drat hp:drat
#> [1,]           0 NA    0      NA

rvlenth commented 1 year ago

I'm not at all sure that it isn't intentional. The first call to ref_grid() includes a hook to return the data, so that we can set up the difference quotients. The second time we call it, we put another hook that bypasses some stuff already done in the first call. I'll have to look at it to see if we need the data the second time.

rvlenth commented 1 year ago

I think it is right the way it is. The setup for the first call to ref_grid() includes this code:

    rgargs = list(object = object, ...)
   . . .
    data = do.call("ref_grid", c(rgargs))

So if data is included in the ... in the emtrends() call, it gets passed to ref_grid(). As you can see, the purpose of that first call is to retrieve the data (via a special hook included in rgargs).

The second call to ref_grid() is

bigRG = do.call("ref_grid", c(rgargs, data = data))

where data is the data already retrieved in the first call.

So actually I'm confused by your statement that data is passed the first time and not the second, because what we actually have is data being explicitly passed the second time, and only implicitly passed the first time.

rvlenth commented 1 year ago

OK, my bad! It turns out that if rgargs is a list and data is a data frame with variables x and y, then c(rgargs, data = data) is a list with additional elements data.x and data.y. So I put in an additional line of code to add data itself to the list, and confirmed in debug mode that the right stuff is being passed. You can install from GitHub and see if it works right now.
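
For anyone following along, the c() behavior described here is easy to reproduce in base R. This is only a minimal sketch: rgargs and data below are illustrative stand-ins rather than the real objects inside emtrends(), and the list() wrapper is just one way to keep the data frame intact, not necessarily the line that was added to emmeans.

```r
# Stand-ins for the objects discussed above (values are illustrative only)
rgargs <- list(object = "fit")
data <- data.frame(x = 1:3, y = 4:6)

# c() treats the data frame as a list of columns and splices them in,
# so the result has elements data.x and data.y instead of one data element
flat <- c(rgargs, data = data)
names(flat)
#> [1] "object" "data.x" "data.y"

# Wrapping the data frame in list() keeps it as a single element
fixed <- c(rgargs, list(data = data))
names(fixed)
#> [1] "object" "data"
is.data.frame(fixed$data)
#> [1] TRUE
```

With the flattened version, do.call("ref_grid", flat) would receive data.x and data.y as separate arguments and no data argument at all.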

mattansb commented 1 year ago

Hey, this almost fixes the issue. I now get a new error:

(emT <- emtrends(fit, ~ drat, var = "hp",
                 lavaan.DV = "mpg",
                 data = raw_mtcars))
#> Error in lav_data_full(data = data, group = group, cluster = cluster,  : 
#>   lavaan ERROR: some (observed) variables specified in the model are not found in the dataset: mpg

This is because the data being passed to recover_data() the second time only has the data for the predictors (from the first pass of recover_data()), but lavaan needs the full multivariate/multivariable dataset.

Can we not simply pass the original data= argument the second time as well?

rvlenth commented 1 year ago

You can use the addl.vars argument, e.g., addl.vars = "mpg"

rvlenth commented 1 year ago

By the way, in your emmeans support code for lavaan, since you need the response variable, I recommend you retrieve its name from the response part of the model formula, and include that as addl.vars in the call to recover_data(). Then you won't have to rely on the user providing that in their call. See the help page for emmeans::recover_data.
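
A rough base-R sketch of the "retrieve its name from the response part of the formula" step. The helper name response_names() is hypothetical, and it assumes the model can be expressed as an ordinary two-sided R formula; a lavaan model string with several regressions or latent variables would need its own parsing. How the result is then handed to recover_data() as addl.vars is left to the support code.

```r
# Hypothetical helper (not part of emmeans or semTools):
# return the variable name(s) on the left-hand side of a two-sided formula
response_names <- function(f) {
  stopifnot(inherits(f, "formula"), length(f) == 3L)  # must have an LHS
  all.vars(f[[2L]])  # f[[2]] is the left-hand side expression
}

resp <- response_names(mpg ~ hp + drat + hp:drat)
resp
#> [1] "mpg"
```

The resulting character vector is what would be supplied as addl.vars, so the user would not need to name the response themselves.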

patc3 commented 1 year ago

Any update on this issue? Has this been added to simsem?

rvlenth commented 1 year ago

@patc3 No additional updates from me (emmeans) since my last comment. My repairs to recover_data are in the latest CRAN version and AFAIK, the additional notes (e.g., using addl.vars) will provide access to all the needed variables.

mattansb commented 1 year ago

Sorry @patc3 - I haven't found the time to get back to this just yet.