simsem / semTools

Useful tools for structural equation modeling

Error in pooling SE? #22

Closed · davidfbradley closed this issue 7 years ago

davidfbradley commented 7 years ago

Hi, I have been using the runMI function, which has been generally fantastic. I ran into the "infinite recursion" problem in the current CRAN version of semTools, and downloaded the development/GitHub versions of lavaan and semTools, which did fix that problem. However, I came across what I think is an error in pooling the standard errors. Namely, the standard errors seem to be a little less than m times higher than when I use lavaan's sem function, where m is the number of imputations (examples, using 5 imputations: 0.254 vs 0.059; 0.186 vs 0.043). I also reran some previous analyses and compared the new output to what I got with an older version of semTools, and the same issue is evident (example, m=5: .165 vs .047). The issue is in all portions of the output (e.g., latent variable loadings, regressions, thresholds, variances). I am using lavaan 0.6-1.1125, semTools 0.4-15.910, Amelia II 1.7.4. Is this a problem with the function, or is this user error? Here is my code:

test_H1a <- runMI(model = model_H1a, data = x_H1, fun="sem", m=5, ordered=([snip]), estimator="WLSMV", test = "Satorra.Bentler" )

TDJorgensen commented 7 years ago

Higher SEs with multiple imputation than with listwise deletion only suggest that the fraction of missing information is high. I see no evidence of a bug; that the ratio happens to be close to m is just a coincidence. If you try 100 imputations, are the pooled SEs approximately 100 times larger? I suspect they might actually be smaller than with 5 imputations, because the between-imputation variability will have less Monte Carlo sampling error. Here is a script that shows you get exactly the same SEs with multiple imputation as with complete-case analysis when the data are complete (i.e., there is no between-imputation variance).

library(semTools)
example(cfa)   # runs the ?cfa example: defines HS.model and prints the complete-data summary
## 5 "imputations" (complete data, so all identical)
fakeImps <- list(HolzingerSwineford1939,
                 HolzingerSwineford1939,
                 HolzingerSwineford1939,
                 HolzingerSwineford1939,
                 HolzingerSwineford1939)
## pooled SEs match the complete-data SEs printed by example(cfa) above
summary(cfa.mi(HS.model, data = fakeImps))

If the error you suspect were really there, the SE column would not be identical in the two summaries.
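To see why: the pooled SE follows Rubin's rules, where the total sampling variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. With identical "imputations" the between-imputation variance is zero, so the pooled SE equals the complete-data SE. A minimal base-R illustration of the rule (a sketch, not the semTools internals):

## Rubin's rules for pooling an estimate and its SE across m imputations
poolSE <- function(est, se) {
  m <- length(est)
  W <- mean(se^2)               # average within-imputation variance
  B <- var(est)                 # between-imputation variance
  total <- W + (1 + 1/m) * B    # total sampling variance
  c(pooled.est = mean(est), pooled.se = sqrt(total))
}
## identical "imputations": B = 0, so the pooled SE equals the complete-data SE
poolSE(est = rep(0.5, 5), se = rep(0.1, 5))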

davidfbradley commented 7 years ago

Hi Terry, thanks for your response.

I replicated your example, and yes, I too got identical results. That's a good argument that it's not a simple "forgot to divide by m" kind of problem; I'm a bit of a simpleton when it comes to R, and I'm stumbling my way through statistics in general, so I'm not surprised I got that wrong! You are also correct that increasing the number of imputations does decrease the SE, though even m = 100 does not reach the result I get from sem (example: sem = .122, m = 100: .390, m = 5: .442), nor does it reach the results I got using runMI a week ago (the SEs were higher using runMI [m = 5] than with sem, but not 3-4 times higher; more like .125 with runMI vs. .122 with sem).

This is a dataset with minimal missingness (i.e., in the whole dataset, no case is missing more than 3% of its values, no variable is missing more than 1%, and running these particular analyses with listwise deletion removes only 20 out of 440 cases), so I am skeptical that missingness could have such a strong effect on the standard errors. And again, such a huge effect was not evident when I used runMI with the same dataset a week or two ago, with earlier builds of lavaan and semTools. That does suggest to me some sort of error somewhere in one of the packages I am using (not the simple "divide by m" error I guessed at first), but I wouldn't presume to guess where the problem lies.

In any event, I apologize for sending you on a wild goose chase for that particular bug!

TDJorgensen commented 7 years ago

No need to apologize; we encourage bug reports! I would not expect results to differ after updating, because the internal changes to lavaan and semTools should not have affected estimated parameters or sampling variances. Did you use the same imputed data sets? With only 5 imputations, you can expect a lot of variability from one set of 5 imputations to another, which is why it is good to use more.

20/440 is still about 5% of your sample, although your fraction of missing information could be more or less than that (it depends on the missing-data mechanism and on how informative the auxiliaries in your imputation model are). Without knowing you used the same imputed data sets with both versions, I would not rush to the conclusion that the pooling procedures have changed.

Since you can still download older versions from CRAN, you can compare results by setting the seed with set.seed() before imputing the data, then using the old and new versions of the packages to fit the same model to the same imputed data. If that shows different results, please share your data and syntax with me via email so I can find what is going wrong.
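A minimal sketch of that comparison, assuming Amelia is used for the imputation and that runMI() accepts a list of already-imputed data sets (as cfa.mi() does in the example above); x_H1 and model_H1a are the objects from the original post, and the ords/noms variable names are hypothetical placeholders:

library(Amelia)
set.seed(12345)   # any fixed seed, so both runs see identical imputations
imps <- amelia(x_H1, m = 5, ords = c("ord1", "ord2"), noms = c("nom1"))

## fit the same model to the same imputed data sets, once per package version
fit <- runMI(model = model_H1a, data = imps$imputations, fun = "sem",
             estimator = "WLSMV", test = "Satorra.Bentler")
summary(fit)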

davidfbradley commented 7 years ago

Hi Terry, thanks for your patience! I found my error. When you mentioned the auxiliaries of the imputation model, I started poking around with how I was handling the imputations and (eventually) realized that I failed to specify certain variables as nominal or ordinal using the miArgs argument in runMI. When I include that information in the runMI function, the standard errors are pretty darn close to what I get when I run the analysis using the regular sem function. Thanks again for your help!
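For future readers, the corrected call would look roughly like the sketch below; the ords/noms variable names are placeholders, and miArgs is the list of extra arguments that runMI() forwards to the imputation function (here, Amelia's amelia()):

test_H1a <- runMI(model = model_H1a, data = x_H1, fun = "sem", m = 5,
                  miPackage = "Amelia",
                  miArgs = list(ords = c("ord1", "ord2"),   # ordinal variables
                                noms = c("nom1")),          # nominal variables
                  ordered = c("ord1", "ord2"),
                  estimator = "WLSMV", test = "Satorra.Bentler")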