melissagwolf / dynamic

Dynamic Fit Index Cutoffs For Latent Variable Models
GNU Affero General Public License v3.0
15 stars 2 forks source link

Specify estimator #5

Closed WillemSleegers closed 1 year ago

WillemSleegers commented 1 year ago

Thanks for this great package (and fantastic associated papers)!

I was wondering whether it is possible to add an option to specify the estimator. In my code I'm using the MLR estimator but I don't think dynamic is using that estimator under the hood and instead only relies on ML. I think it might be beneficial to add different estimators, particularly the more robust ones.

melissagwolf commented 1 year ago

Good catch - dynamic does indeed rely on ML. This is because dynamic simulates multivariate normal data with non-missingness to estimate the cutoffs, so there's no need to use a MLR estimator (the results would be identical to ML). This is certainly a limitation of the method as it exists currently, but we felt that asking users to upload the skew and kurtosis for each of their items would be too burdensome.

We're currently working on a bootstrapping extension in which users can upload their raw data instead of their standardized loadings, which will better capture any non-normality, and we plan to allow users to choose MLR in that case. We're also working on extensions for categorical data (e.g., WLSMV estimator). Stay tuned!

WillemSleegers commented 1 year ago

Sorry for re-opening the issue but I'd like to ask a clarification question. I thought one of the use-cases for using the MLR estimator is to deal with non-normality, not missingness. I also expect different results because I get the robust variants of the fit indices (e.g., RMSEA.robust, CFI.robust), which I don't think are used in your simulation and so would produce different distributions to obtain the cutoffs from. So would it not be possible and/or better to have an argument to specify robust = TRUE or something like that and then use the robust variants of the indices rather than the normal ones?

melissagwolf commented 1 year ago

Correct, it's for non-normality. To calculate the cutoffs, we simulate the data using a package called simstandard which simulates multivariate normal data from user-inputted standardized loadings using mvtnorm. We then fit several models to it using lavaan: one of which is correctly specified and several which are incorrectly specified, and then return DFI cutoff values that can distinguish between the two distributions of cutoffs (e.g., correctly specified vs misspecified).

It sounds like what you're asking for is for us to return the RMSEA.robust and CFI.robust from the fitMeasures function, which are returned automatically when you use estimator=MLR. Let's use an empirical example to see what happens when we simulate multivariate normal data and then estimate a solution using MLR.

We'll use the standardized loadings from the HolzingerSwineford1939 dataset, and our model will be:

cfamod <- "F1 =~ x1 + x2 + x3 F2 =~ x4 + x5 + x6 F3 =~ x7 + x8 + x9"

We can run a cfa on this using cfa from lavaan and estimator=MLR. We then use standardizedSolution to get the standardized loadings, which gives us the following model statement:

standardizedmod <- "F1 =~ .772*x1 + .424*x2 + .581*x3 F2 =~ .852*x4 + .855*x5 + .838*x6 F3 =~ .570*x7 + .723*x8 + .665*x9"

We can then simulate the multivariate normal data using the simstandard package with 500 replications, as we do in dynamic. We begin by setting a seed for replicability.

set.seed(888) dat <- sim_standardized(standardizedmod,n=301*500, estimator=MLR)

We put the data in list form by adding an id for each of the 500 datasets:

id <- rep(1:301,times=500) dat_id <- cbind(id,dat) dat_list <- dat_id %>% dplyr::group_by(id) %>% tidyr::nest() %>% as.list()

We then run 500 cfa models with estimator=MLR so that we get back both types of fit indices:

fit_list <- purrr::map(dat_list$data, ~cfa(model=misspecifiedmod,data=., estimator="MLR"))

We extract the CFI, CFI.robust, RMSEA, and RMSEA.robust into a dataset.

fit_dat <- purrr::map_dfr(fit_list,~fitMeasures(., c("cfi","cfi.robust","rmsea","rmsea.robust")))

Running colMeans on fit_dat shows us that the robust versions of each fit index are virtually identical to their counterpart:

colMeans(fit_dat)

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

cfi | cfi.robust | rmsea | rmsea.robust -- | -- | -- | -- 0.997602 | 0.997503 | 0.010278 | 0.010547


Thus, because we are simulating multivariate normal data to produce the cutoffs, there is no reason to add an option to return cfi.robust and rmsea.robust. They will be the same because the underlying data that the cutoffs are based on is known to be multivariate normal.

This is because, e.g., rmsea.robust is calculated using the sample-corrected robust RMSEA equation in https://www.tandfonline.com/doi/abs/10.1080/00273171.2012.715252?journalCode=hmbr20, which is defined by the authors as "it estimates what the usual uncorrected RMSEA values would have been had the data been normal". Because the data is, in fact, normal, the estimates will be extremely similar.

WillemSleegers commented 1 year ago

Thanks for clarifying; that makes sense to me! I suppose then the solution to actually obtain the more robust estimates is by using something like bootstrapping, so I'm glad that you mentioned earlier that you are working on that!

melissagwolf commented 1 year ago

Glad to hear that helped! I'll try to remember to give you a heads up when we add the bootstrapping approach.