simsem / semTools

Useful tools for structural equation modeling

Why are different fit indices for do.fit = TRUE and do.fit = FALSE displayed in the ParcelAllocation function? #118

HHKK44 closed this issue 1 year ago

HHKK44 commented 1 year ago

Hello,

(Please disregard the error messages below: they are displayed in reprex but not in my console, and therefore have no effect on my results. Side question: Why do error messages appear much more often in reprex that have no meaning for the real calculation in R?)

1st main question:

It is not yet 100% clear to me why the variant "do.fit = TRUE" yields different fit values than saving the 100 data sets beforehand via "do.fit = FALSE" and then fitting a confirmatory factor analysis to each of them. In both variants, 100 data sets are formed, 100 sets of fit indices (rmsea, srmr, etc.) are calculated via a CFA, and the averaged indices are output, aren't they?

But why are the fit indices different for the two functions/ways?

The fit values from "do.fit = TRUE" seem more plausible than those from "do.fit = FALSE". The latter are reminiscent of a saturated model (extremely high p-value, with corresponding rmsea and srmr values, etc.), yet many degrees of freedom are displayed at the same time.

I have attached the results of the two fit analyses with pictures.

2nd main question:

The modification indices I get for fit.parcels (the cfa.mi fit), on the other hand, seem to be usable. Or can I not trust them when the fit values look so unusual at do.fit = FALSE?

(Side question: how exactly is this sentence to be understood? "Statistics for releasing one or more fixed or constrained parameters in model can be calculated by pooling the gradient and information matrices across imputed data sets in a method proposed by Mansolf, Jorgensen, & Enders (2020)". Does this mean that enough information is suddenly available so that restrictions can be lifted while our dfs are still positive? What would be a simple example of this?)

#### 1.4 Testing of all measurement models as CFA - Parceling: MHL / ATSPH ####

library(semTools)
#> Loading required package: lavaan
#> This is lavaan 0.6-12
#> lavaan is FREE software! Please report any bugs.
#> 
#> ###############################################################################
#> This is semTools 0.5-6
#> All users of R (or SEM) are invited to submit functions or ideas for functions.
#> ###############################################################################
library(lavaan)

mod4 <- '
    # Item Syntax

    EP =~ 
    ATSPHS1 + ATSPHS2 + ATSPHS3 + 
    ATSPHS4 + ATSPHS5 + ATSPHS6 + 
    ATSPHS7 + ATSPHS8 + ATSPHS9 + ATSPHS10
    ST =~ STG1 + STG2 + STG3 + STG4 + STG5
    MHL =~   
    MHLS1 + MHLS2 + MHLS3 + MHLS4 + MHLS5 + 
    MHLS6 + MHLS7 + MHLS8 + MHLS9 + MHLS10 +
    MHLS11 + MHLS12 + MHLS13 +  MHLS14 +    MHLS15 +    
    MHLS16 + MHLS17 +   MHLS18 +    MHLS19 +    MHLS20 +
    MHLS21 + MHLS22 + MHLS23 +  MHLS24 + MHLS25 + 
    MHLS26 + MHLS27 + MHLS28 + MHLS29 + MHLS30 + 
    MHLS31 + MHLS32 + MHLS33 + MHLS34 + MHLS35'

mod4_ang1 <- '
    # Measurement model
    EP =~ par1 + par2 + par3 + par4
    ST =~ STG1 + STG2 + STG3 + STG4 + STG5
    MHL =~ par5 + par6 + par7 + par8

    # Residual covariance
    STG1 ~~ STG2'

## Names of the parcels
(parcel.names <- paste0("par", 1:8))
#> [1] "par1" "par2" "par3" "par4" "par5" "par6" "par7" "par8"

parcelAllocation(mod4_ang1, df, parcel.names, mod4, 
                 nAlloc = 100, fun = "cfa", alpha = 0.05, 
                 fit.measures = c("chisq", "df", "cfi", "tli", "rmsea", "srmr"),
                 show.progress = FALSE, iseed = 12345, do.fit = TRUE, 
                 return.fit = TRUE, warn = FALSE)
#> Error in lavaan::lavaan(model = "\n    # Item Syntax\n    \n    EP =~ \n    ATSPHS1 + ATSPHS2 + ATSPHS3 + \n    ATSPHS4 + ATSPHS5 + ATSPHS6 + \n    ATSPHS7 + ATSPHS8 + ATSPHS9 + ATSPHS10\n    ST =~ STG1 + STG2 + STG3 + STG4 + STG5\n    MHL =~   \n    MHLS1 + MHLS2 + MHLS3 + MHLS4 + MHLS5 + \n    MHLS6 + MHLS7 + MHLS8 + MHLS9 +\tMHLS10 +\n    MHLS11 + MHLS12 + MHLS13 +\tMHLS14 +\tMHLS15 +\t\n    MHLS16 + MHLS17 +\tMHLS18 +\tMHLS19 +\tMHLS20 +\n    MHLS21 + MHLS22 + MHLS23 +\tMHLS24 + MHLS25 + \n    MHLS26 + MHLS27 + MHLS28 + MHLS29 + MHLS30 + \n    MHLS31 + MHLS32 + MHLS33 + MHLS34 + MHLS35", : lavaan ERROR: data is a function; it should be a data.frame

dataList <- parcelAllocation(mod4_ang1, df, parcel.names, mod4, nAlloc = 100, iseed = 12345,
                             do.fit = FALSE)
#> Error in lavaan::lavaan(model = "\n    # Item Syntax\n    \n    EP =~ \n    ATSPHS1 + ATSPHS2 + ATSPHS3 + \n    ATSPHS4 + ATSPHS5 + ATSPHS6 + \n    ATSPHS7 + ATSPHS8 + ATSPHS9 + ATSPHS10\n    ST =~ STG1 + STG2 + STG3 + STG4 + STG5\n    MHL =~   \n    MHLS1 + MHLS2 + MHLS3 + MHLS4 + MHLS5 + \n    MHLS6 + MHLS7 + MHLS8 + MHLS9 +\tMHLS10 +\n    MHLS11 + MHLS12 + MHLS13 +\tMHLS14 +\tMHLS15 +\t\n    MHLS16 + MHLS17 +\tMHLS18 +\tMHLS19 +\tMHLS20 +\n    MHLS21 + MHLS22 + MHLS23 +\tMHLS24 + MHLS25 + \n    MHLS26 + MHLS27 + MHLS28 + MHLS29 + MHLS30 + \n    MHLS31 + MHLS32 + MHLS33 + MHLS34 + MHLS35", : lavaan ERROR: data is a function; it should be a data.frame

fit.parcels <- cfa.mi(mod4_ang1, data = dataList, std.lv = TRUE)
#> Error in is.data.frame(data): object 'dataList' not found
summary(fit.parcels, fit.measures = TRUE) # uses Rubin's rule
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': object 'fit.parcels' not found
fitMeasures(fit.parcels, fit.measures = c("chisq", "rmsea","srmr","cfi", "tli"))
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'fitMeasures': object 'fit.parcels' not found

#######

modindices.mi(fit.parcels, sort. = TRUE, minimum.value = 5) # default: Li et al.'s (1991) "D2" method
#> Error in stopifnot(inherits(object, "lavaan.mi")): object 'fit.parcels' not found
# modindices.mi(fit.parcels, test = "D1") # Li et al.'s (1991) "D1" method

Created on 2023-02-18 with reprex v2.0.2

[Attached: five screenshots of the fit output, taken 2023-02-18]
TDJorgensen commented 1 year ago

Why do error messages appear much more often in reprex that have no meaning for the real calculation in R?

It is not a reproducible example ("reprex" for short) if you do not provide both the data and the R syntax, so that someone else can reproduce it themselves. The R package called reprex doesn't do that; it just captures the output for you to copy/paste, which I suppose can facilitate creating an actual reprex, but it is neither necessary nor sufficient. I'm not sure how the package works, but paying closer attention to the content of the error messages reveals the source of the errors. The first two errors result from calling parcelAllocation(), and those messages both end with:

lavaan ERROR: data is a function; it should be a data.frame

When I look back at the call that produced this error message, I see that you passed the df() function to the data= argument. Perhaps in some other syntax file, you created a data frame that you assigned to an object called df, but that data was not available in the workspace when you ran the call in your posted example.

The remaining errors logically follow from the first pair. Because dataList <- parcelAllocation(mod4_ang1, df, ...) resulted in an error, the dataList object was never created, which resulted in the message:

Error in is.data.frame(data): object 'dataList' not found

which followed the call to cfa.mi(). And because that call produced an error, the fit.parcels object was never created, which is what the next error message tells you, and so forth.
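
The first error in that chain can be reproduced without any SEM package. The base-R sketch below shows that, when no data frame named df has been created, the name resolves to the F-distribution density function from the stats package. The helper check_data() is hypothetical (it is not part of lavaan or semTools) and only illustrates a defensive check one might run before fitting.

```r
# stats::df is the density function of the F distribution; in a fresh
# session the bare name `df` resolves to it unless you have assigned
# your own object to that name.
is_fun <- is.function(stats::df)      # TRUE
is_dat <- is.data.frame(stats::df)    # FALSE

# Hypothetical helper (not from lavaan/semTools): stop early with a
# readable message if the object is not a data frame.
check_data <- function(x) {
  if (!is.data.frame(x)) {
    stop("expected a data.frame, got an object of class ", class(x)[1])
  }
  invisible(x)
}

my_data <- data.frame(x = 1:3)        # a real data frame passes silently
check_data(my_data)

# Passing the function instead triggers the error; capture its message
msg <- tryCatch(check_data(stats::df), error = function(e) conditionMessage(e))
print(msg)
```

Giving the data object a name other than df (for example my_data) avoids this silent name clash altogether.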

Some advice for making a future reprex, if you cannot share your actual data: Try using the help-page example data to illustrate the problem, or if that does not suffice, try simulating data that reproduces your problem. Then the simulation syntax can be provided, so it is fully reproducible.
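
As a minimal sketch of the "simulate data" suggestion, the base-R code below generates a small one-factor data set that could be posted together with the analysis syntax. The sample size, loadings, and item names are all made up for illustration; they are not taken from the original poster's data.

```r
set.seed(12345)                        # fixed seed so everyone gets the same data

n <- 300                               # hypothetical sample size
eta <- rnorm(n)                        # scores on one common factor

# Hypothetical standardized loadings for six items
loadings <- c(0.8, 0.7, 0.6, 0.7, 0.5, 0.6)

# Each item = loading * factor + unique error, scaled to unit variance
items <- sapply(loadings, function(l) l * eta + rnorm(n, sd = sqrt(1 - l^2)))

sim_data <- as.data.frame(items)
names(sim_data) <- paste0("item", 1:6)

str(sim_data)                          # quick look at what would be shared
```

Because the seed and the generating code are included, anyone can rebuild sim_data exactly and rerun the analysis that shows the problem.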

But why are the fit indices different for the two functions/ways?

They aren't comparable. When you set do.fit=TRUE, the model is fitted to each data set, but each analysis is treated as independent. The output just shows you what the average chi-squared value is, and the average of each fit index (each one being calculated using the chi-squared obtained by analyzing that set of parcels). When you set do.fit=FALSE to save the list of data sets and analyze with cfa.mi(), your chi-squared is not averaged across data sets. It is pooled using one of the methods described in the ?lavTestLRT.mi help-page references, and the fit indices are calculated from that pooled chi-squared statistic.
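
The difference between the two routes can be sketched in base R with a textbook RMSEA formula. Everything numeric here is made up (the chi-squared values, degrees of freedom, sample size, and the stand-in pooled statistic), and the formula is a common textbook version that may differ slightly from what lavaan computes; the point is only the order of operations.

```r
# Textbook RMSEA from a chi-squared statistic
# (some software uses N - 1 instead of N in the denominator)
rmsea <- function(chisq, df, N) sqrt(pmax(chisq - df, 0) / (df * N))

chisq_by_alloc <- c(40, 95, 150, 60, 210)  # made-up statistics, one per allocation
model_df <- 62                             # made-up degrees of freedom
N <- 300                                   # made-up sample size

# Route 1 (averaging): one RMSEA per allocation, then report the mean
avg_rmsea <- mean(rmsea(chisq_by_alloc, model_df, N))

# Route 2 (pooling): a single pooled statistic is obtained first
# (made-up stand-in value here), and the index is derived once from it
pooled_chisq <- 55
pooled_rmsea <- rmsea(pooled_chisq, model_df, N)

print(c(average = avg_rmsea, pooled = pooled_rmsea))
```

Whenever the pooled statistic falls below the degrees of freedom, an index computed this way is exactly zero, even though several individual allocations have a positive value.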

can I not trust them when the fit values look so unusual at do.fit = FALSE?

They are not unusual. The chi-squared is a pooled test statistic, as are the chi-squared statistics (modification indices) returned by modindices.mi().

Side question

That simply informs you that there are different pooling methods. You can read the cited article for more information.

HHKK44 commented 1 year ago

Hi Terry,

thanks for your feedback.

I first read the term reprex in your response on Stackoverflow. Then I searched for it and it suggested the function to me. I didn't know if that's what you meant or a completely reproducible example. I'm a Master's student and I worked with reprex for the first time. So I apologise for my lack of knowledge and confusion with the error codes! The error messages happened exclusively because of the reprex function.

I find the values output via the do.fit = false values quite unusual: chi-square p-value = 1.0, CFI = 1.0, TLI = 1.099, rmsea = 0.0, srmr = 0.064.

vs.

do.fit = TRUE: CFI = 0.911, TLI = 0.877, rmsea = 0.80, srmr = 0.079

Don't you find that unusual? So which values are more meaningful? Those of the do.fit = FALSE or those of the do.fit = TRUE version?

Thanks for your effort!

Kind regards

Lukas

TDJorgensen commented 1 year ago

I find the values output via the do.fit = false values quite unusual

I don't see why you are fixated. The pooled test statistic is much lower than the range of tests across data sets, which reflects a lot of "between-imputation" variability (or in this context, what Sterba called "parcel-allocation variability"). It is a lot of uncertainty due to arbitrary parcels yielding very different results across allocations. There is nothing "good" about the average of those test statistics. The pooled statistic reflects this lack of precision with a corresponding drop in power (lower test statistic), and you would probably see its FMI is low if you added "fmi" to your fitMeasures(). The other fit indices just follow from that, so they are not surprising either. Except that the SRMR for pooled results is about the same as the average across allocations, but SRMR is based on estimates, not on the chi-squared statistic.
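
To make the effect of between-allocation variability concrete, here is a base-R sketch of the "D2" pooling rule of Li et al. (1991), which combines chi-squared statistics across data sets. The statistics and degrees of freedom are made up, and D2 is used only as an illustration: it is an assumption that it resembles whatever method produced the pooled output in this thread, since cfa.mi() results can be pooled by other methods as well.

```r
# Made-up chi-squared statistics from m = 5 parcel allocations
chisq <- c(45, 90, 160, 70, 220)
k <- 62                          # made-up model degrees of freedom
m <- length(chisq)

avg_chisq <- mean(chisq)         # what simple averaging would report

# Relative increase in variance: driven by how much sqrt(chi-squared)
# varies from one allocation to the next
r <- (1 + 1 / m) * var(sqrt(chisq))

# D2 statistic (F-like scale); large r pulls it down sharply
D2 <- (avg_chisq / k - (m + 1) / (m - 1) * r) / (1 + r)
D2 <- max(D2, 0)                 # negative values are truncated at zero

pooled_chisq <- D2 * k           # back on the chi-squared scale

print(c(average = avg_chisq, pooled = pooled_chisq))
```

With these numbers the average statistic is 117, but the variability penalty exceeds it, so the pooled statistic is truncated to 0. A pooled chi-squared near zero would in turn yield a p-value of 1, an RMSEA of 0, and a CFI of 1, which is the pattern described above.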

which values are more meaningful?

Refer to my response to your original post on StackExchange.