knickodem / kfa

k-fold cross validation for factor analysis
GNU General Public License v3.0

find_k power analysis N greater than actual N #13

Closed by psanker 1 year ago

psanker commented 1 year ago

Hey Kyle,

We're running into an interesting issue where kfa determines that there are not enough observations to run the cross-validation, even though we have more than 1,000 observations for 5 variables. After hunting around, it seems that `power.n` can sometimes be greater than `n`: https://github.com/knickodem/kfa/blob/234f60b7d8115bca2520d39029306f6793db12d8/R/find_k.R#L47-L52

It seems it might be related to the degrees of freedom calculation:

https://github.com/knickodem/kfa/compare/bc27e49...62ad2a7#diff-00561a07c782d0fcb5d1ce92da8380968424636f3571559d7e3c40c3985636b9

Note the sign change from + to -. Could this be related to switching to geomin and then reverting?

Please see the reprex attached for the error message. The attached data is the input.

test.csv

R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid"
Platform: x86_64-apple-darwin17.0 (64-bit)

r$> dat <- read.csv("test.csv")

r$> inp <- na.omit(dat)

r$> mod <- kfa::kfa(inp, ordered = TRUE)
Power analysis indicates the sample size is too small for k-fold cross validation.
    Adjust assumptions or manually create a single holdout sample.
[1] "Using 3 cores for parallelization."
[1] "Finished EFAs. Starting CFAs"

r$>
Session Information

```
r$> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] nvimcom_0.9-143

loaded via a namespace (and not attached):
[1] splines_4.2.1 foreach_1.5.2 prodlim_2019.11.13 assertthat_0.2.1
[5] askpass_1.1 semTools_0.5-6 stats4_4.2.1 renv_0.15.5
[9] globals_0.16.0 pbivnorm_0.6.0 gdtools_0.2.4 ipred_0.9-13
[13] pillar_1.9.0 lattice_0.20-45 glue_1.6.2 pROC_1.18.0
[17] uuid_1.1-0 digest_0.6.29 GPArotation_2022.4-1 hardhat_1.2.0
[21] colorspace_2.0-3 recipes_1.0.1 htmltools_0.5.2 Matrix_1.4-1
[25] plyr_1.8.7 timeDate_4021.104 pkgconfig_2.0.3 listenv_0.8.0
[29] caret_6.0-93 purrr_1.0.1 scales_1.2.1 gower_1.0.0
[33] officer_0.6.1 lava_1.6.10 tibble_3.1.8 openssl_2.0.2
[37] generics_0.1.3 ggplot2_3.3.6 withr_2.5.0 nnet_7.3-17
[41] cli_3.6.0 mnormt_2.1.0 kfa_0.2.2 survival_3.3-1
[45] magrittr_2.0.3 evaluate_0.16 fansi_1.0.3 future_1.27.0
[49] parallelly_1.32.1 doParallel_1.0.17 nlme_3.1-157 MASS_7.3-59
[53] xml2_1.3.3 class_7.3-20 textshaping_0.3.6 tools_4.2.1
[57] data.table_1.14.2 lifecycle_1.0.3 stringr_1.4.1 flextable_0.7.3
[61] munsell_0.5.0 zip_2.2.0 compiler_4.2.1 systemfonts_1.0.4
[65] rlang_1.1.0 grid_4.2.1 iterators_1.0.14 lavaan_0.6-12
[69] base64enc_0.1-3 rmarkdown_2.14 ModelMetrics_1.2.2.2 gtable_0.3.0
[73] codetools_0.2-18 DBI_1.1.2 reshape2_1.4.4 R6_2.5.1
[77] simstandard_0.6.3 lubridate_1.8.0 knitr_1.39 dplyr_1.0.9
[81] fastmap_1.1.0 future.apply_1.9.0 utf8_1.2.2 ragg_1.2.5
[85] stringi_1.7.8 parallel_4.2.1 Rcpp_1.0.8.3 vctrs_0.6.3
[89] rpart_4.1.16 tidyselect_1.1.2 xfun_0.32
```

knickodem commented 1 year ago

Hi Patrick,

Yes, `power.n` can be greater than `n`. The former is the output from `semTools::findRMSEAsamplesize` and is the minimum sample size needed to reject or retain a model based on the values of the `rmsea0` and `rmseaA` arguments. For instance,

semTools::findRMSEAsamplesize(rmsea0 = .05, rmseaA = .08, df = 5)
[1] 1464
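
For intuition, here is a rough base-R sketch of the kind of calculation behind that number. It assumes the usual noncentral chi-square power analysis for RMSEA (MacCallum, Browne, & Sugawara, 1996) with `rmsea0 < rmseaA`, a target power of .80, and alpha = .05. The helpers `rmsea_power` and `min_n_for_power` are hypothetical names for illustration, not functions from kfa or semTools, and the results may not match semTools exactly.

```r
# Approximate power to reject H0: RMSEA = rmsea0 when the true RMSEA is rmseaA.
# Hypothetical helper for illustration; assumes rmsea0 < rmseaA.
rmsea_power <- function(n, df, rmsea0 = .05, rmseaA = .08, alpha = .05) {
  ncp0 <- (n - 1) * df * rmsea0^2            # noncentrality under the null RMSEA
  ncpA <- (n - 1) * df * rmseaA^2            # noncentrality under the alternative RMSEA
  crit <- qchisq(1 - alpha, df, ncp = ncp0)  # rejection threshold under the null
  pchisq(crit, df, ncp = ncpA, lower.tail = FALSE)
}

# Smallest n that reaches the target power, found by a simple linear search.
# Returns NA if no n up to max_n is large enough.
min_n_for_power <- function(df, power = .80, max_n = 10000, ...) {
  for (n in 5:max_n) {
    if (rmsea_power(n, df, ...) >= power) return(n)
  }
  NA_integer_
}

min_n_for_power(df = 5)    # a large n is needed when df is small
min_n_for_power(df = 395)  # a much smaller n suffices when df is large
```

Because both noncentrality parameters scale with `df`, a model with few degrees of freedom needs a much larger sample to reach the same power.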

The current df calculation is correct. The change you highlight was part of a temporary miscalculation when we switched from using the df calculation from an EFA model to that of a typical CFA model.

In factor analysis, as with SEM more broadly, power is driven largely by the number of variables in the model rather than the sample size. Consequently, a large sample is suggested when only 5 variables are in the model, but an unreasonably small sample size is suggested when there are a lot of variables, which is why we include the (now named) `min.nk` argument. The example below is for 30 items and 5 factors, so df = 395:

semTools::findRMSEAsamplesize(rmsea0 = .05, rmseaA = .08, df = 395)
[1] 56
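
The df values in these examples can be reproduced with a short sketch. It assumes a simple-structure CFA in which each item loads on exactly one factor and factor variances are fixed to 1; `cfa_df` is a hypothetical helper for illustration, and the calculation inside kfa may be written differently.

```r
# Degrees of freedom for a simple-structure CFA with p items and m factors.
# Assumes each item loads on one factor and factor variances are fixed to 1.
cfa_df <- function(p, m) {
  moments <- p * (p + 1) / 2         # unique variances and covariances among items
  params  <- 2 * p + m * (m - 1) / 2 # loadings + residual variances + factor covariances
  moments - params
}

cfa_df(p = 30, m = 5)  # 395, the 30-item, 5-factor example
cfa_df(p = 5, m = 1)   # 5, the 5-variable, 1-factor example
```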

I've updated the nomenclature so the nk suffix refers to sample size per fold and added more to the documentation to hopefully clarify each component of the output (864d6c0). Let me know if you have any suggestions.
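
To make the per-fold idea concrete, here is a minimal sketch of how a required per-fold sample size limits the number of folds. It assumes the per-fold requirement is the larger of `power.n` and `min.nk` and that the fold count is capped at some maximum; `folds_supported` and its `max.k` argument are hypothetical and this is not the code in `find_k`. The sample size of 1200 is made up for illustration.

```r
# How many folds can n observations support if every fold must contain
# at least max(power.n, min.nk) observations? Hypothetical helper, not kfa code.
folds_supported <- function(n, power.n, min.nk, max.k = 10) {
  n_per_fold <- max(power.n, min.nk)  # min.nk guards against unreasonably small power.n
  min(floor(n / n_per_fold), max.k)
}

folds_supported(n = 1200, power.n = 1464, min.nk = 200)  # 0: not even one fold is large enough
folds_supported(n = 1200, power.n = 56, min.nk = 200)    # 6: min.nk sets the per-fold size
```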