simsem / simsem

Package for simulating data for structural equation modeling
http://simsem.org/
56 stars 24 forks source link

Error when using sim() with ordinal data: `Error in cut.default(X[[g]][, o.idx], th.val): 'breaks' are not unique` #85

Closed isaactpetersen closed 5 months ago

isaactpetersen commented 5 months ago

I'm trying to perform a Monte Carlo simulation with ordinal data. The analysis model fits fine in lavaan. However, once I try to perform the simulation using sim(), I receive the following error:

Error in cut.default(X[[g]][, o.idx], th.val): 'breaks' are not unique

Here's a reprex:

## Load Libraries

library("lavaan")
#> This is lavaan 0.6-17
#> lavaan is FREE software! Please report any bugs.
library("simsem")
#> 
#> #################################################################
#> This is simsem 0.5-16
#> simsem is BETA software! Please report any bugs.
#> simsem was first developed at the University of Kansas Center for
#> Research Methods and Data Analysis, under NSF Grant 1053160.
#> #################################################################
#> 
#> Attaching package: 'simsem'
#> The following object is masked from 'package:lavaan':
#> 
#>     inspect
library("tidyverse")
#> Warning: package 'tidyr' was built under R version 4.3.2
#> Warning: package 'readr' was built under R version 4.3.2

# Specify Parameters for Simulation ----

sampleSizeGroup1 <- 390
sampleSizeGroup2 <- 390

itemLoading <- .7
standardThreshold1 <- 0.5
standardThreshold2 <- 1
standardThreshold3 <- 1.5
nonInvariantThreshold1 <- 0
nonInvariantThreshold2 <- 0.5
nonInvariantThreshold3 <- 1
numberRepetitions <- 1
mySeed <- 52242

# Population Model ----

populationModel_group1 <- 
  paste(
    '
    #Specify measurement model factor loadings
    dep =~ ', itemLoading, '*v1 + ', itemLoading, '*v2 + ', itemLoading, '*v3 + ', itemLoading, '*v4

     #Fix latent means to zero
     dep ~ 0

     #Fix latent variances to one
     dep ~~ 1*dep

     #Specify residual variances of manifest variables
     v1 ~~ 1*v1
     v2 ~~ 1*v2
     v3 ~~ 1*v3
     v4 ~~ 1*v4

     #Specify thresholds of manifest variables
     v1 | (', standardThreshold1, ')*t1 + (', standardThreshold2, ')*t2 + (', standardThreshold3, ')*t3
     v2 | (', standardThreshold1, ')*t1 + (', standardThreshold2, ')*t2 + (', standardThreshold3, ')*t3
     v3 | (', nonInvariantThreshold1, ')*t1 + (', nonInvariantThreshold2, ')*t2 + (', nonInvariantThreshold3, ')*t3
     v4 | (', nonInvariantThreshold1, ')*t1 + (', nonInvariantThreshold2, ')*t2 + (', nonInvariantThreshold3, ')*t3
    ',
    sep = "")

populationModel_group2 <- 
  paste(
    '
    #Specify measurement model factor loadings
    dep =~ ', itemLoading, '*v1 + ', itemLoading, '*v2 + ', itemLoading, '*v3 + ', itemLoading, '*v4

     #Fix latent means to zero
     dep ~ 0

     #Fix latent variances to one
     dep ~~ 1*dep

     #Specify residual variances of manifest variables
     v1 ~~ 1*v1
     v2 ~~ 1*v2
     v3 ~~ 1*v3
     v4 ~~ 1*v4

     #Specify thresholds of manifest variables
     v1 | (', standardThreshold1, ')*t1 + (', standardThreshold2, ')*t2 + (', standardThreshold3, ')*t3
     v2 | (', standardThreshold1, ')*t1 + (', standardThreshold2, ')*t2 + (', standardThreshold3, ')*t3
     v3 | (', standardThreshold1, ')*t1 + (', standardThreshold2, ')*t2 + (', standardThreshold3, ')*t3
     v4 | (', standardThreshold1, ')*t1 + (', standardThreshold2, ')*t2 + (', standardThreshold3, ')*t3
    ',
    sep = "")

# Generate population data from population parameter values ----

populationSize <- 100000

populationData_group1 <- simulateData(
  populationModel_group1,
  sample.nobs = populationSize,
  parameterization = "theta",
  seed = mySeed)

populationData_group2 <- simulateData(
  populationModel_group2,
  sample.nobs = populationSize,
  parameterization = "theta",
  seed = mySeed)

populationData <- data.frame(bind_rows(
  populationData_group1,
  populationData_group2,
  .id = "group"))

# Specify Analysis Model ----

analysisModel <- '
 #Specify measurement model factor loadings
 dep =~ c(NA, NA)*v1 + c(NA, NA)*v2 + c(NA, NA)*v3 + c(NA, NA)*v4

 #Fix latent means to zero
 dep ~ c(0, 0)*1

 #Fix latent variances to one
 dep ~~ c(1, 1)*dep

 #Specify residual variances of manifest variables
 v1 ~~ c(1, 1)*v1
 v2 ~~ c(1, 1)*v2
 v3 ~~ c(1, 1)*v3
 v4 ~~ c(1, 1)*v4

 #Specify intercepts of manifest variables
 v1 | c(NA, NA)*t1 + c(NA, NA)*t2 + c(NA, NA)*t3
 v2 | c(NA, NA)*t1 + c(NA, NA)*t2 + c(NA, NA)*t3
 v3 | c(NA, NA)*t1 + c(NA, NA)*t2 + c(NA, NA)*t3
 v4 | c(NA, NA)*t1 + c(NA, NA)*t2 + c(NA, NA)*t3
'

# Run analysis model on generated population data ----

analysisModel_fitted <- lavaan(
  analysisModel,
  data = populationData,
  ordered = paste0("v", 1:4),
  mimic = "Mplus",
  group = "group",
  missing = "pairwise",
  estimator = "WLSMV",
  std.lv = TRUE,
  parameterization = "theta")

# Monte Carlo Simulation to Run Models on Randomly Selected Samples from Generated Data ----

analysisModel_output <- simsem::sim(
  nRep = numberRepetitions,
  model = analysisModel,
  n = list(sampleSizeGroup1, sampleSizeGroup2),
  rawData = populationData,
  ordered = paste0("v", 1:4),
  group = "group",
  lavaanfun = "lavaan",
  missing = "pairwise",
  estimator = "WLSMV",
  std.lv = TRUE,
  parameterization = "theta",
  seed = mySeed)
#> Progress: 1 / 2 
#> Progress: 2 / 2
#> Warning in lavaan::parameterEstimates(out, standardized = TRUE, boot.ci.type =
#> citype, : lavaan WARNING: fmi only available if se = "standard"
#> Warning in lavaan::parameterEstimates(out, standardized = TRUE, boot.ci.type =
#> citype, : lavaan WARNING: fmi only available if estimator = "ML"
#> Warning in lavaan::parameterEstimates(out, standardized = TRUE, boot.ci.type =
#> citype, : lavaan WARNING: fmi only available if missing = "(fi)ml"
#> Warning in lavaan(model = lav, sample.nobs = sample.nobs, ...): lavaan WARNING: variance (theta) values for categorical variables are ignored
#>        if parameterization = "delta"!
#> Error in cut.default(X[[g]][, o.idx], th.val): 'breaks' are not unique

# Session Info ----

sessionInfo()
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> 
#> Random number generation:
#>  RNG:     L'Ecuyer-CMRG 
#>  Normal:  Inversion 
#>  Sample:  Rejection 
#>  
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/Chicago
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
#>  [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
#>  [9] ggplot2_3.5.1   tidyverse_2.0.0 simsem_0.5-16   lavaan_0.6-17  
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.10.3     utf8_1.2.4        generics_0.1.3    stringi_1.8.3    
#>  [5] hms_1.1.3         digest_0.6.35     magrittr_2.0.3    timechange_0.3.0 
#>  [9] evaluate_0.23     grid_4.3.1        fastmap_1.1.1     R.oo_1.26.0      
#> [13] R.cache_0.16.0    R.utils_2.12.3    fansi_1.0.6       scales_1.3.0     
#> [17] pbivnorm_0.6.0    mnormt_2.1.1      cli_3.6.2         rlang_1.1.3      
#> [21] R.methodsS3_1.8.2 munsell_0.5.1     reprex_2.1.0      withr_3.0.0      
#> [25] yaml_2.3.8        parallel_4.3.1    tools_4.3.1       tzdb_0.4.0       
#> [29] colorspace_2.1-0  vctrs_0.6.5       R6_2.5.1          stats4_4.3.1     
#> [33] lifecycle_1.0.4   fs_1.6.4          MASS_7.3-60.0.1   pkgconfig_2.0.3  
#> [37] pillar_1.9.0      gtable_0.3.5      glue_1.7.0        xfun_0.43        
#> [41] tidyselect_1.2.1  rstudioapi_0.16.0 knitr_1.46        htmltools_0.5.8.1
#> [45] rmarkdown_2.26    compiler_4.3.1    quadprog_1.5-8

Created on 2024-05-03 with reprex v2.1.0

TDJorgensen commented 5 months ago

I prefer not to install the mountain of tidyverse packages on my computer. Your reprex does not need the bind_rows() function:

populationData <- data.frame(rbind(
    cbind(populationData_group1, group = 1),
    cbind(populationData_group2, group = 2)))

This took me 5 hours to track down a very obscure little bug. I hope no one has been using a large sample as a sampling frame ("population" with the rawData= argument) because the population values were never correctly obtained. But no error would be returned with continuous data, so it took a case like this with categorical data to (only indirectly) reveal that there was a problem somewhere (without much of a clue what the problem was). Essentially, sim() tried to use your analysis model as the population model because nothing was provided to generate=. In 2020 (3baefc0) I already avoided that from happening when rawData= was provided, but it was only catching it for the case when rawData= was a list of pre-generated sample data sets (i.e., when n= and nReps= are NULL). I just added a check for the case when rawData= is specified with n= and nReps=, indicating that rawData= is a single sampling frame to draw subsamples from. So you can install the development version to avoid this bug now.

TDJorgensen commented 5 months ago

This took me 5 hours

I didn't mean to make that sound like a guilt trip. Thanks for uncovering this.

isaactpetersen commented 5 months ago

Thanks for fixing this and sorry it took so long to diagnose! I just checked and verified that it works now. Greatly appreciate your help and your work on this package! Thanks again.