tmatta / lsasim

Simulate large scale assessment data

Correlation matrix #14

Closed wleoncio closed 3 years ago

wleoncio commented 3 years ago

0. Setup

I've tested most of the values below. Not all tests are shown in this report; I only included those that produce errors, warnings, or inconsistent results.

cluster_gen_2 <- function(...) {
  cluster_gen(..., verbose = FALSE, calc_weights = FALSE)
}
set.seed(12334)
n1 <- c(3, 6)
n2 <- c(groups = 4, people = 2)
n3 <- c(school = 3, class = 2, student = 5)
n4 <- c(20, 50)
n5 <- list(school = 3, class = c(2, 1, 3), student = c(20, 20, 10, 30, 30, 30))
n5a <- list(school = 3, class = c(2, 3, 3), student = c(20, 20, 10, 30, 30, 30))
n6 <- list(school = 3, class = c(2, 1, 3), student = ranges(10, 50))
n6a <- list(school = 3, class = c(2, 3, 3), student = ranges(10, 50))
n7 <- list(school = 10, student = ranges(10, 50))
n8 <- list(school = 3, student = c(20, 20, 10))
n8a <- list(school = 3, class = c(2, 2, 2), student = c(20, 20, 10))
n8b <- list(school = 3, class = c(2, 3, 3), student = c(20, 20, 10, 5))
n8c <- list(school = 3, class = c(2, 1, 3), student = c(20, 20, 10))
n9 <- list(school = 10, class = c(2,1,3,1,1,1,2,1,2,1), student = ranges(10, 50))
n10 <- list(country = 2, school = 10, class = c(2,1,3,1,1,1,2,1,2,1), student = ranges(10, 50))
n11 <- list(culture = 2, country = 2, school = 10, class = c(2,1,3,1,1,1,2,1,2,1), student = ranges(10, 50))
n12 <- list(culture = 2, country = 2, district = 3, school = 10, class = c(2,1,3,1,1,1,2,1,2,1), student = ranges(10, 50))
N1 <- c(100, 20)

4. Correlation matrix

Overall suggestions

  1. [x] When “cor_matrix” is specified, the default variances are not explained.

  2. [x] When “cor_matrix” is set on its own, the number of variables is taken directly from its number of rows. However, when “cor_matrix” is set together with other arguments (“c_mean”, “sigma”), the number of variables is not set automatically. The number of continuous and categorical variables (n_X and n_W) must then be specified, otherwise there will be an error.

Error and warning messages

  1. [x] Error: Improper correlation matrix. Make sure all eigenvalues are non-negative

  2. [x] Error: n_X + n_W + theta must not be different from ncol(cor_matrix). The former add up to 8, whereas the latter equals 3. Maybe the manual could explain this error message in more detail.

  3. [x] Setting the correlation matrix to m3 leads to the following error: Error in n_W[[l]] : subscript out of bounds. When I change the data structure to a simpler one, it works. The cor_matrix seems to be incompatible with multi-level sample designs.

  4. [x] For detailed error messages, see working log 0824-0831

set.seed(12334)

#Why?
m2 <- matrix(c(1, 0.05, 0.8,
               0.05, 1, 0.77,
               0.8, 0.77, 1), 3, 3)
c2 <- cluster_gen_2(n5, cor_matrix = m2)
## Error: Improper correlation matrix. Make sure all eigenvalues are non-negative
#Improper correlation matrix. Make sure all eigenvalues are non-negative

c2_1 <- cluster_gen_2(n5, n_X=3, cor_matrix = m2)
## Error: n_X + n_W + theta must not be different from ncol(cor_matrix). The former add up to 8, whereas the latter equals 3
#Error: n_X + n_W + theta must not be different from ncol(cor_matrix). The former add up to 11, whereas the latter equals 3

c2_2 <- cluster_gen_2(n5, n_X=3, n_W=0, cor_matrix = m2)
## Error: Improper correlation matrix. Make sure all eigenvalues are non-negative
#Error: Improper correlation matrix. Make sure all eigenvalues are non-negative

c2_3 <- cluster_gen_2(n5, n_X=2, n_W=1, cor_matrix = m2)
## Error: Improper correlation matrix. Make sure all eigenvalues are non-negative
#Error: Improper correlation matrix. Make sure all eigenvalues are non-negative

#Why?
m3 <- matrix(c(1,0.5,0.8,
               0.5,1,0.77,
               0.8,0.77,1),3,3)
set.seed(12334)
c3 <- cluster_gen_2(n5, cor_matrix = m3)
## Error in n_W[[l]]: subscript out of bounds
c3a <- cluster_gen_2(n4, cor_matrix = m3)
#Error in n_W[[l]] : subscript out of bounds
# same error message for : c3 <- cluster_gen_2(n3, cor_matrix = m3)

m4 <- matrix(c(1,0.12,0.1,
             0.12,1,0.11,
             0.1,0.11,1),3,3)
set.seed(12334)
c4 <- cluster_gen_2(n4, cor_matrix = m4)
summarize_clusters(c4)
## ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Summary statistics for all schools
##        q1           q2      q3
##  Min.   :-3.24771   1:409   1:295
##  Mean   :-0.01711   2:245   2:303
##  Max.   : 2.81883   3:175   3:163
##                     4:109   4:148
##  Stddev.: 1.01      5: 62   5: 91
##
##                     Prop.   Prop.
##                     1:0.409 1:0.295
##                     2:0.245 2:0.303
##                     3:0.175 3:0.163
##                     4:0.109 4:0.148
##                     5:0.062 5:0.091
##
##
##
##  Heterogeneous correlation matrix
## Warning in log(P): NaNs produced

## Warning in log(P): NaNs produced
##            q1         q2         q3
## q1 1.00000000 0.06946698 0.10336780
## q2 0.06946698 1.00000000 0.02009027
## q3 0.10336780 0.02009027 1.00000000
#0.07 0.1 0.02

set.seed(12334)
c4_1 <- cluster_gen_2(n4, n_W=1, cor_matrix = m4)
summarize_clusters(c4_1)
## ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Summary statistics for all schools
##        q1                  q2           q3
##  Min.   :-3.260349   Min.   :-2.87607   1:395
##  Mean   : 0.008743   Mean   :-0.01169   2:270
##  Max.   : 2.746657   Max.   : 2.97718   3:173
##                                         4:104
##  Stddev.: 1.03       Stddev.: 0.99      5: 58
##
##                                         Prop.
##                                         1:0.395
##                                         2:0.27
##                                         3:0.173
##                                         4:0.104
##                                         5:0.058
##
##
##
##  Heterogeneous correlation matrix
##           q1        q2        q3
## q1 1.0000000 0.1212883 0.1090388
## q2 0.1212883 1.0000000 0.1013691
## q3 0.1090388 0.1013691 1.0000000
# in total 3 variables with 1 categorical variable

m5 <- matrix(c(1, 0.55, 0.75, 0.3,
               0.55, 1, 0.15, 0.9,
               0.75, 0.15, 1, 0.25,
               0.3, 0.9, 0.25, 1), 4, 4)
set.seed(12334)
c5 <- cluster_gen_2(n6a, cor_matrix = m5)
## Error: Improper correlation matrix. Make sure all eigenvalues are non-negative
#Error: Improper correlation matrix. Make sure all eigenvalues are non-negative

wleoncio commented 3 years ago

  1. Error: Improper correlation matrix. Make sure all eigenvalues are non-negative

Addressed on f2ecbdca2542cdbea9fb0941a9254bdb9e6b72eb and 178f0f0e3871b26879012c3ac806b05aef6ab1f3.
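
Whether a given matrix really deserves this error can be checked independently of lsasim. The following is a minimal base-R sketch, not the package's internal check: the helper name `is_proper_cor` and its tolerance are assumptions made for illustration. It tests symmetry, a unit diagonal, and non-negative eigenvalues for m2, m3 and m5 from the report above.

```r
# Hypothetical helper (not part of lsasim): TRUE if `m` is square and
# symmetric, has a unit diagonal, and has no eigenvalue below -tol.
is_proper_cor <- function(m, tol = 1e-8) {
  if (!is.matrix(m) || nrow(m) != ncol(m)) return(FALSE)
  if (!isSymmetric(m)) return(FALSE)
  if (any(abs(diag(m) - 1) > tol)) return(FALSE)
  ev <- eigen(m, symmetric = TRUE, only.values = TRUE)$values
  all(ev >= -tol)
}

# Matrices copied from the report
m2 <- matrix(c(1, 0.05, 0.8,
               0.05, 1, 0.77,
               0.8, 0.77, 1), 3, 3)
m3 <- matrix(c(1, 0.5, 0.8,
               0.5, 1, 0.77,
               0.8, 0.77, 1), 3, 3)
m5 <- matrix(c(1, 0.55, 0.75, 0.3,
               0.55, 1, 0.15, 0.9,
               0.75, 0.15, 1, 0.25,
               0.3, 0.9, 0.25, 1), 4, 4)

checks <- c(m2 = is_proper_cor(m2),
            m3 = is_proper_cor(m3),
            m5 = is_proper_cor(m5))
print(checks)
```

Under this check, m2 and m5 each have a negative eigenvalue (their determinants are negative), so the "Improper correlation matrix" error is legitimate for them. m3 passes, which is consistent with its failure being a separate problem.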

  1. When “cor_matrix” is specified, the default variances are not explained.
  2. When “cor_matrix” is set on its own, the number of variables is taken directly from its number of rows. However, when “cor_matrix” is set together with other arguments (“c_mean”, “sigma”), the number of variables is not set automatically. The number of continuous and categorical variables (n_X and n_W) must then be specified, otherwise there will be an error.
  3. Error: n_X + n_W + theta must not be different from ncol(cor_matrix). The former add up to 8, whereas the latter equals 3. Maybe the manual could explain this error message in more detail.

Addressed on 3984c8a5bf725c297903214a10f64ac3a53ec514.
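
The rule behind the third message can be illustrated with a small base-R sketch. The function `check_cor_dims` below is hypothetical and is not lsasim's own validation code; it only mirrors the arithmetic stated in the error message. Treating `n_W` as a plain count and `theta` as a logical that contributes one column when TRUE are both assumptions.

```r
# Hypothetical check (not lsasim's implementation): the variable counts
# must add up to the number of columns of the correlation matrix.
check_cor_dims <- function(n_X, n_W, theta = FALSE, cor_matrix) {
  total <- n_X + n_W + as.integer(theta)
  if (total != ncol(cor_matrix)) {
    stop("n_X + n_W + theta adds up to ", total,
         ", but ncol(cor_matrix) is ", ncol(cor_matrix), call. = FALSE)
  }
  invisible(TRUE)
}

m <- diag(3)  # placeholder 3 x 3 correlation matrix

# Counts that match the matrix pass silently
check_cor_dims(n_X = 2, n_W = 1, cor_matrix = m)

# Counts that do not match raise an error; capture its message
ok <- tryCatch(check_cor_dims(n_X = 3, n_W = 5, cor_matrix = m),
               error = function(e) conditionMessage(e))
print(ok)
```

In the report, passing only n_X = 3 produced totals of 8 and 11 on different runs, which suggests that an unspecified n_W was filled in by a default rather than inferred from the matrix. Specifying both counts, as in c2_2 and c2_3, avoids this mismatch.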

Setting the correlation matrix to m3 leads to the following error: Error in n_W[[l]] : subscript out of bounds. When I change the data structure to a simpler one, it works. The cor_matrix seems to be incompatible with multi-level sample designs.

Bug fixed on ccb9f5a981ff23055d77257eb6c0fc33da899d04 to allow for multilevel compatibility.