matherealize / simdata

An R package for simulating data
https://matherealize.github.io/simdata/
7 stars 1 forks source link

Clustered data #5

Open AngelikaGeroldinger opened 3 years ago

AngelikaGeroldinger commented 3 years ago

I was just trying to use simdata to generate data with some of the variables constant within clusters (of fixed size). For my purposes the following function was sufficient (where n_obs is the number of observations, cor_mat is the correlation matrix and clustervar is an integer such that 1: clustervar are the indices of the variables constant within clusters):

cmvtnorm <- function(n_obs=100, cluster_size=4, cor_mat, clustervar=0){
  if (n_obs %% cluster_size != 0) {
    stop("n_obs is not divisble by cluster_size")
  }
  X <- matrix(rnorm(n_obs*ncol(cor_mat)), nrow=n_obs, ncol=ncol(cor_mat))
  if (clustervar>0) {
    X[, 1:clustervar] <- X[rep(1:(n_obs/cluster_size), each=cluster_size), 1:clustervar]
  }
  chol_cor <- chol(cor_mat)
  X <- X %*% chol_cor
  return(X)
}

cor_mat <- cor_from_upper(5,
                          rbind(c(1,2,0.5), c(1,3,0.5),
                                c(2,4,0.5), c(3,5,-0.3),
                                c(4,5,0.5) ))

test <- cmvtnorm(n_obs=100000, cluster_size=4, cor_mat=cor_mat, clustervar=3)

cor(test)
cor_mat

apply(test, 2, sd)
apply(test, 2, mean)
head(test, 20)

Maybe this could be a nice extension?

matherealize commented 3 years ago

Thanks for the suggestion, it definitely looks interesting. In general, a clustered data simulator could be implemented in the package.

I'm unsure about the "general" use-case to implement, however. In your function, the simulated data is not "hierarchical", as far as I understand, i.e. the clusters influence the individual observations only through the "mixed-level" correlation matrix of the initial mvtnorm data. Such a correlation matrix across clusters and individual observations (i.e. mixing levels of the hierarchy) is something I'm unsure how to motivate.

I'd rather implement something more closely resembling hierarchical data, i.e. draw clusters first, then individual observations within a cluster, possibly depending on the cluster parameters. But this seems a lot more involved and has many parameters for which I'm not sure how to provide sensible defaults.

But maybe I'm making it too complicated. Do you have any further insights here?