Thanks for reaching out. Yes, it is possible to generate unbalanced data. You should be able to do something like the following with the newest development version on GitHub. Note that I reduced the sample sizes in the example for testing, but you can modify them in the sample_size simulation argument and in the calls to runif().
The key in the three-level situation is that the level 2 sample size vector has one entry per level 3 cluster, and the level 1 sample size vector has one entry per level 2 cluster, so its length equals the sum of the level 2 vector. Put another way, the first number in "level2_ss" is the number of level 2 clusters that belong to the first level 3 cluster (in the first example below, this is 2). The level 1 sample sizes then give the number of units within each of those two level 2 clusters; in the example below, these are 41 and 13 units respectively.
library(simglm)

set.seed(5)
# number of level 2 clusters within each of the 40 level 3 clusters
level2_ss <- round(runif(40, min = 1, max = 4), 0)
# number of level 1 units within each level 2 cluster
level1_ss <- round(runif(sum(level2_ss), min = 2, max = 50), 0)

# simulation arguments
sim_arguments <- list(
  formula = y ~ 1 + x + (1 | muni_year) + (1 | municipality),
  fixed = list(
    y = list(var_type = 'continuous', mean = 0, sd = 1, var_level = 1),
    x = list(var_type = 'continuous', mean = 0, sd = 1, var_level = 2)),
  reg_weights = c(intercept = 0, x = .05),
  error = list(variance = 1),
  randomeffect = list(var2 = list(variance = 0.002556, var_level = 2),
                      var3 = list(variance = 0.011449, var_level = 3)),
  replications = 10,
  model_fit = list(model_function = "lmer"),
  extract_coefficients = TRUE,
  power = list(alpha = .05),
  sample_size = list(level1 = level1_ss,
                     level2 = level2_ss,
                     level3 = 40))

set.seed(123)
# simulate and view data
fixed_data <- simulate_fixed(data = NULL, sim_arguments)
head(fixed_data, n = 20)
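The nesting rule described above can be checked in base R before running the simulation. This is only a sketch: it rebuilds the two sample size vectors from the example and groups the level 1 sizes by level 3 cluster, assuming (as described above) that the sizes are assigned to clusters in order. The names n_level3, level3_of_level2, level1_by_level3, and n_per_level3 are hypothetical helpers, not part of the package.

```r
# Base R only; the vectors mirror the three-level example above.
set.seed(5)
n_level3 <- 40
level2_ss <- round(runif(n_level3, min = 1, max = 4), 0)
level1_ss <- round(runif(sum(level2_ss), min = 2, max = 50), 0)

# One level 2 count per level 3 cluster, one level 1 size per level 2 cluster.
stopifnot(length(level2_ss) == n_level3,
          length(level1_ss) == sum(level2_ss))

# Hypothetical helper: the level 3 cluster each level 2 cluster belongs to,
# assuming sizes are assigned in order.
level3_of_level2 <- rep(seq_len(n_level3), times = level2_ss)

# Level 1 sizes grouped by level 3 cluster.
level1_by_level3 <- split(level1_ss, level3_of_level2)

# Sizes of the level 2 clusters inside the first level 3 cluster.
level1_by_level3[[1]]

# Total number of level 1 units per level 3 cluster.
n_per_level3 <- vapply(level1_by_level3, sum, numeric(1))
```

If either stopifnot() condition fails, the sample size vectors do not describe a consistent nesting and should be regenerated before they are passed to sample_size.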
This is an example of a two-level simulation unbalanced at level 1.
set.seed(5)
# number of level 1 units within each of the 7 level 2 clusters
level1_ss <- round(runif(7, min = 2, max = 50), 0)

# simulation arguments
sim_arguments <- list(
  formula = y ~ 1 + x + (1 | muni_year),
  fixed = list(
    y = list(var_type = 'continuous', mean = 0, sd = 1, var_level = 1),
    x = list(var_type = 'continuous', mean = 0, sd = 1, var_level = 2)),
  reg_weights = c(intercept = 0, x = .05),
  error = list(variance = 1),
  randomeffect = list(var2 = list(variance = 0.002556, var_level = 2)),
  replications = 10,
  model_fit = list(model_function = "lmer"),
  extract_coefficients = TRUE,
  power = list(alpha = .05),
  sample_size = list(level1 = level1_ss,
                     level2 = 7))

set.seed(123)
# simulate and view data
fixed_data <- simulate_fixed(data = NULL, sim_arguments)
head(fixed_data, n = 20)
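If the cluster sizes should follow particular proportions rather than a uniform range, one option is to draw the level 2 counts with sample() and a prob argument. The sketch below uses the figures mentioned in this thread (400 municipalities, 1 to 4 municipality years each, about 200 to 5000 participants per municipality year). The proportions in p_years are made up for illustration, and n_municipalities and p_years are hypothetical names, not package arguments.

```r
# Base R only; builds the three sample size pieces for a three-level design.
set.seed(5)
n_municipalities <- 400   # level 3 clusters

# Assumed, illustrative proportions of municipalities with 1, 2, 3, or 4
# municipality years; replace with the proportions observed in the real data.
p_years <- c(0.20, 0.30, 0.30, 0.20)
level2_ss <- sample(1:4, size = n_municipalities, replace = TRUE, prob = p_years)

# Participants per municipality year, drawn uniformly as in the examples above.
level1_ss <- round(runif(sum(level2_ss), min = 200, max = 5000), 0)

# Quick checks of the generated design.
table(level2_ss)
summary(level1_ss)

# Same structure as the sample_size argument in the three-level example.
sample_size <- list(level1 = level1_ss,
                    level2 = level2_ss,
                    level3 = n_municipalities)
```

The resulting list has the same shape as the sample_size argument in the three-level example. Note that a uniform draw between 200 and 5000 averages about 2600 participants per municipality year, which gives a much larger total than roughly 550,000 individuals over about 1000 municipality years, so a right-skewed size distribution may be closer to the real data. Smaller values are also easier to work with while testing.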
Hi again, due to a series of unfortunate events, I was not able to follow up the last time I asked this question. Sorry for that. Again, I truly appreciate your work on this package.
To provide a bit of context for what I am trying to achieve: I have a data set consisting of a series of repeated cross-sectional surveys conducted annually (10 years). Each survey consists of a unique set of individuals. The individual responses are nested in year of participation and municipality of participation, yielding the following three-level design: individuals (level 1), nested in municipality years (muni_year; level 2), nested in municipalities (municipality; level 3). Overall, the data consist of approximately 550,000 individuals, 1000 municipality years, and 400 municipalities. However, the design is unbalanced, as municipalities have participated to a varying degree (the number of municipality years ranges from 1 to 4 per municipality), and the number of participants also differs across municipality years (from about 200 to 5000).

For this study, I am trying to simulate the power of a level 2 predictor (x) on a level 1 outcome variable (y), assuming a standardized beta of 0.05. I think I have managed to correctly simulate the power assuming a balanced design (i.e., an equal number of participants per municipality year and an equal number of municipality years per municipality). However, my understanding is that power is also sensitive to whether the design is balanced, so I am hoping to adjust the simulation to better account for this. Is there a way to adjust my code to specify, for instance, the range of the number of participants per level 2 unit, and the range (and proportion) of municipality years (level 2) per municipality (level 3), so that it better aligns with the data I have?