lebebr01 / simglm

Simulate regression models
https://simglm.brandonlebeau.org/

Simulating data with balanced factor levels #106

Closed wjhopper closed 1 year ago

wjhopper commented 1 year ago

I suppose this is less of an issue than a question, but is it possible to simulate data from a balanced design? For instance, one of the examples in the "Tidy Simulation with simglm" vignette shows how to simulate a binary categorical variable for sex, but the simulated data end up with 8 observations in the female category and 2 observations in the male category.

library(simglm)
set.seed(321) 

sim_arguments <- list(
  formula = y ~ 1 + weight + age + sex,
  fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
               age = list(var_type = 'ordinal', levels = 30:60),
               sex = list(var_type = 'factor', levels = c('male', 'female'))),
  sample_size = 10
)

simulate_fixed(data = NULL, sim_arguments)
##    X.Intercept.   weight age sex_1    sex level1_id
## 1             1 231.1471  44     0 female         1
## 2             1 158.6388  38     0 female         2
## 3             1 171.6605  31     0 female         3
## 4             1 176.4105  40     0 female         4
## 5             1 176.2812  60     1   male         5
## 6             1 188.0455  47     0 female         6
## 7             1 201.8052  33     0 female         7
## 8             1 186.9941  43     1   male         8
## 9             1 190.1734  52     0 female         9
## 10            1 163.4426  31     0 female        10

Is it possible to force the simulated data to have 5 observations in the female category, and 5 observations in the male category?

lebebr01 commented 1 year ago

Thanks for reaching out and trying the package. I had not implemented this; I have generally taken the approach that the sample sizes for the two groups would be equal in proportion, which in my experience is what is typically found in practice.

Several people have asked for this, so I implemented a new argument for factor attribute simulation, force_equal = TRUE. It defaults to FALSE so that existing code keeps its current behavior. Here is the example you posted from the vignette with the new argument added:


library(simglm)
set.seed(321) 

sim_arguments <- list(
  formula = y ~ 1 + weight + age + sex,
  fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
               age = list(var_type = 'ordinal', levels = 30:60),
               sex = list(var_type = 'factor', levels = c('male', 'female'),
                          force_equal = TRUE  # this is the new argument
                          )),
  sample_size = 10
)
simulate_fixed(data = NULL, sim_arguments)
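
For readers curious about what a balanced draw means, here is a minimal base R sketch of the idea. It is not simglm's internal implementation, and `balanced_factor` is a hypothetical helper written only for illustration: it repeats the levels to the requested length and then shuffles them, so each level appears equally often whenever the sample size is a multiple of the number of levels.

```r
# Hypothetical helper (not part of simglm): draw a factor whose levels
# appear as evenly as possible across n observations.
balanced_factor <- function(levels, n) {
  # Repeat the levels until there are n values; when n is not a multiple of
  # length(levels), the earlier levels receive one extra observation.
  values <- rep(levels, length.out = n)
  # Shuffle so that the assignment of levels to rows is random.
  factor(sample(values), levels = levels)
}

set.seed(321)
sex <- balanced_factor(c("male", "female"), n = 10)

# Tabulate the draws; with n = 10 and two levels, each level occurs 5 times.
counts <- table(sex)
print(counts)
```

The same kind of check can be run on the simglm output above by passing its sex column to table().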

wjhopper commented 1 year ago

Wonderful, thanks for adding this functionality! Might I suggest updating the vignettes to make users aware of this new functionality? I've prepared a pull request (#107) to do this if you think it's a good addition.