lebebr01 / simglm

Simulate regression models
https://simglm.brandonlebeau.org/

How to simulate factors with more than 2 levels #77

Closed: Radibor closed this issue 3 years ago

Radibor commented 4 years ago

Hi, all the examples in the vignette simulate only factors with 2 levels. Is it possible to simulate factors with more than 2 levels?

sim_arguments <- list(
    formula = y ~ 1 + turnover+ type,
    fixed = list(turnover = list(var_type = 'continuous', mean = 10, sd = 3),
                 type = list(var_type = 'factor', levels = c('A','B','C'), prob = c(.2,.6,.2))),
    error = list(variance = 0.1),
    sample_size = 100,
    reg_weights = c(2, 0.01, 0.5),
    outcome_type = 'poisson'
)

dat <- simulate_fixed(data = NULL, sim_arguments) %>%
    simulate_error(sim_arguments) %>%
    generate_response(sim_arguments)

does not work.

Also, I wonder if you could give an example of how to specify reg_weights for factors with more than 2 levels.

SimonKarg commented 3 years ago

Any new developments here? Or potential workarounds? Currently having the same issue :)

lebebr01 commented 3 years ago

Thanks for reaching out. I know where the error is in the code and have a possible fix for this bug, but I haven't had time to dedicate to it recently. I plan to spend significant time on this package soon, to update it in anticipation of another project that will build on it.

I hope to have a fix pushed within the month to GitHub and shortly after to CRAN.

lebebr01 commented 3 years ago

Finally circling back to this: the development version on GitHub should now address this issue, via the commits referenced above. The adjustment is this: when you want to simulate a factor attribute with more than 2 levels, you need to supply (number of categories - 1) terms for it in the reg_weights vector. For example, the following modifies the code from your example so that it generates the data.

library(simglm)
library(magrittr)  # for %>%, in case it is not already attached

sim_arguments <- list(
    formula = y ~ 1 + turnover + type,
    fixed = list(turnover = list(var_type = 'continuous', mean = 10, sd = 3),
                 type = list(var_type = 'factor', levels = c('A','B','C'), prob = c(.2,.6,.2))),
    error = list(variance = 0.1),
    sample_size = 100,
    # a 3-level factor needs 3 - 1 = 2 weights, so 4 weights in total
    reg_weights = c(2, 0.01, 0.5, 0),
    outcome_type = 'poisson'
)

dat <- simulate_fixed(data = NULL, sim_arguments) %>%
    simulate_error(sim_arguments) %>%
    generate_response(sim_arguments)

Notice the extra number in the reg_weights argument (I added a 0, meaning group C would not differ from group A on average).
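To illustrate why a 3-level factor needs two weights, here is a minimal base R sketch of the usual dummy-coding idea. It is not simglm's internal code. The names toy, X and eta are made up for the illustration, and it assumes the weights line up with the columns in the same order that model.matrix() produces under default treatment contrasts.

```r
# Base R illustration only; this is NOT how simglm is implemented internally.
set.seed(1)
n <- 100

# Hypothetical toy data mirroring the example: one continuous predictor
# and one factor with three levels (A is the reference level).
toy <- data.frame(
  turnover = rnorm(n, mean = 10, sd = 3),
  type = factor(
    sample(c("A", "B", "C"), n, replace = TRUE, prob = c(.2, .6, .2)),
    levels = c("A", "B", "C")
  )
)

# Default treatment contrasts expand the 3-level factor into 2 dummy columns.
X <- model.matrix(~ 1 + turnover + type, data = toy)
print(colnames(X))

# One weight per design-matrix column: intercept, turnover, typeB, typeC.
reg_weights <- c(2, 0.01, 0.5, 0)
stopifnot(length(reg_weights) == ncol(X))

# Linear predictor; with a weight of 0 on typeC, a row in group C gets the
# same value as a row in group A with the same turnover.
eta <- drop(X %*% reg_weights)
```

With only three weights, as in the original post, there would be one weight too few for the four columns, which is the mismatch the extra term resolves.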

I plan to add more documentation on this to the vignettes soon.