trinker / wakefield

Generate random data sets

Multiple response values #7

Open Btibert3 opened 8 years ago

Btibert3 commented 8 years ago

Is it possible to generate multi-select values, where a single id can have more than one response? This could apply to survey research (select all that apply) or to graphs (edges between nodes of a certain type).

Below is a really hacky function that demonstrates this use case in a hard-coded way:

## helper function:  obviously not a production-quality function
build_multi = function(ids) {
  df_data = data.frame()
  for (i in seq_along(ids)) {
    ## randomize how many choices are made
    n_obs = sample(x=1:2, size=1, prob = c(.75, .25))
    ## what are the choices available in the multi-select
    cvals = c("BIZ","ARTS","SCIENCE","HEALTH","OTHER")
    vals = sample(x = cvals,
                  replace = FALSE, 
                  prob = c(.4,.2,.15,.2, .05), 
                  size = n_obs)
    tmp_df = data.frame(id = ids[i],
                        values = vals)
    df_data = dplyr::bind_rows(df_data, tmp_df)
  }
  return(df_data)
}
## generate a set of ids (r_data_frame() and id come from wakefield)
library(wakefield)
my_df = r_data_frame(id = id, n = 100)

Then generate the data:

## return a long dataset of multiselect options for given probabilities (which I hardcoded)
my_long = build_multi(my_df$id)

The code above generates the dataset in the structure I need, but I wasn't sure whether something like this already exists in the package.
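
To make the request more concrete, here is a rough sketch of the same idea with the choices and probabilities passed in as arguments instead of hard-coded. It is only an illustration under assumptions: `r_multi_select` and its argument names are hypothetical and not part of wakefield, the ids are made up with `sprintf()` rather than generated by the package, and it uses base R only.

```r
## Hypothetical helper (NOT a wakefield function): a parameterized variant of
## the loop above that returns one row per (id, selected value) pair.
r_multi_select <- function(ids, choices, choice_prob,
                           n_choices = 1:2, n_prob = c(.75, .25)) {
  stopifnot(length(choices) == length(choice_prob),
            length(n_choices) == length(n_prob),
            max(n_choices) <= length(choices))
  ## draw how many options each id selects; sampling an index avoids
  ## sample()'s special handling of a length-one numeric x
  n_obs <- n_choices[sample.int(length(n_choices), size = length(ids),
                                replace = TRUE, prob = n_prob)]
  ## for each id, draw that many distinct choices (without replacement)
  picks <- lapply(n_obs, function(k) {
    choices[sample.int(length(choices), size = k, prob = choice_prob)]
  })
  data.frame(id = rep(ids, times = n_obs),
             values = unlist(picks),
             stringsAsFactors = FALSE)
}

set.seed(1)  # only so the example is repeatable
example_ids <- sprintf("%03d", 1:100)  # made-up ids for illustration
example_choices <- c("BIZ", "ARTS", "SCIENCE", "HEALTH", "OTHER")
example_long <- r_multi_select(ids = example_ids,
                               choices = example_choices,
                               choice_prob = c(.4, .2, .15, .2, .05))
head(example_long)
```

The result has the same long shape as `build_multi()`: every id appears once or twice, and no id selects the same value twice.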