trinker / wakefield

Generate random data sets

including "name" in r_data_frame results in error #18

Closed swestenb closed 7 years ago

swestenb commented 7 years ago

The following code produces an error:

library(wakefield)

df = r_data_frame(
  n = 500,
  name,
  id,
  race,
  age,
  sex,
  hour,
  iq,
  height,
  died
)
View(df)

Error Produced: Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'

This code however works just fine:

df = r_data_frame(
  n = 500,
  id,
  race,
  age,
  sex,
  hour,
  iq,
  height,
  died
)
View(df)

swestenb commented 7 years ago

Update: This can be fixed by passing replace=TRUE to name() inside the r_data_frame call. This code works:

df = r_data_frame(
  n = 500,
  name(replace=TRUE),
  id,
  race,
  age,
  sex,
  hour,
  iq,
  height,
  died
)
View(df)

mattsigal commented 7 years ago

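This is because the name vector included in the package has only 331 entries, and name() samples without replacement by default, so n = 500 asks for more unique names than exist. You can either use a smaller n (at most 331) or use name(replace=TRUE).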
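
For anyone who wants to see the underlying behaviour without wakefield, here is a minimal base R sketch. The vector name_pool is a hypothetical stand-in for the package's name vector (only its length of 331 is taken from this thread); the rest is plain sample().

```r
# Hypothetical stand-in for the package's name vector: 331 unique labels.
name_pool <- paste0("name_", seq_len(331))

# Drawing 500 values without replacement fails, because only 331 exist.
err <- tryCatch(
  sample(name_pool, 500),
  error = function(e) conditionMessage(e)
)
print(err)

# Drawing with replacement succeeds, at the cost of repeated values.
drawn <- sample(name_pool, 500, replace = TRUE)
length(drawn)
anyDuplicated(drawn) > 0
```

The trade-off is that replace=TRUE gives up uniqueness: with 500 draws from 331 values, at least some names must repeat.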

trinker commented 7 years ago

@swestenb @mattsigal I'd appreciate a pull request with a longer name vector. This issue comes up often.

mattsigal commented 7 years ago

@trinker, it seems reasonable to expand that list. Looking at the documentation for the dataset, would you prefer the names to be gender-neutral? Then again, looking at the dataset itself, I've never met a female Matthew or Walter.

mattsigal commented 7 years ago

In https://github.com/trinker/wakefield/pull/19 I have provided a much more extensive list of names (length = 95025). These are the unique names found in the babynames package (https://cran.r-project.org/web/packages/babynames/).
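
As a rough illustration of how such a list could be derived (not necessarily how the PR built it), the sketch below pulls the distinct values of the name column from the babynames data frame. It assumes the babynames package is installed and falls back to an empty vector otherwise; name_pool is a hypothetical variable name.

```r
# Assumption: the babynames package is installed. If it is not,
# fall back to an empty character vector so the script still runs.
name_pool <- if (requireNamespace("babynames", quietly = TRUE)) {
  # babynames::babynames has one row per year, sex and name.
  unique(babynames::babynames$name)
} else {
  character(0)
}

# With a large enough pool, 500 unique names can be drawn
# without needing replace = TRUE.
if (length(name_pool) >= 500) {
  picked <- sample(name_pool, 500)
  print(anyDuplicated(picked) == 0)
}
```

The exact number of distinct names depends on the installed babynames version, so no count is asserted here.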

trinker commented 7 years ago

@mattsigal Thanks for the PR! I'm closing this issue now.