trinker / wakefield

Generate random data sets

Multiple time measurements #1

Closed: mrdwab closed this issue 9 years ago

mrdwab commented 9 years ago

It would be nice for those of us who are lazy to have convenient names for repeated measures in a wide format.

Consider:

r_data_frame(
  n = 5,
  id, 
  race, race, race, 
  age, age, age
)
# Source: local data frame [5 x 7]
# 
#   ID     Race   Race.1   Race.2 Age Age.1 Age.2
#1  1 Hispanic    White Hispanic  30    30    32
#2  2    White Hispanic    White  31    20    30
#3  3    White    White    White  26    23    25
#4  4    White    Black    White  20    30    31
#5  5    Asian    White    White  20    28    24

Generally, the preferred form would be to have all "times" identified. Thus, at the very minimum, Race should become Race.0 for balance in the naming scheme.

I know I can just do:

r_data_frame(
  n = 5,
  id, 
  Race_1 = race, Race_2 = race, Race_3 = race, 
  Age_1 = age, Age_2 = age, Age_3 = age
)
# Source: local data frame [5 x 7]
# 
#   ID   Race_1   Race_2 Race_3 Age_1 Age_2 Age_3
#1  1    White Hispanic  White    24    30    23
#2  2 Hispanic    White  Black    32    35    32
#3  3    White    White  White    28    21    25
#4  4    White    White  White    33    22    24
#5  5    White    Black  White    31    30    21

But that's a lot of extra typing :-(


I haven't dug into your code (hence raising an issue and not a pull request), but it's possible that the fix might be something as easy as:

r_data_frame <- function (n, ...) 
{
  out <- r_list(n = n, ...)
  temp <- names(out)
  temp <- ave(temp, temp, FUN = function(x) 
    if (length(x) == 1) x else paste(x, seq_along(x), sep = "_"))
  out <- setNames(data.frame(out, stringsAsFactors = FALSE, 
                             check.names = FALSE), temp)
  dplyr::tbl_df(out)
}
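
The renaming step in the proposal above can be tried on its own, without wakefield installed. The following is a minimal sketch in base R; the `nms` vector is a made-up stand-in for whatever names `r_list()` returns, which is an assumption here rather than something checked against the package.

```r
# Hypothetical column names, standing in for names(r_list(...)) (an assumption)
nms <- c("ID", "Race", "Race", "Race", "Age", "Age", "Age")

# Within each group of identical names, leave singletons untouched and
# suffix repeated names with their position inside the group
renamed <- ave(nms, nms, FUN = function(x) {
  if (length(x) == 1) x else paste(x, seq_along(x), sep = "_")
})

print(renamed)
```

Names that occur only once (here `ID`) pass through unchanged, so only the repeated measures pick up a suffix.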

mrdwab commented 9 years ago

I guess a related feature would be to specify some variables that would be grouped in a balanced way. For instance, ID and state might not be grouped, but "age" and "sex" might be, with two measurements each.

Perhaps there is better syntax, but I'm imagining something like:

r_data_frame(3, id, state, Grouped(2, age, sex))
#   ID          State Age_1 Age_2  Sex_1 Sex_2
# 1  1   Pennsylvania    21    23 Female  Male
# 2  2 South Carolina    29    26 Female  Male
# 3  3        Florida    30    20   Male  Male

Sorry--no pseudocode to achieve this yet :-)

trinker commented 9 years ago

@mrdwab Thanks for the feedback. That gets my brain flowing a bit too. Maybe a repeated measures function that takes a function and the number of times to repeat it and names accordingly. Something like...

Never mind...

As I read on I see that's what you proposed with Grouped :+1:
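
The "function plus number of repeats" idea discussed here could look roughly like the sketch below. It is base R only and is not the package's implementation: `repeat_measure()` and `toy_age()` are hypothetical names made up for illustration.

```r
# Hypothetical helper, not part of wakefield: call `fun` `times` times
# and bind the results into a data frame with suffixed column names
repeat_measure <- function(fun, times, n, name) {
  cols <- lapply(seq_len(times), function(i) fun(n))
  names(cols) <- paste(name, seq_len(times), sep = "_")
  data.frame(cols, stringsAsFactors = FALSE, check.names = FALSE)
}

# Stand-in generator (an assumption): draw n ages between 20 and 35
toy_age <- function(n) sample(20:35, n, replace = TRUE)

set.seed(1)
out <- repeat_measure(toy_age, times = 3, n = 5, name = "Age")
print(names(out))
print(dim(out))
```

Passing the generator as a function, rather than a pre-built vector, is what lets each repeated column get its own independent draw.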

trinker commented 9 years ago

@mrdwab I made the first switch to better-named columns using your fix. It was simple. Thanks for the suggestion.

I also added r_series (like your Grouped, but it forces one function rather than several) and r_dummy (inspired by r_series).

The next step is to get this working within r_list and r_data_frame so they recognize the data.frame outputs and act accordingly. I think it should be fairly straightforward, but I am out of time for the day.

It is already pretty close for r_list, though; it probably just needs to name the element holding the race series "Race" (it currently comes out as X1):

r_list(n=5,
    r_series(race, 4),
    age
)
$X1
Source: local data frame [5 x 4]

    Race_1   Race_2 Race_3   Race_4
1    White    White  White    White
2    White    White  White    White
3    White    White  White Hispanic
4    Black Hispanic  White Hispanic
5 Hispanic    White  White    White

$Age
[1] 32 33 34 26 30

trinker commented 9 years ago

@mrdwab Again thanks for the suggestions. I have added these features, which can be seen demoed in the README.

I've added you as a contributor on the package as well. Great suggestions.

trinker commented 9 years ago

Here's a quickie demo:

if (!require("pacman")) install.packages("pacman"); library(pacman)
p_install_gh("trinker/wakefield"); p_load("wakefield")

r_data_frame(
    n = 5,
    id, 
    race, race, race, 
    age, age, age
)

## Source: local data frame [5 x 7]
## 
##   ID Race_1   Race_2   Race_3 Age_1 Age_2 Age_3
## 1  1  White Hispanic Hispanic    25    21    31
## 2  2  White    White    White    20    35    20
## 3  3  White    White    White    21    26    33
## 4  4  Black Hispanic    White    34    33    33
## 5  5  Black    White    White    21    28    28

r_data_frame(3, 
    id, 
    state, 
    r_series(likert, 4, integer = TRUE),
    r_series(age, 2),
    r_dummy(sex)
)

## Source: local data frame [3 x 10]
## 
##   ID   State Likert_1 Likert_2 Likert_3 Likert_4 Age_1 Age_2 Male Female
## 1  1 Indiana        1        5        1        1    24    33    1      0
## 2  2    Iowa        3        1        2        5    26    29    1      0
## 3  3 Florida        3        3        2        3    21    30    0      1
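
One practical payoff of the consistent `Name_k` suffixes is that base R's `reshape()` can split them into a measure name and a time index when going from wide to long. The sketch below assumes nothing beyond base R; the `wide` data frame simply re-types the `ID`, `Age_1`, and `Age_2` columns from the output above, and the `Time` column name is an arbitrary choice.

```r
# Small wide-format table copied from the Age columns of the demo output
wide <- data.frame(
  ID    = 1:3,
  Age_1 = c(24, 26, 21),
  Age_2 = c(33, 29, 30)
)

# sep = "_" tells reshape() to split "Age_1" into measure "Age" and time 1
long <- reshape(
  wide,
  direction = "long",
  varying   = c("Age_1", "Age_2"),
  sep       = "_",
  idvar     = "ID",
  timevar   = "Time"   # arbitrary name for the time index column
)

print(long)
```

With unbalanced names such as `Age`, `Age.1`, `Age.2`, the first column would not match the pattern, which is the naming concern raised at the top of this issue.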

mrdwab commented 9 years ago

Lookin' good :-) :+1: