mjskay / tidybayes

Bayesian analysis + tidy data + geoms (R package)
http://mjskay.github.io/tidybayes
GNU General Public License v3.0

compose_data throws an error for nested data #159

Closed kalealex closed 5 years ago

kalealex commented 5 years ago

When trying to prepare a dataframe with a nested column, the tidybayes function compose_data throws an error. I'm guessing this means that compose_data is not set up to handle nested data which would be read into multidimensional data types in Stan. For example, lists of length m nested inside each of n rows of a dataframe might feed into any of the following data structures.

// possible ways of declaring n * m data in Stan (alternatives; declare only one)
data {
  matrix[n, m] x;
  real x[n, m];
  vector[m] x[n];
  row_vector[m] x[n];
}
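To make the mapping concrete, here is a minimal base R sketch (it does not use tidybayes) of how a list of n vectors of length m could be reshaped by hand into an n x m matrix suitable for `matrix[n, m] x`. The names `dots`, `x_mat`, and `stan_data` and the values are made up for illustration.

```r
# Hypothetical nested data: n = 3 rows, each holding a vector of m = 4 values
dots <- list(c(1, 2, 3, 4), c(5, 6, 7, 8), c(9, 10, 11, 12))

# A rectangular Stan type only makes sense if every row has the same length
stopifnot(length(unique(lengths(dots))) == 1)

# Stack the vectors as rows: one matrix row per list element
x_mat <- do.call(rbind, dots)

n <- nrow(x_mat)
m <- ncol(x_mat)

# A plain named list of the kind that is typically passed to Stan as data
stan_data <- list(n = n, m = m, x = x_mat)
str(stan_data)
```

This is only a manual workaround sketch; the question in this issue is whether compose_data should do something equivalent automatically.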

Here's an illustrative example with some fake data.

Imagine an experiment where an observer is shown a set of dots arranged on a number line and asked to eyeball their mean. In this simple experiment, our independent variable is the position of each dot on each trial (i.e., two dimensions: trials * dots), and our response variable is the participant's response on each trial (i.e., one dimension: trials). We can simulate responses as the mean of the dots on each trial with random noise added.

# create fake data
n_trials <- 100
n_dots <- 50
trial_mean <- c()
dots <- c()
resp <- c()
for (i in 1:n_trials) {
  trial_mean[i] <- runif(1, -5, 5)                  # the ground truth mean position for the dots on each trial
  dots[i] <- list(trial_mean[i] + rnorm(n_dots, 0, 2)) # simulate a list of n_dots dot positions on each trial
  resp[i] <- mean(dots[[i]]) + rnorm(1, 0, 0.1)     # simulate noisy mean judgment on each trial
}
# turn into tibble
df <- tibble::tibble(trial_mean, dots, resp)  # tibble() replaces the deprecated data_frame()

Now we call the tidybayes function compose_data to prepare the data for modeling in Stan.

library(tidybayes)
compose_data(df)

However, this throws the following error.

Error in is.list(val) : argument is of length zero

Should compose_data be able to handle nested data like this?

I think this might be related to issue #157, but my example and description of the problem are a little clearer. Please let me know if you have any questions about this issue or the example provided.

mjskay commented 5 years ago

This should be fixed on the dev branch now (devtools::install_github("mjskay/tidybayes", ref = "dev")). Here are some minimal examples that should work with x as defined in the model you suggested:

As a list column

This example uses the format you suggested:

df = tibble(
  x = list(1:5, 2:6)
)

df
# A tibble: 2 x 1
  x        
  <list>   
1 <int [5]>
2 <int [5]>

Which now does what is expected with compose_data:

df %>%
  compose_data()
$x
$x[[1]]
[1] 1 2 3 4 5

$x[[2]]
[1] 2 3 4 5 6

$n
[1] 2

Note that the number of columns (m) is not provided automatically because there's no sensible rule for automatically determining what that variable should be named. But you can provide it easily by relying on the fact that additional variables passed to compose_data can refer to previously-defined variables, including those automatically generated by compose_data itself. Thus you can simply define m in terms of the number of elements in the first row of x (since the array in this case has the same number of elements in each row):

df %>%
  compose_data(m = length(x[[1]]))
$x
$x[[1]]
[1] 1 2 3 4 5

$x[[2]]
[1] 2 3 4 5 6

$n
[1] 2

$m
[1] 5
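
Because `m = length(x[[1]])` only looks at the first row, it may be worth confirming that the list column is not ragged before relying on it. Below is a base R sketch of such a check; `check_row_length` is a hypothetical helper written for this example, not a tidybayes function.

```r
# Hypothetical helper: return the common row length of a list column,
# or stop with an error if the rows differ in length
check_row_length <- function(x) {
  row_lengths <- lengths(x)
  if (length(unique(row_lengths)) != 1) {
    stop("list column is ragged: row lengths are ",
         paste(row_lengths, collapse = ", "))
  }
  row_lengths[[1]]
}

# Same list as in the example above: both rows have 5 elements
x <- list(1:5, 2:6)
m <- check_row_length(x)

# A ragged list is rejected; capture the error message instead of stopping
ragged <- list(1:5, 2:4)
result <- tryCatch(check_row_length(ragged),
                   error = function(e) conditionMessage(e))
```

Assuming the helper is defined in the calling environment, it could presumably be used in place of `length(x[[1]])` in the compose_data call above.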

As a matrix column

The other option would be to define x as a matrix column. Matrix columns have the restriction that they must have the same number of rows as the data frame, and could be thought of as defining "sub-columns" within the larger data frame. The analog to the above example would be this:

df = tibble(
  x = t(matrix(c(1:5, 2:6), ncol = 2))
)

df
# A tibble: 2 x 1
  x[,1]  [,2]  [,3]  [,4]  [,5]
  <int> <int> <int> <int> <int>
1     1     2     3     4     5
2     2     3     4     5     6

Again, compose_data does what is expected but cannot auto-generate the column count m:

df %>%
  compose_data()
$x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    3    4    5    6

$n
[1] 2

But we can define it directly in terms of x:

df %>%
  compose_data(m = ncol(x))
$x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    3    4    5    6

$n
[1] 2

$m
[1] 5
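
For anyone unfamiliar with matrix columns, the row-count restriction mentioned above can also be seen with a plain base R data frame. This is only an illustrative sketch; the column names `trial` and `y` are made up, and `rbind(1:5, 2:6)` is used as a shorter way to build the same 2 x 5 values.

```r
# Build the 2 x 5 matrix row by row; same values as t(matrix(c(1:5, 2:6), ncol = 2))
x_mat <- rbind(1:5, 2:6)
stopifnot(all(x_mat == t(matrix(c(1:5, 2:6), ncol = 2))))

# Attach it as a single matrix column of a base data frame with 2 rows
df <- data.frame(trial = 1:2)
df$x <- x_mat

nrow(df)    # number of rows of the data frame
dim(df$x)   # the matrix column keeps its own dimensions
ncol(df)    # the matrix counts as one column alongside trial

# A matrix with more rows than the data frame is rejected
bad <- tryCatch(
  {
    df$y <- matrix(1:15, nrow = 3)
    "accepted"
  },
  error = function(e) "rejected"
)
bad
```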

Let me know if that does what you need or if there's anything else that might help with this use case.