tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/

variable behaviour when unnesting tables containing multiple nested columns #197

Closed cnjr2 closed 8 years ago

cnjr2 commented 8 years ago

This issue could perhaps be split into two related parts; both concern nested tables.

The reproducible example below gives some context for where one might encounter this: I try to summarise the output of several linear models that were fitted with different formulas. Perhaps there is a better way to do the same thing?

library("dplyr")
library("tidyr")
library("purrr")
library("broom")

data <- data_frame(
    name = c("Alex", "Alex", "Alex", "Tim", "Tim", "Tim"),
    year = c(1990, 1991, 1992, 1990, 1991, 1992),
    height = c(160, 165, 170, 120, 134, 150),
    weight = c(50, 52, 53, 48, 48, 52)
)

data
## name  year height weight
## (chr) (dbl)  (dbl)  (dbl)
## 1  Alex  1990    160     50
## 2  Alex  1991    165     52
## 3  Alex  1992    170     53
## 4   Tim  1990    120     48
## 5   Tim  1991    134     48
## 6   Tim  1992    150     52

# nest
data <- nest(data, year, height, weight)

data
## name           data
## (chr)          (chr)
## 1  Alex <tbl_df [3,3]>
## 2   Tim <tbl_df [3,3]>

Here we build two different linear models: model_A estimates both the intercept and the slope, whereas model_B estimates only the slope.

# build two different linear models
data <- data %>%
  mutate(
    model_A = map(.$data, ~lm(year ~ height, data = .)),
    model_B = map(.$data, ~lm(year ~ height + 0, data = .))
  )

data
## name           data model_A model_B
## (chr)          (chr)   (chr)   (chr)
## 1  Alex <tbl_df [3,3]> <S3:lm> <S3:lm>
## 2   Tim <tbl_df [3,3]> <S3:lm> <S3:lm>

When the data frame contains several nested columns and is unnested on just one of them, the behaviour seems to depend on the number of rows in the nested data frames of the column being unnested.

If they have more than one row (tidy_model_A, where we have a slope and an intercept), the other nested columns are dropped. If they have only one row (tidy_model_B), the other nested columns are kept.

Is this behaviour by design? It is sometimes useful to test different models (e.g. fixing the intercept), and this difference in behaviour makes that difficult to do programmatically.

# here the other nested columns (data, model_A, model_B) are dropped
tidy_model_A <- data %>%
  mutate(tidy = map(model_A, tidy)) %>%
  unnest(tidy)

tidy_model_A
##   name        term     estimate    std.error    statistic      p.value
##  (chr)       (chr)        (dbl)        (dbl)        (dbl)        (dbl)
## 1  Alex (Intercept) 1.958000e+03 9.557251e-12 2.048706e+14 3.107423e-15
## 2  Alex      height 2.000000e-01 5.790501e-14 3.453933e+12 1.843174e-13
## 3   Tim (Intercept) 1.982036e+03 3.464698e-01 5.720659e+03 1.112843e-04
## 4   Tim      height 6.656805e-02 2.562205e-03 2.598076e+01 2.449142e-02

tidy_model_B <- data %>%
  mutate(tidy = map(model_B, tidy)) %>%
  unnest(tidy)

tidy_model_B
## name           data model_A model_B   term estimate std.error statistic      p.value
## (chr)          (chr)   (chr)   (chr)  (chr)    (dbl)     (dbl)     (dbl)        (dbl)
## 1  Alex <tbl_df [3,3]> <S3:lm> <S3:lm> height 12.05941 0.2074858  58.12160 0.0002958912
## 2   Tim <tbl_df [3,3]> <S3:lm> <S3:lm> height 14.66374 0.9394218  15.60932 0.0040791367
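
One possible way to get the same output shape for both models, regardless of how unnest() treats the remaining list columns, is to keep only the key column and the column to be unnested before calling unnest(). The following is a minimal sketch under that assumption, not a statement about the intended behaviour of unnest(). The names heights, models, coefs and tidy_coefs are hypothetical and do not appear in the report; the data is rebuilt with base data.frame() and group_by() plus nest() so that the block is self-contained.

```r
library("dplyr")
library("tidyr")
library("purrr")
library("broom")

# Same toy data as in the report, rebuilt with base R.
heights <- data.frame(
  name = rep(c("Alex", "Tim"), each = 3),
  year = rep(c(1990, 1991, 1992), times = 2),
  height = c(160, 165, 170, 120, 134, 150),
  weight = c(50, 52, 53, 48, 48, 52),
  stringsAsFactors = FALSE
)

# One row per person, remaining columns nested, both models fitted.
models <- heights %>%
  group_by(name) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    model_A = map(data, ~ lm(year ~ height, data = .x)),
    model_B = map(data, ~ lm(year ~ height + 0, data = .x))
  )

# Hypothetical helper: tidy one model column, keep only the key and the
# tidied coefficients, then unnest. Because the other list columns are
# removed explicitly, the result no longer depends on whether unnest()
# would have dropped or kept them.
tidy_coefs <- function(df, model_col) {
  df$coefs <- map(df[[model_col]], broom::tidy)
  df %>%
    select(name, coefs) %>%
    unnest(coefs)
}

tidy_model_A <- tidy_coefs(models, "model_A")
tidy_model_B <- tidy_coefs(models, "model_B")
```

With this approach both results carry the same columns (name plus the columns returned by tidy()), so downstream code can treat them identically. The cost is that data, model_A and model_B are no longer available in the result and would have to be joined back by name if needed.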

For the second part of this issue: when I gather the unnested data frame in which the other nested columns were dropped (tidy_model_A), I can spread it again. When they were not dropped (tidy_model_B), the gather still works but the spread fails.

Is it expected that spreading does not work for data frames with nested columns?

tidy_model_A %>%
  gather(variable, value, estimate, std.error, statistic, p.value) %>%
  unite(variable, term, variable) %>%
  spread(variable, value)
## name (Intercept)_estimate (Intercept)_p.value (Intercept)_statistic (Intercept)_std.error height_estimate height_p.value height_statistic height_std.error
## (chr)                (dbl)               (dbl)                 (dbl)                 (dbl)           (dbl)          (dbl)            (dbl)            (dbl)
## 1  Alex             1958.000        3.107423e-15          2.048706e+14          9.557251e-12      0.20000000   1.843174e-13     3.453933e+12     5.790501e-14
## 2   Tim             1982.036        1.112843e-04          5.720659e+03          3.464698e-01      0.06656805   2.449142e-02     2.598076e+01     2.562205e-03

tidy_model_B %>%
  gather(variable, value, estimate, std.error, statistic, p.value) %>%
  unite(variable, term, variable) %>%
  spread(variable, value)

## Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) :
##   'x' must be atomic

Should this not return something like the following?

tidy_model_B %>%
  gather(variable, value, estimate, std.error, statistic, p.value) %>%
  unite(variable, term, variable) %>%
  spread(variable, value)

## name             data   model_A  model_B height_estimate height_p.value height_statistic height_std.error
## (chr)            (chr)    (chr)    (chr)         (dbl)          (dbl)            (dbl)            (dbl)
## 1  Alex <tbl_df [3,3]>  <S3:lm>  <S3:lm>        12.05941   0.0002958912         58.12160        0.2074858
## 2   Tim <tbl_df [3,3]>  <S3:lm>  <S3:lm>        14.66374   0.0040791367         15.60932        0.9394218
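
Until spread() handles list columns, one way to sidestep the error is to select only the atomic columns before reshaping. The sketch below assumes that losing data and the model column in the wide result is acceptable; it is a workaround, not the output requested above. The names heights, coefs, key and wide_B are hypothetical, and the data is rebuilt so that the block runs on its own.

```r
library("dplyr")
library("tidyr")
library("purrr")
library("broom")

# Same toy data as in the report, rebuilt with base R.
heights <- data.frame(
  name = rep(c("Alex", "Tim"), each = 3),
  year = rep(c(1990, 1991, 1992), times = 2),
  height = c(160, 165, 170, 120, 134, 150),
  weight = c(50, 52, 53, 48, 48, 52),
  stringsAsFactors = FALSE
)

# Fit the slope-only model per person and unnest its tidied coefficients.
tidy_model_B <- heights %>%
  group_by(name) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    model_B = map(data, ~ lm(year ~ height + 0, data = .x)),
    coefs = map(model_B, broom::tidy)
  ) %>%
  unnest(coefs)

# Keep only atomic columns, then gather, unite and spread as in the report.
# Selecting by name works whether or not unnest() kept the list columns.
wide_B <- tidy_model_B %>%
  select(name, term, estimate, std.error, statistic, p.value) %>%
  gather(variable, value, estimate, std.error, statistic, p.value) %>%
  unite(key, term, variable) %>%
  spread(key, value)
```

The result has one row per name and one column per term and statistic combination. If the nested columns are still needed afterwards, they could be joined back onto wide_B by name.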

My sessionInfo() is below:

sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.3 (El Capitan)
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
##
## other attached packages:
## [1] broom_0.4.0 purrr_0.2.1 tidyr_0.4.1 dplyr_0.4.3
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.4     lattice_0.20-33 psych_1.6.4     assertthat_0.1  grid_3.2.3      R6_2.1.2        plyr_1.8.3      nlme_3.1-128    DBI_0.4-1       magrittr_1.5    stringi_1.0-1   lazyeval_0.1.10
## [13] reshape2_1.4.1  tools_3.2.3     stringr_1.0.0   parallel_3.2.3  mnormt_1.5-4

hadley commented 8 years ago

Would you mind refiling this as two separate issues? And for future reference, please don't include the session info unless it's requested.

cnjr2 commented 8 years ago

I have split the issues (now #198 and #199). Thanks!