tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/

Unnesting behaviour depends on dimension of data frame to be unnested #198

Closed cnjr2 closed 8 years ago

cnjr2 commented 8 years ago

The output of unnesting differs depending on the dimensions of the data frame to be unnested. (For more context on when one might encounter this issue see #197)

library("dplyr")
library("tidyr")
library("purrr")
library("broom")

data <- data_frame(
    name = c("Alex", "Alex", "Alex", "Tim", "Tim", "Tim"),
    year = c(1990, 1991, 1992, 1990, 1991, 1992),
    height = c(160, 165, 170, 120, 134, 150),
    weight = c(50, 52, 53, 48, 48, 52)
)

data
## name  year height weight
## (chr) (dbl)  (dbl)  (dbl)
## 1  Alex  1990    160     50
## 2  Alex  1991    165     52
## 3  Alex  1992    170     53
## 4   Tim  1990    120     48
## 5   Tim  1991    134     48
## 6   Tim  1992    150     52

# nest
data <- nest(data, year, height, weight)

data
## name           data
## (chr)          (chr)
## 1  Alex <tbl_df [3,3]>
## 2   Tim <tbl_df [3,3]>

Here we build two linear models, model_A and model_B. The former estimates both the intercept and the slope; the latter (with + 0 in the formula) estimates only the slope.

# build two different linear models
data <- data %>%
  mutate(
    model_A = map(.$data, ~lm(year ~ height, data = .)),
    model_B = map(.$data, ~lm(year ~ height + 0, data = .))
  )

data
## name           data model_A model_B
## (chr)          (chr)   (chr)   (chr)
## 1  Alex <tbl_df [3,3]> <S3:lm> <S3:lm>
## 2   Tim <tbl_df [3,3]> <S3:lm> <S3:lm>

When a data frame is unnested on one column but also contains other nested columns, the behaviour differs depending on the number of rows in the nested data frames of the column being unnested.

If those data frames have more than one row (tidy_model_A, which has a slope and an intercept), the other nested columns are dropped. If they have only one row (tidy_model_B), the other nested columns are kept.


tidy_model_A <- data %>%
  mutate(tidy = map(model_A, tidy))

tidy_model_A
##   name           data model_A model_B               tidy
##  (chr)          (chr)   (chr)   (chr)              (chr)
## 1  Alex <tbl_df [3,3]> <S3:lm> <S3:lm> <data.frame [2,5]>
## 2   Tim <tbl_df [3,3]> <S3:lm> <S3:lm> <data.frame [2,5]>

tidy_model_A %>%
  unnest(tidy)
##   name        term     estimate    std.error    statistic      p.value
##  (chr)       (chr)        (dbl)        (dbl)        (dbl)        (dbl)
## 1  Alex (Intercept) 1.958000e+03 9.557251e-12 2.048706e+14 3.107423e-15
## 2  Alex      height 2.000000e-01 5.790501e-14 3.453933e+12 1.843174e-13
## 3   Tim (Intercept) 1.982036e+03 3.464698e-01 5.720659e+03 1.112843e-04
## 4   Tim      height 6.656805e-02 2.562205e-03 2.598076e+01 2.449142e-02

tidy_model_B <- data %>%
  mutate(tidy = map(model_B, tidy))

tidy_model_B
## name           data model_A model_B               tidy
## (chr)          (chr)   (chr)   (chr)              (chr)
## 1  Alex <tbl_df [3,3]> <S3:lm> <S3:lm> <data.frame [1,5]>
## 2   Tim <tbl_df [3,3]> <S3:lm> <S3:lm> <data.frame [1,5]>

tidy_model_B %>%
  unnest(tidy)
## name           data model_A model_B   term estimate std.error statistic      p.value
## (chr)          (chr)   (chr)   (chr)  (chr)    (dbl)     (dbl)     (dbl)        (dbl)
## 1  Alex <tbl_df [3,3]> <S3:lm> <S3:lm> height 12.05941 0.2074858  58.12160 0.0002958912
## 2   Tim <tbl_df [3,3]> <S3:lm> <S3:lm> height 14.66374 0.9394218  15.60932 0.0040791367

I wonder whether either unnesting tidy_model_A should give:


tidy_model_A
##   name          data  model_A model_B        term     estimate    std.error    statistic      p.value
##  (chr)          (chr)    (chr)   (chr)      (chr)        (dbl)        (dbl)        (dbl)        (dbl)
## 1  Alex <tbl_df [3,3]> <S3:lm> <S3:lm> (Intercept) 1.958000e+03 9.557251e-12 2.048706e+14 3.107423e-15
## 2  Alex <tbl_df [3,3]> <S3:lm> <S3:lm>      height 2.000000e-01 5.790501e-14 3.453933e+12 1.843174e-13
## 3   Tim <tbl_df [3,3]> <S3:lm> <S3:lm> (Intercept) 1.982036e+03 3.464698e-01 5.720659e+03 1.112843e-04
## 4   Tim <tbl_df [3,3]> <S3:lm> <S3:lm>      height 6.656805e-02 2.562205e-03 2.598076e+01 2.449142e-02

Or unnesting tidy_model_B should give:


tidy_model_B
## name    term estimate std.error statistic      p.value
## (chr)   (chr)    (dbl)     (dbl)     (dbl)        (dbl)
## 1  Alex  height 12.05941 0.2074858  58.12160 0.0002958912
## 2   Tim  height 14.66374 0.9394218  15.60932 0.0040791367

Is this behaviour by design? It is sometimes useful to compare different models (e.g. with and without a fixed intercept), and this difference in behaviour makes that difficult to do programmatically.

hadley commented 8 years ago

See the .drop argument - this is a deliberate heuristic that I think will mostly be helpful.
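For readers following along, here is a minimal sketch of how the .drop argument might be used to make the result independent of the number of rows. It assumes a tidyr release from around the time of this thread, where unnest() accepts .drop; newer releases may deprecate or ignore the argument. The toy data frame and its column other are hypothetical stand-ins for the nested model columns above, and the estimates are illustrative values only.

```r
library(tibble)
library(tidyr)

# Toy stand-in for tidy_model_A: `tidy` holds two-row data frames and
# `other` is a hypothetical extra list column that is not being unnested.
df <- tibble(
  name  = c("Alex", "Tim"),
  other = list(1:3, 4:6),
  tidy  = list(
    tibble(term = c("(Intercept)", "height"), estimate = c(1958, 0.2)),
    tibble(term = c("(Intercept)", "height"), estimate = c(1982.036, 0.0666))
  )
)

# Assumption: this tidyr version honours `.drop`.
# .drop = FALSE asks unnest() to keep the other list columns,
# repeating them for every unnested row.
kept <- unnest(df, tidy, .drop = FALSE)

# .drop = TRUE asks unnest() to drop the other list columns,
# whether or not rows have to be duplicated.
dropped <- unnest(df, tidy, .drop = TRUE)

names(kept)
names(dropped)
```

Setting .drop explicitly, rather than relying on the default heuristic, should give the same set of columns for both tidy_model_A and tidy_model_B, which is what a programmatic comparison of models needs.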

cnjr2 commented 8 years ago

perfect, thanks!