The output of unnesting differs depending on the dimensions of the data frame to be unnested. (For more context on when one might encounter this issue see #197)
library("dplyr")
library("tidyr")
library("purrr")
library("broom")
data <- data_frame(
  name = c("Alex", "Alex", "Alex", "Tim", "Tim", "Tim"),
  year = c(1990, 1991, 1992, 1990, 1991, 1992),
  height = c(160, 165, 170, 120, 134, 150),
  weight = c(50, 52, 53, 48, 48, 52)
)
data
## name year height weight
## (chr) (dbl) (dbl) (dbl)
## 1 Alex 1990 160 50
## 2 Alex 1991 165 52
## 3 Alex 1992 170 53
## 4 Tim 1990 120 48
## 5 Tim 1991 134 48
## 6 Tim 1992 150 52
# nest
data <- nest(data, year, height, weight)
data
## name data
## (chr) (chr)
## 1 Alex <tbl_df [3,3]>
## 2 Tim <tbl_df [3,3]>
Here we build two different linear models (model_A and model_B) that differ in that the former will estimate the intercept and the slope and the latter only the slope.
# build two different linear models
data <- data %>%
  mutate(
    model_A = map(.$data, ~lm(year ~ height, data = .)),
    model_B = map(.$data, ~lm(year ~ height + 0, data = .))
  )
data
## name data model_A model_B
## (chr) (chr) (chr) (chr)
## 1 Alex <tbl_df [3,3]> <S3:lm> <S3:lm>
## 2 Tim <tbl_df [3,3]> <S3:lm> <S3:lm>
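As a quick side check of why the tidied results for the two models end up with different numbers of rows, here is a minimal base-R sketch on Tim's rows only. The names `tim`, `n_coef_A` and `n_coef_B` are introduced here purely for illustration and are not part of the example above.

```r
# Tim's rows from the example data (year and height only)
tim <- data.frame(
  year   = c(1990, 1991, 1992),
  height = c(120, 134, 150)
)

# Same formula as model_A: estimates an intercept and a slope
n_coef_A <- length(coef(lm(year ~ height, data = tim)))

# Same formula as model_B: "+ 0" removes the intercept, leaving only the slope
n_coef_B <- length(coef(lm(year ~ height + 0, data = tim)))

n_coef_A
n_coef_B
```

Since `tidy()` returns one row per coefficient for an `lm` fit, these counts correspond to the row counts of the nested data frames that are unnested later.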
When the data frame is unnested on a particular column, but contains multiple other nested columns, then there seems to be a different behaviour depending on the number of rows of the nested data frames in the column to be unnested.
When the nested data frames have more than one row (tidy_model_A, where we have a slope and an intercept), the other nested columns are dropped. However, when they have only one row (tidy_model_B), the other nested columns are kept.
tidy_model_A <- data %>%
  mutate(tidy = map(model_A, tidy))
tidy_model_A
## name data model_A model_B tidy
## (chr) (chr) (chr) (chr) (chr)
## 1 Alex <tbl_df [3,3]> <S3:lm> <S3:lm> <data.frame [2,5]>
## 2 Tim <tbl_df [3,3]> <S3:lm> <S3:lm> <data.frame [2,5]>
tidy_model_A %>%
  unnest(tidy)
## name term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 Alex (Intercept) 1.958000e+03 9.557251e-12 2.048706e+14 3.107423e-15
## 2 Alex height 2.000000e-01 5.790501e-14 3.453933e+12 1.843174e-13
## 3 Tim (Intercept) 1.982036e+03 3.464698e-01 5.720659e+03 1.112843e-04
## 4 Tim height 6.656805e-02 2.562205e-03 2.598076e+01 2.449142e-02
tidy_model_B <- data %>%
  mutate(tidy = map(model_B, tidy))
tidy_model_B
## name data model_A model_B tidy
## (chr) (chr) (chr) (chr) (chr)
## 1 Alex <tbl_df [3,3]> <S3:lm> <S3:lm> <data.frame [1,5]>
## 2 Tim <tbl_df [3,3]> <S3:lm> <S3:lm> <data.frame [1,5]>
tidy_model_B %>%
  unnest(tidy)
## name data model_A model_B term estimate std.error statistic p.value
## (chr) (chr) (chr) (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 Alex <tbl_df [3,3]> <S3:lm> <S3:lm> height 12.05941 0.2074858 58.12160 0.0002958912
## 2 Tim <tbl_df [3,3]> <S3:lm> <S3:lm> height 14.66374 0.9394218 15.60932 0.0040791367
I wonder whether either the result of unnesting tidy_model_A should be:
tidy_model_A
## name data model_A model_B term estimate std.error statistic p.value
## (chr) (chr) (chr) (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 Alex <tbl_df [3,3]> <S3:lm> <S3:lm> (Intercept) 1.958000e+03 9.557251e-12 2.048706e+14 3.107423e-15
## 2 Alex <tbl_df [3,3]> <S3:lm> <S3:lm> height 2.000000e-01 5.790501e-14 3.453933e+12 1.843174e-13
## 3 Tim <tbl_df [3,3]> <S3:lm> <S3:lm> (Intercept) 1.982036e+03 3.464698e-01 5.720659e+03 1.112843e-04
## 4 Tim <tbl_df [3,3]> <S3:lm> <S3:lm> height 6.656805e-02 2.562205e-03 2.598076e+01 2.449142e-02
Or the result of unnesting tidy_model_B should be:
tidy_model_B
## name term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 Alex height 12.05941 0.2074858 58.12160 0.0002958912
## 2 Tim height 14.66374 0.9394218 15.60932 0.0040791367
Is this behaviour by design? It is often useful to test different models (e.g. one with the intercept fixed), and this difference in behaviour makes it difficult to do so programmatically.
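In the meantime, one possible workaround is to keep only the key column and the column to be unnested before calling `unnest()`, so that no other list-columns are left to be dropped or kept. The sketch below is only an illustration under that assumption: the helper `tidy_by_name()` and the names `heights`, `coef_A` and `coef_B` are hypothetical, and `tibble()` is used here in place of `data_frame()`.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

heights <- tibble(
  name   = c("Alex", "Alex", "Alex", "Tim", "Tim", "Tim"),
  year   = c(1990, 1991, 1992, 1990, 1991, 1992),
  height = c(160, 165, 170, 120, 134, 150)
)

# Hypothetical helper: fit one model per person, tidy it, and keep only
# `name` and `tidy` before unnesting, so the output has the same columns
# regardless of how many rows each tidied model has.
tidy_by_name <- function(df, model_formula) {
  df %>%
    group_by(name) %>%
    nest() %>%
    ungroup() %>%
    mutate(tidy = map(data, ~ tidy(lm(model_formula, data = .x)))) %>%
    select(name, tidy) %>%
    unnest(tidy)
}

coef_A <- tidy_by_name(heights, year ~ height)      # intercept and slope
coef_B <- tidy_by_name(heights, year ~ height + 0)  # slope only
```

With this approach both results have the same set of columns, at the cost of dropping the model objects and the nested data; those would need to be kept in a separate data frame if they are still required.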