tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/

nest/unnest pipeline breaks for empty tibbles #1393

Open lschneiderbauer opened 2 years ago

lschneiderbauer commented 2 years ago

When a nest / unnest pipeline like the following is run on a non-empty tibble, everything works as expected:

library(tidyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

tibble(x = c(2, 3), y = 4) %>%
  nest(data = x) %>%
  # usually perform additional calculations here
  unnest(data) %>%
  select(x, y)
#> # A tibble: 2 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     2     4
#> 2     3     4

Created on 2022-09-12 by the reprex package (v2.0.1)

However, when the same pipeline is fed an empty tibble, the nested column is no longer expanded back into its original columns:

library(tidyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

tibble(x = c(2, 3), y = 4) %>%
  filter(1 == 2) %>%    # to simulate an empty tibble
  nest(data = x) %>%
  # usually perform additional calculations here
  unnest(data) %>%
  select(x, y)
#> Error in `select()`:
#> ! Can't subset columns that don't exist.
#> x Column `x` doesn't exist.

tibble(x = c(2, 3), y = 4) %>%
  filter(1 == 2) %>%    # to simulate an empty tibble
  nest(data = x) %>%
  # usually perform additional calculations here
  unnest(data)
#> # A tibble: 0 x 2
#> # ... with 2 variables: y <dbl>, data <???>
#> # i Use `colnames()` to see all variable names

Created on 2022-09-12 by the reprex package (v2.0.1)

Instead of "x", the result still contains the column "data".

This is painful because any pipeline that relies on the nested columns being present after unnest() fails on an empty dataset, so the empty case always needs special treatment.
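To make the cost of that special treatment concrete, here is a minimal sketch of the kind of guard the pipeline currently needs. The wrapper name `run_pipeline` is hypothetical and only for illustration; it is not part of tidyr or dplyr.

```r
library(tidyr)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical wrapper (illustration only): bypass the nest/unnest round trip
# when there are no rows, so that `x` keeps its name and type.
run_pipeline <- function(df) {
  if (nrow(df) == 0) {
    # Nothing to nest; return the expected columns directly.
    return(select(df, x, y))
  }
  df %>%
    nest(data = x) %>%
    # additional calculations would go here
    unnest(data) %>%
    select(x, y)
}

full <- tibble(x = c(2, 3), y = 4)

empty_result <- run_pipeline(filter(full, FALSE))  # zero-row input
full_result <- run_pipeline(full)                  # regular input
```

Note that the early return has to reproduce by hand whatever columns the skipped steps would have created, which is exactly the duplication this issue is about.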

DavisVaughan commented 2 years ago

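In theory this would be fixed by making nest() return a list-of column rather than a plain list. A list-of carries the prototype of its elements, so unnest() can recover the nested columns even when the list has no elements: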

library(tidyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x=c(2,3), y=4) %>%
  filter(1 == 2)

df %>%
  nest(data = x) %>%
  mutate(data = vctrs::as_list_of(data, .ptype = df[0, "x"])) %>%
  unnest(data)
#> # A tibble: 0 × 2
#> # … with 2 variables: y <dbl>, x <dbl>

However, we decided not to do this in this PR: https://github.com/tidyverse/tidyr/pull/1218

Here was a related attempt to do this and assess the revdeps: https://github.com/tidyverse/tidyr/pull/1209#issuecomment-965688948

We decided to wait on https://github.com/r-lib/vctrs/pull/1231, since part of the reason this change would be painful is that list-of type coercion is currently a little too strict for practical usage (that PR would fix 2 of those revdeps). We will probably get that fix into vctrs 0.5.0, and then we can reconsider making nest() return list-ofs to fix this.
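In the meantime, the as_list_of() workaround shown above could be wrapped in a small helper. This is only a sketch: `nest_typed` is a hypothetical name, not a tidyr function, and it assumes the nested column is always called `data` and that `cols` is a character vector of column names.

```r
library(tidyr)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of tidyr): nest `cols` into a list-of column
# named `data` that remembers the prototype of the nested columns.
nest_typed <- function(df, cols) {
  # Zero-row slice of the columns being nested, used as the element prototype.
  ptype <- df[0, cols]
  nested <- nest(df, data = all_of(cols))
  # Same conversion as in the workaround above, applied after nesting.
  nested$data <- vctrs::as_list_of(nested$data, .ptype = ptype)
  nested
}

empty <- tibble(x = c(2, 3), y = 4) %>%
  filter(1 == 2)    # to simulate an empty tibble

result <- empty %>%
  nest_typed("x") %>%
  # usually perform additional calculations here
  unnest(data) %>%
  select(x, y)
```

With this helper the select() step should no longer fail on the empty tibble, because unnest() can read the column `x` from the prototype stored in the list-of.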