tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/

Restore data frames generically #943

Open hadley opened 4 years ago

hadley commented 4 years ago

See existing work in #812, and see below for a list of functions that we need to consider, with some thoughts on what form of genericity each one needs. The goal is to make sure that data frame extensions return reasonable results in the absence of specific methods, and that all needed functions are generic so that they can be extended when needed.

chop, unchop
pack, unpack
nest, unnest

separate, extract = append_df
hoist = append_df

complete = full_join + replace_na
drop_na = dplyr_row_slice
separate_rows = str_split + unchop
uncount = dplyr_row_slice + optional column removal
replace_na = dplyr_col_modify
expand = dplyr_reconstruct

pivot_longer = dplyr_reconstruct
pivot_wider = dplyr_reconstruct

# don't need to update superseded functions
gather, spread
nest_legacy, unnest_legacy
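
As a rough illustration of the "drop_na = dplyr_row_slice" entry above, here is a minimal sketch that routes row removal through dplyr's extension generic. It assumes dplyr >= 1.0.0 is installed. The helper drop_na_sketch() and the pop_data class are hypothetical names made up for this example; this is not tidyr's implementation.

```r
library(dplyr)

# Hypothetical helper (not tidyr's code): drop rows that contain any NA by
# delegating the row subsetting to dplyr's dplyr_row_slice() generic.
drop_na_sketch <- function(data) {
  keep <- stats::complete.cases(data)
  dplyr::dplyr_row_slice(data, which(keep))
}

# A made-up data frame subclass that defines no methods of its own.
pop <- structure(
  data.frame(x = c(1, NA, 3), y = c("a", "b", NA)),
  class = c("pop_data", "data.frame")
)

out <- drop_na_sketch(pop)
class(out)
nrow(out)
```

Because the subsetting goes through a generic, an extension package that needs different behavior could supply its own dplyr_row_slice() or dplyr_reconstruct() method without drop_na itself having to change.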

DavisVaughan commented 2 years ago

Need to consider the sticky column case, like panelr.

Ideally we'd do what dplyr does and simply assume that [ with a single argument i returns a data frame of length length(i), i.e. one with exactly length(i) columns.

I have a feeling that we are going to have to say: if you have sticky columns and a sticky [ method, you'll need to implement an S3 method for this generic specific to your package. Otherwise it should just work.

That would break packages like this (with sticky cols) until they add a method for these operations. But it isn't like it worked right to begin with.
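
To make the single-argument [ assumption concrete, here is a minimal base R sketch. The sticky_df class, its key attribute, and the new_sticky() constructor are hypothetical stand-ins for a panelr-style class; they are not panelr's actual implementation.

```r
# Hypothetical sticky class: the columns named in `key` always come along.
new_sticky <- function(df, key) {
  structure(df, class = c("sticky_df", "data.frame"), key = key)
}

# Handles single-argument `[` only; re-adds the key columns to any selection.
`[.sticky_df` <- function(x, i) {
  key <- attr(x, "key")
  plain <- x
  class(plain) <- "data.frame"          # fall back to the base method
  cols <- union(key, names(plain[i]))   # key columns first, then the request
  new_sticky(plain[cols], key)
}

df <- data.frame(id = c(1, 1, 2), t = c(1, 2, 1), exp = c(3, 4, 30))
sticky <- new_sticky(df, key = c("id", "t"))

i <- "exp"
length(df[i]) == length(i)      # TRUE: plain data frames honour the assumption
length(sticky[i]) == length(i)  # FALSE: the sticky method returns id, t and exp
```

Any generic code that selects columns with x[i] and then expects length(i) columns back gets extra ones for a class like this, which is why such a class would need its own methods.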

library(tidyr)
library(panelr)

data("WageData")
wages <- panel_data(WageData, id = id, wave = t)

wages
#> # Panel data:    4,165 × 14
#> # entities:      id [595]
#> # wave variable: t [1, 2, 3, ... (7 waves)]
#>    id        t   exp   wks   occ   ind south  smsa    ms   fem union    ed   blk
#>    <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 1         1     3    32     0     0     1     0     1     0     0     9     0
#>  2 1         2     4    43     0     0     1     0     1     0     0     9     0
#>  3 1         3     5    40     0     0     1     0     1     0     0     9     0
#>  4 1         4     6    39     0     0     1     0     1     0     0     9     0
#>  5 1         5     7    42     0     1     1     0     1     0     0     9     0
#>  6 1         6     8    35     0     1     1     0     1     0     0     9     0
#>  7 1         7     9    32     0     1     1     0     1     0     0     9     0
#>  8 2         1    30    34     1     0     0     0     1     0     0    11     0
#>  9 2         2    31    27     1     0     0     0     1     0     0    11     0
#> 10 2         3    32    33     1     1     0     0     1     0     1    11     0
#> # … with 4,155 more rows, and 1 more variable: lwage <dbl>

# Sticky cols
wages <- wages["exp"]
wages
#> # Panel data:    4,165 × 3
#> # entities:      id [595]
#> # wave variable: t [1, 2, 3, ... (7 waves)]
#>    id        t   exp
#>    <fct> <dbl> <dbl>
#>  1 1         1     3
#>  2 1         2     4
#>  3 1         3     5
#>  4 1         4     6
#>  5 1         5     7
#>  6 1         6     8
#>  7 1         7     9
#>  8 2         1    30
#>  9 2         2    31
#> 10 2         3    32
#> # … with 4,155 more rows

# Meaning they come along for the ride here
chop(wages, exp)
#> New names:
#> * id -> id...1
#> * t -> t...2
#> * id -> id...3
#> * t -> t...4
#> # A tibble: 4,165 × 5
#>    id...1 t...2      id...3       t...4         exp
#>    <fct>  <dbl> <list<fct>> <list<dbl>> <list<dbl>>
#>  1 1          1         [1]         [1]         [1]
#>  2 1          2         [1]         [1]         [1]
#>  3 1          3         [1]         [1]         [1]
#>  4 1          4         [1]         [1]         [1]
#>  5 1          5         [1]         [1]         [1]
#>  6 1          6         [1]         [1]         [1]
#>  7 1          7         [1]         [1]         [1]
#>  8 2          1         [1]         [1]         [1]
#>  9 2          2         [1]         [1]         [1]
#> 10 2          3         [1]         [1]         [1]
#> # … with 4,155 more rows

# Genericity doesn't really work right
# In theory this should be a panel data frame, but reconstruct_tibble()
# took over since panel_data inherits from grouped_df
tidyr::pack(wages, data = exp)
#> # A tibble: 4,165 × 3
#> # Groups:   id [595]
#>    id        t data$id    $t  $exp
#>    <fct> <dbl> <fct>   <dbl> <dbl>
#>  1 1         1 1           1     3
#>  2 1         2 1           2     4
#>  3 1         3 1           3     5
#>  4 1         4 1           4     6
#>  5 1         5 1           5     7
#>  6 1         6 1           6     8
#>  7 1         7 1           7     9
#>  8 2         1 2           1    30
#>  9 2         2 2           2    31
#> 10 2         3 2           3    32
#> # … with 4,155 more rows

Created on 2021-11-12 by the reprex package (v2.0.1)

hadley commented 2 years ago

Let's kick this down the road again.

DavisVaughan commented 5 days ago

See https://github.com/tidyverse/tidyr/issues/1556 for an example. reconstruct_tibble() drops the class through as_tibble(), which is currently the expected behavior.

iris |>
  dplyr::as_tibble() |>
  structure(class = c("pop_data", "tbl_df", "tbl", "data.frame")) |>
  tidyr::drop_na() |>
  class()
#> [1] "tbl_df"     "tbl"        "data.frame"