tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/

`nest(.data, ...)` unquoting support of left hand side in name-variable pairs of `...` #769

Closed philipp-baumann closed 5 years ago

philipp-baumann commented 5 years ago

I find tidyr::nest() very useful and use it quite heavily in my modeling workflows. I mostly do so using wrappers and quasiquotation/tidyeval.

As of tidyr 1.0.0 the .key argument is deprecated and issues a friendly warning. However, I couldn't figure out an equivalent way to capture user input for the name of the resulting nested list-column without .key.

It seems that omitting .key would involve unquoting the left-hand side provided in the new ... interface, for example unquoting data in data = c(x, y, z). The reprex is provided below the text.

It would be neat if nest() supported this in the near future. Are there any plans to implement this, similar to := in dplyr::mutate()? Or is there a workaround that allows wrapping !!!cols into the equivalent of data = c(x, y, z)?

Thanks a lot in advance for some hints. Cheers, Philipp

``` r
library("tidyr")
library("tibble")

df <- tibble(
  a = 1:10,
  b = 1:10,
  c = 1:10, 
  d = list(rep(1:10, 10))
)

nest_wrapper <- function(.data, ..., .key = "data") {
  cols <- rlang::enquos(...)
  .key <- rlang::enquo(.key)

  tidyr::nest(.data = .data, !!!cols, .key = !!.key)
}

nest_wrapper(.data = df, a, b, .key = "refdata")
#> Warning: All elements of `...` must be named.
#> Did you want `refdata = c(a, b)`?
#> # A tibble: 10 x 3
#>        c d                  refdata
#>    <int> <list>      <list<df[,2]>>
#>  1     1 <int [100]>        [1 × 2]
#>  2     2 <int [100]>        [1 × 2]
#>  3     3 <int [100]>        [1 × 2]
#>  4     4 <int [100]>        [1 × 2]
#>  5     5 <int [100]>        [1 × 2]
#>  6     6 <int [100]>        [1 × 2]
#>  7     7 <int [100]>        [1 × 2]
#>  8     8 <int [100]>        [1 × 2]
#>  9     9 <int [100]>        [1 × 2]
#> 10    10 <int [100]>        [1 × 2]

devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       Ubuntu 18.04.2 LTS          
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Europe/Zurich               
#>  date     2019-10-02                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  ! package     * version    date       lib source
#>  P assertthat    0.2.1      2019-03-21 [?] CRAN (R 3.6.0)
#>  P backports     1.1.4      2019-04-10 [?] CRAN (R 3.6.0)
#>  P callr         3.3.1      2019-07-18 [?] CRAN (R 3.6.0)
#>  P cli           1.1.0      2019-03-19 [?] CRAN (R 3.6.0)
#>  P crayon        1.3.4      2017-09-16 [?] CRAN (R 3.6.0)
#>  P desc          1.2.0      2018-05-01 [?] standard (@1.2.0)
#>    devtools      2.1.0      2019-07-06 [1] standard (@2.1.0)
#>  P digest        0.6.20     2019-07-04 [?] CRAN (R 3.6.0)
#>  P dplyr         0.8.3      2019-07-04 [?] CRAN (R 3.6.0)
#>  P evaluate      0.14       2019-05-28 [?] CRAN (R 3.6.0)
#>  P fansi         0.4.0      2018-10-05 [?] CRAN (R 3.6.0)
#>  P fs            1.3.1      2019-05-06 [?] CRAN (R 3.6.0)
#>  P glue          1.3.1      2019-03-12 [?] CRAN (R 3.6.0)
#>  P highr         0.8        2019-03-20 [?] CRAN (R 3.6.0)
#>  P htmltools     0.3.6      2017-04-28 [?] CRAN (R 3.6.0)
#>  P knitr         1.23       2019-05-18 [?] CRAN (R 3.6.0)
#>  P lifecycle     0.1.0      2019-08-01 [?] standard (@0.1.0)
#>  P magrittr      1.5        2014-11-22 [?] CRAN (R 3.6.0)
#>    memoise       1.1.0      2017-04-21 [1] standard (@1.1.0)
#>  P pillar        1.4.2      2019-06-29 [?] CRAN (R 3.6.0)
#>  P pkgbuild      1.0.3      2019-03-20 [?] CRAN (R 3.6.0)
#>  P pkgconfig     2.0.2      2018-08-16 [?] CRAN (R 3.6.0)
#>    pkgload       1.0.2      2018-10-29 [1] standard (@1.0.2)
#>  P prettyunits   1.0.2      2015-07-13 [?] CRAN (R 3.6.0)
#>  P processx      3.4.1      2019-07-18 [?] CRAN (R 3.6.0)
#>  P ps            1.3.0      2018-12-21 [?] CRAN (R 3.6.0)
#>  P purrr         0.3.2      2019-03-15 [?] CRAN (R 3.6.0)
#>  P R6            2.4.0      2019-02-14 [?] CRAN (R 3.6.0)
#>  P Rcpp          1.0.2      2019-07-25 [?] CRAN (R 3.6.0)
#>    remotes       2.1.0      2019-06-24 [1] standard (@2.1.0)
#>  P rlang         0.4.0      2019-06-25 [?] CRAN (R 3.6.0)
#>  P rmarkdown     1.13       2019-05-22 [?] CRAN (R 3.6.0)
#>  P rprojroot     1.3-2      2018-01-03 [?] CRAN (R 3.6.0)
#>    sessioninfo   1.1.1      2018-11-05 [1] standard (@1.1.1)
#>  P stringi       1.4.3      2019-03-12 [?] CRAN (R 3.6.0)
#>  P stringr       1.4.0      2019-02-10 [?] CRAN (R 3.6.0)
#>  P testthat      2.1.1      2019-04-23 [?] CRAN (R 3.6.0)
#>  P tibble      * 2.1.3      2019-06-06 [?] CRAN (R 3.6.0)
#>    tidyr       * 1.0.0.9000 2019-10-02 [1] github (tidyverse/tidyr@e8c3f23)
#>  P tidyselect    0.2.5      2018-10-11 [?] CRAN (R 3.6.0)
#>    usethis       1.5.1      2019-07-04 [1] standard (@1.5.1)
#>  P utf8          1.1.4      2018-05-24 [?] CRAN (R 3.6.0)
#>  P vctrs         0.2.0      2019-07-05 [?] CRAN (R 3.6.0)
#>  P withr         2.1.2      2018-03-15 [?] CRAN (R 3.6.0)
#>  P xfun          0.8        2019-06-25 [?] CRAN (R 3.6.0)
#>  P yaml          2.2.0      2018-07-25 [?] CRAN (R 3.6.0)
#>  P zeallot       0.1.0      2018-01-28 [?] CRAN (R 3.6.0)
#> 
#> [1] /media/ssd/nas-ethz/doktorat/projects/01_spectroscopy/46_swiss-ssl/renv/library/R-3.6/x86_64-pc-linux-gnu
#> [2] /tmp/RtmpOMib5x/renv-system-library
#> [3] /usr/lib/R/library
#> 
#>  P ── Loaded and on-disk path mismatch.
```

Created on 2019-10-02 by the reprex package (v0.3.0)

jennybc commented 5 years ago

You can already use := to allow unquoting on the LHS of the ... constructs in nest():

library(tidyr)

f <- function(df, vars_to_nest, new_col) {
  nest(df, !!new_col := {{ vars_to_nest }})
}

f(iris, -Species, "stuff")
#> # A tibble: 3 x 2
#>   Species             stuff
#>   <fct>      <list<df[,4]>>
#> 1 setosa           [50 × 4]
#> 2 versicolor       [50 × 4]
#> 3 virginica        [50 × 4]

Created on 2019-10-02 by the reprex package (v0.3.0.9000)

Is this 👆 what you want to do?

BTW the new "In packages" vignette has some good background on related issues. It's possible an example of what we're doing in this thread should be added there 🤔
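
For reference, f() as written above should also accept several columns, as long as they arrive as a single tidyselect expression such as c(...). A minimal sketch, assuming tidyr >= 1.0.0 and rlang >= 0.4.0; the column name "measurements" is only illustrative:

```r
library(tidyr)

# f() exactly as defined earlier in this comment
f <- function(df, vars_to_nest, new_col) {
  nest(df, !!new_col := {{ vars_to_nest }})
}

# Several columns are passed as ONE argument by wrapping them in c();
# {{ }} forwards the whole expression to nest()'s tidyselect interface.
out <- f(
  iris,
  c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
  "measurements"  # illustrative name for the new list-column
)

names(out)
```

The remaining column (Species) acts as the grouping variable, so the result has one row per species.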

jennybc commented 5 years ago

If you want your wrapper to "feel" the same as nest()'s ..., then you can also use the "pass the dots" strategy.

This is a silly example but I just wanted to give the wrapper g() some bit of logic besides calling nest():

library(tidyr)

g <- function(df, ...) {
  names(df) <- tolower(names(df))
  nest(df, ...)
}

g(iris, stuff = -species)
#> # A tibble: 3 x 2
#>   species             stuff
#>   <fct>      <list<df[,4]>>
#> 1 setosa           [50 × 4]
#> 2 versicolor       [50 × 4]
#> 3 virginica        [50 × 4]
g(iris, petal = starts_with("petal"), sepal = starts_with("sepal"))
#> # A tibble: 3 x 3
#>   species             petal          sepal
#>   <fct>      <list<df[,2]>> <list<df[,2]>>
#> 1 setosa           [50 × 2]       [50 × 2]
#> 2 versicolor       [50 × 2]       [50 × 2]
#> 3 virginica        [50 × 2]       [50 × 2]

Created on 2019-10-02 by the reprex package (v0.3.0.9000)

philipp-baumann commented 5 years ago

Hi Jenny, thanks for your quick response and the suggestions :+1:

Hmm, the g() solution doesn't really apply here, and f() does not work with multiple columns to (de)select. nest_wrapper() was supposed to support multiple user-supplied columns, nested within a single list-column whose name comes from .key (or new_col, as in your example).

The wrapper I had before did something like what is shown in the "In packages" vignette (thanks for the hint):

library(tidyr)

nest_egg <- function(data, cols) {
  nest(data, egg = one_of(cols))
}

nest_egg(iris, c("Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width"))
#> # A tibble: 3 x 2
#>   Species               egg
#>   <fct>      <list<df[,4]>>
#> 1 setosa           [50 × 4]
#> 2 versicolor       [50 × 4]
#> 3 virginica        [50 × 4]

Created on 2019-10-02 by the reprex package (v0.3.0)

In contrast to the example above, nest_wrapper() does not require double quotes, and I have a lot of columns with chemical reference values to nest (it's always nice to save some typing) ;-). In the example, egg is hard-coded. I'd find it nice to have a flexible quasiquotation solution.

I think it would be nice to have a tidy evaluation solution for a general use case like this for nest(). My real example is a bit too big, so nest_wrapper() illustrates the minimal behavior. In code, the use case looked like this (all arguments before .key go into the dots):

# Nest chemical reference data and sample group variables
spc_refdata_BDM <-
  spc_refdata_BDM_unnested %>%
  nest_keep_lcols(
    As_tot, B_AAE10, BS, Ca_AAE10, CaCO3, Cd_tot, carbon_percent,
    nitrogen_percent, CN, Corg, cTOC, cTOC_pool_20, TC_pool_20, TN_pool_20,
    TS, Cu_AAE10, DNA_Menge, Fe_AAE10, Humus, K_AAE10, K_AAE10_GRUD,
    KAKpot_cmol_kg, Mg_AAE10, Mg_AAE10_GRUD, Mn_AAE10, P_AAE10, P_AAE10_GRUD,
    Pb_tot, U_tot, Zn_AAE10, pH, RG_FE, w_gFP, Sand, Schluff, Ton,
    .key = "refdata")

philipp-baumann commented 5 years ago

I now have a solution; great that you mentioned that := works here. I'm happy with using dplyr::vars() (this even makes the function interface a bit cleaner, because nest_cols states the intent, unlike the less informative dots):

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble::tibble(
  a = 1:10,
  b = 1:10,
  c = 1:10, 
  d = list(rep(1:10, 10))
)

nest_wrapper <- function(.data, nest_cols, new_col = "data") {
  new_col <- rlang::enquo(new_col)
  nest_cols_nm <- purrr::map_chr(nest_cols, rlang::as_name)
  tidyr::nest(.data = .data, !!new_col := tidyselect::one_of(nest_cols_nm))
}

nest_wrapper(.data = df, nest_cols = vars(a, b, c), new_col = "refdata")
#> # A tibble: 1 x 2
#>   d                  refdata
#>   <list>      <list<df[,3]>>
#> 1 <int [100]>       [10 × 3]

Created on 2019-10-02 by the reprex package (v0.3.0)
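
If the original dots interface is preferred over vars(), one possible variant is to capture the dots with enquos() and splice them into a single c() selection on the right-hand side of :=. This is only a sketch, assuming tidyr >= 1.0.0 and rlang >= 0.4.0; nest_dots() is a hypothetical name, and the list-column d from the example above is left out to keep the grouping simple:

```r
library(tidyr)

# Hypothetical wrapper: bare column names go in the dots,
# the name of the new list-column is given as a string in .key.
nest_dots <- function(.data, ..., .key = "data") {
  cols <- rlang::enquos(...)
  # !!.key := injects the name on the left-hand side;
  # c(!!!cols) splices all captured columns into one selection.
  nest(.data, !!.key := c(!!!cols))
}

df <- tibble::tibble(a = 1:10, b = 1:10, c = 1:10)

out <- nest_dots(df, a, b, .key = "refdata")
names(out)
```

Here a and b end up inside the refdata list-column, while c stays as a regular column that defines the rows.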