tidymodels / rsample

Classes and functions to create and summarize resampling objects
https://rsample.tidymodels.org

Error: `x` must be a vector, not a `rsplit/vfold_split` object #123

Closed: gcameron89777 closed this issue 4 years ago

gcameron89777 commented 4 years ago

Error: x must be a vector, not a rsplit/vfold_split object

I get the above error when calling tidyr::crossing() just after creating an rsplit object with vfold_cv(). The error is intermittent: it occurs on some runs but not others. Others have been able to reproduce it, but also only some of the time.

Example csv file to reproduce.

library(tidyverse)
library(rsample)

example_data <- read_csv("example_data.csv")

example_split <- initial_split(example_data, 0.9)
training_data <- training(example_split)
testing_data <- testing(example_split)

# 5 fold split stratified on j
set.seed(123)
train_cv <- vfold_cv(training_data, 5, strata = j) %>% 

  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)), 
         validate = map(splits, ~testing(.x)))

blah <- train_cv %>% 
  crossing(mtry = c(1,2))
> Error: `x` must be a vector, not a `rsplit/vfold_split` object

train_cv looks like this:

train_cv
#  5-fold cross-validation using stratification 
# A tibble: 5 x 4
  splits            id    train                  validate              
* <named list>      <chr> <named list>           <named list>          
1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>
2 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
3 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
4 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
5 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>

I would like to use the same train_cv object in my script for trying different models with their own tuning parameters. In the example above, if crossing(mtry = c(1, 2)) works, the desired output would take train_cv and make it look like this:

# A tibble: 10 x 5
   splits            id    train                  validate                mtry
   <named list>      <chr> <named list>           <named list>           <dbl>
 1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     1
 2 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     2
 3 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 4 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 5 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 6 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 7 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 8 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 9 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
10 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2

Session Info:

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux 2

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Metrics_0.1.4     rsample_0.0.5     rlang_0.4.2       odbc_1.2.2        DBI_1.1.0         dbplyr_1.4.2      rmarkdown_2.0     kableExtra_1.1.0 
 [9] scales_1.1.0      doParallel_1.0.15 iterators_1.0.12  foreach_1.4.7     lubridate_1.7.4   forcats_0.4.0     stringr_1.4.0     dplyr_0.8.3      
[17] purrr_0.3.3       readr_1.3.1       tidyr_1.0.0       tibble_2.1.3      ggplot2_3.2.1     tidyverse_1.3.0   tufte_0.5        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3        lattice_0.20-38   listenv_0.8.0     utf8_1.1.4        assertthat_0.2.1  zeallot_0.1.0     digest_0.6.23     packrat_0.5.0    
 [9] R6_2.4.1          cellranger_1.1.0  backports_1.1.5   reprex_0.3.0      evaluate_0.14     httr_1.4.1        pillar_1.4.3      lazyeval_0.2.2   
[17] readxl_1.3.1      data.table_1.12.8 rstudioapi_0.10   furrr_0.1.0       blob_1.2.0        webshot_0.5.2     bit_1.1-15.1      munsell_0.5.0    
[25] broom_0.5.3       compiler_3.6.0    modelr_0.1.5      xfun_0.12         pkgconfig_2.0.3   globals_0.12.5    htmltools_0.4.0   tidyselect_0.2.5 
[33] codetools_0.2-16  future_1.16.0     fansi_0.4.1       viridisLite_0.3.0 crayon_1.3.4      withr_2.1.2       grid_3.6.0        nlme_3.1-143     
[41] jsonlite_1.6      gtable_0.3.0      lifecycle_0.1.0   magrittr_1.5      cli_2.0.1         stringi_1.4.5     fs_1.3.1          xml2_1.2.2       
[49] generics_0.0.2    vctrs_0.2.1       tools_3.6.0       bit64_0.9-7       glue_1.3.1        hms_0.5.3         colorspace_1.4-1  rvest_0.3.5      
[57] knitr_1.27        haven_2.2.0      

I'm not sure whether this is an actual issue or a problem with my code. I tried the RStudio Community forum first.

gacolitti commented 4 years ago

Here is perhaps a simpler reprex:

library(rsample)
#> Loading required package: tidyr
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

mtcars %>% 
  vfold_cv(10) %>% 
  crossing(x = c(1, 2, 3))
#> `x` must be a vector, not a `rsplit/vfold_split` object

mtcars %>% 
  vfold_cv(10) %>% 
  nest(data = c(id)) %>% 
  unnest(cols = c(data)) %>% 
  crossing(x = c(1, 2, 3))
#> `x` must be a vector, not a `rsplit/vfold_split` object

mtcars %>% 
  vfold_cv(10) %>% 
  group_by(id) %>% 
  nest() %>% 
  unnest(cols = c(data)) %>% 
  crossing(x = c(1, 2, 3))
#> # A tibble: 30 x 3
#>    id     splits             x
#>    <chr>  <list>         <dbl>
#>  1 Fold01 <split [28/4]>     1
#>  2 Fold01 <split [28/4]>     2
#>  3 Fold01 <split [28/4]>     3
#>  4 Fold02 <split [28/4]>     1
#>  5 Fold02 <split [28/4]>     2
#>  6 Fold02 <split [28/4]>     3
#>  7 Fold03 <split [29/3]>     1
#>  8 Fold03 <split [29/3]>     2
#>  9 Fold03 <split [29/3]>     3
#> 10 Fold04 <split [29/3]>     1
#> # ... with 20 more rows

Created on 2020-02-13 by the reprex package (v0.3.0)

For some reason, grouping by id and then nesting and unnesting works.
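Another possible workaround, not proposed by anyone in this thread, is to keep the rsplit objects out of crossing() altogether: cross only the fold ids with the tuning values, then join the splits back on by id. The sketch below assumes that the join only copies list elements and so never hits the vector check that raises the error; it has not been checked against the package versions listed in the session info above. The names `folds`, `grid`, and `result` are made up for illustration.

```r
library(rsample)
library(tidyr)
library(dplyr)

set.seed(123)
folds <- vfold_cv(mtcars, 10)

# Cross only the fold ids (a plain character vector) with the tuning values,
# so crossing() never receives the list column of rsplit objects.
grid <- crossing(id = folds$id, x = c(1, 2, 3))

# Attach the splits back by fold id. as_tibble() drops the rset class first,
# so the join works on an ordinary tibble (assumption: this sidesteps the error).
result <- left_join(grid, as_tibble(folds), by = "id")

result
```

The result should have one row per fold and tuning value, with the same rsplit object repeated for each value of `x` within a fold.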

topepo commented 4 years ago

I'm guessing that it was an issue with the version of tidyr that you were using. It works for me (see below).

Two things though:

library(tidyverse)
library(rsample)

example_data <- read_csv("~/Downloads/example_data.csv")
#> Parsed with column specification:
#> cols(
#>   a = col_double(),
#>   b = col_double(),
#>   c = col_double(),
#>   d = col_double(),
#>   e = col_double(),
#>   f = col_double(),
#>   g = col_double(),
#>   h = col_double(),
#>   i = col_double(),
#>   j = col_logical()
#> )

example_split <- initial_split(example_data, 0.9)
training_data <- training(example_split)
testing_data <- testing(example_split)

# 5 fold split stratified on j
set.seed(123)
train_cv <- vfold_cv(training_data, 5, strata = j) 

# Before unpacking: 
lobstr::obj_size(train_cv)
#> 8,289,640 B

train_cv <- 
  train_cv %>% 
  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)), 
         validate = map(splits, ~testing(.x)))

# After unpacking: 
lobstr::obj_size(train_cv)
#> 42,499,296 B

blah <- train_cv %>% 
  crossing(mtry = c(1,2))
blah
#> # A tibble: 10 x 5
#>    splits            id    train                  validate                mtry
#>    <named list>      <chr> <named list>           <named list>           <dbl>
#>  1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     1
#>  2 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     2
#>  3 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
#>  4 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
#>  5 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
#>  6 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
#>  7 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
#>  8 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
#>  9 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
#> 10 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2

lobstr::obj_size(blah)
#> 42,498,352 B

Created on 2020-03-29 by the reprex package (v0.3.0)
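Since the guess above is that the installed tidyr version was responsible, a quick way to compare environments is to print the versions of the packages involved. This is a minimal sketch; the choice of packages to inspect (tidyr, vctrs, rsample) is an assumption based on the error message and the session info in the original report, and the thread does not say which version resolves the problem.

```r
# Packages that plausibly take part in the error (an assumption, see above).
pkgs <- c("tidyr", "vctrs", "rsample")

# Look up each installed version; report NA for packages that are not installed.
installed <- vapply(pkgs, function(p) {
  if (requireNamespace(p, quietly = TRUE)) {
    as.character(utils::packageVersion(p))
  } else {
    NA_character_
  }
}, character(1))

print(installed)

# To update to the current CRAN releases before re-running the reprex:
# install.packages(pkgs)
```

Running this on both a failing and a working machine would show whether the two setups actually differ in any of these packages.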

gcameron89777 commented 4 years ago

Thank you Max!
