tidymodels / rsample

Classes and functions to create and summarize resampling objects
https://rsample.tidymodels.org

How to pass a variable to group_vfold_cv's partition number argument #81

Closed htlin closed 2 years ago

htlin commented 5 years ago

Hi, I'd like to create several nested_cv objects that share the same partition configuration, as follows:

outer_cv <- 5
inner_cv <- 4
sampling1 <- nested_cv(all_dataset,
                      outside = group_vfold_cv(v = outer_cv, group = "Rep"),
                      inside = group_vfold_cv(v = inner_cv, group = "Rep"))
sampling2 <- ...
...

However, I get this error:

object 'outer_cv' not found

It looks like outer_cv is out of scope inside the group_vfold_cv function. Do you have any recommendations? Would tidy evaluation help in this case?

topepo commented 5 years ago

Can you dummy up a small data set and provide a minimal reprex (reproducible example)? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you!

If you've never heard of a reprex before, start by reading "What is a reprex", and follow the advice further down that page.

htlin commented 5 years ago

Yes, sorry about that. Here is the reprex:

library(rsample)
#> Loading required package: tidyr

run_experiment <- function(all_dataset) {
  outer_cv <- 5
  inner_cv <- 4
  sampling1 <- nested_cv(all_dataset,
                         outside = group_vfold_cv(v = outer_cv, group = "Rep"),
                         inside = group_vfold_cv(v = inner_cv, group = "Rep"))

  sampling2 <- nested_cv(all_dataset,
                         outside = group_vfold_cv(v = outer_cv, group = "Rep2"),
                         inside = group_vfold_cv(v = inner_cv, group = "Rep2"))
}

all_dataset <- matrix(nrow = 50, ncol = 5, 0) %>% as.data.frame
all_dataset$Rep <- 1:5
all_dataset$Rep2 <- 5:1
run_experiment(all_dataset)
#> Error in group_vfold_splits(data = data, group = group, v = v): object 'outer_cv' not found

Created on 2019-02-01 by the reprex package (v0.2.1)

DavisVaughan commented 5 years ago

In theory you should be able to do this with no problem, so I'd call it a bug. I think the environment could be captured (maybe with parent.frame()?) and then the eval() call could specify that as the environment.

Alternatively, it would probably be beneficial (and not too bad) to rewrite using quosures so we won't have to worry about the environments at all. The only weird thing would be inserting the data into the call.
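
For readers following along, here is a minimal base R sketch of the mechanism described above. It is not the rsample source: make_folds(), nested_broken(), and nested_fixed() are hypothetical stand-ins, and it assumes the captured call ends up being evaluated in a frame that cannot see the caller's local variables.

```r
# Hypothetical stand-in for a resampling function (not part of rsample).
make_folds <- function(data, v) {
  list(n = nrow(data), v = v)
}

# Mimics the failure: the captured call is evaluated in this function's own
# frame, whose enclosure is where the function was defined, not the caller.
nested_broken <- function(data, outside) {
  cl <- match.call()$outside
  cl <- as.call(c(as.list(cl), list(data = quote(data))))
  eval(cl)
}

# Sketch of the parent.frame() idea: remember the caller's environment and
# evaluate the call there, with the data inlined into the call.
nested_fixed <- function(data, outside) {
  caller <- parent.frame()
  cl <- match.call()$outside
  cl <- as.call(c(as.list(cl), list(data = data)))
  eval(cl, envir = caller)
}

run <- function(df) {
  v_local <- 3  # only visible inside run()
  list(
    broken = tryCatch(
      nested_broken(df, make_folds(v = v_local)),
      error = function(e) conditionMessage(e)
    ),
    fixed = nested_fixed(df, make_folds(v = v_local))
  )
}

res <- run(data.frame(x = 1:10))
```

Here res$broken holds an error message that mentions v_local, while res$fixed contains the value that was local to run().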

DavisVaughan commented 5 years ago

In the meantime, if you want to program around it, you can do:

library(rsample)
#> Loading required package: tidyr
#> 
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#> 
#>     fill

run_experiment <- function(all_dataset) {
  outer_cv <- 5
  inner_cv <- 4

  sampling1_call <- rlang::expr(
    nested_cv(
      all_dataset,
      outside = group_vfold_cv(v = !!outer_cv, group = "Rep"),
      inside = group_vfold_cv(v = !!inner_cv, group = "Rep")
    )
  )

  sampling2_call <- rlang::expr(
    nested_cv(
      all_dataset,
      outside = group_vfold_cv(v = !!outer_cv, group = "Rep2"),
      inside = group_vfold_cv(v = !!inner_cv, group = "Rep2")
    )
  )

  sampling1 <- rlang::eval_tidy(sampling1_call)
  sampling2 <- rlang::eval_tidy(sampling2_call)

  list(sampling1, sampling2)
}

all_dataset <- matrix(nrow = 50, ncol = 5, 0) %>% as.data.frame
all_dataset$Rep <- 1:5
all_dataset$Rep2 <- 5:1
run_experiment(all_dataset)
#> [[1]]
#> [1] "nested_cv"      "group_vfold_cv" "rset"           "tbl_df"        
#> [5] "tbl"            "data.frame"    
#> # Nested resampling:
#> #  outer: Group 5-fold cross-validation
#> #  inner: Group 4-fold cross-validation
#> # A tibble: 5 x 3
#>   splits          id        inner_resamples 
#>   <list>          <chr>     <list>          
#> 1 <split [40/10]> Resample1 <tibble [4 × 2]>
#> 2 <split [40/10]> Resample2 <tibble [4 × 2]>
#> 3 <split [40/10]> Resample3 <tibble [4 × 2]>
#> 4 <split [40/10]> Resample4 <tibble [4 × 2]>
#> 5 <split [40/10]> Resample5 <tibble [4 × 2]>
#> 
#> [[2]]
#> [1] "nested_cv"      "group_vfold_cv" "rset"           "tbl_df"        
#> [5] "tbl"            "data.frame"    
#> # Nested resampling:
#> #  outer: Group 5-fold cross-validation
#> #  inner: Group 4-fold cross-validation
#> # A tibble: 5 x 3
#>   splits          id        inner_resamples 
#>   <list>          <chr>     <list>          
#> 1 <split [40/10]> Resample1 <tibble [4 × 2]>
#> 2 <split [40/10]> Resample2 <tibble [4 × 2]>
#> 3 <split [40/10]> Resample3 <tibble [4 × 2]>
#> 4 <split [40/10]> Resample4 <tibble [4 × 2]>
#> 5 <split [40/10]> Resample5 <tibble [4 × 2]>

Created on 2019-02-01 by the reprex package (v0.2.1.9000)

fbchow commented 5 years ago

What's the tidyeval equivalent of match.call()? I tried using quo and eval_tidy, but I'm not sure how to find the environment's parents.

The actual values of outer_cv and inner_cv are not picked up by match.call().
https://github.com/tidymodels/rsample/blob/775ac5559a477b39f1c23ef1380a7abb036d73fe/R/nest.R#L56-L58

So you can't evaluate the outside call: https://github.com/tidymodels/rsample/blob/775ac5559a477b39f1c23ef1380a7abb036d73fe/R/nest.R#L72

or the inside call: https://github.com/tidymodels/rsample/blob/775ac5559a477b39f1c23ef1380a7abb036d73fe/R/nest.R#L96-L99

DavisVaughan commented 5 years ago

@fbchow it will probably use enquo() and eval_tidy() as you are saying. When you evaluate the quosure using eval_tidy(), it will evaluate the quosure in the environment that it was specified in (which I think is the parent that you are referring to).

The weirdness for this example is that we are going to have to modify the expression of the quosure using something like rlang::call_modify() before evaluating it. It will likely look something like this:

library(rlang)
library(rsample)
#> Warning: package 'rsample' was built under R version 3.5.2
#> Loading required package: tidyr
dat <- data.frame(x = c(1, 2))

outside <- rlang::quo(bootstraps(times = 5))
outside
#> <quosure>
#> expr: ^bootstraps(times = 5)
#> env:  global

outside_modified <- rlang::call_modify(outside, data = dat)
outside_modified
#> <quosure>
#> expr: ^bootstraps(times = 5, data = <data.frame>)
#> env:  global

eval_tidy(outside_modified)
#> # Bootstrap sampling 
#> # A tibble: 5 x 2
#>   splits        id        
#>   <list>        <chr>     
#> 1 <split [2/1]> Bootstrap1
#> 2 <split [2/1]> Bootstrap2
#> 3 <split [2/0]> Bootstrap3
#> 4 <split [2/1]> Bootstrap4
#> 5 <split [2/1]> Bootstrap5

Created on 2019-02-11 by the reprex package (v0.2.1.9000)
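
As a rough sketch of how the same steps might look when the call arrives as a function argument and refers to a variable local to the caller: make_folds() and nested_sketch() below are hypothetical names used only for illustration, not rsample functions, and this is not necessarily how the eventual fix was written.

```r
library(rlang)

# Hypothetical stand-in for a resampling function (not part of rsample).
make_folds <- function(data, v) {
  list(n = nrow(data), v = v)
}

nested_sketch <- function(data, outside) {
  # Capture the expression together with the environment it was written in.
  outside_quo <- enquo(outside)
  # Insert the data frame itself into the captured call.
  outside_quo <- call_modify(outside_quo, data = data)
  # Evaluate in the quosure's environment, where v_local is visible.
  eval_tidy(outside_quo)
}

run <- function(df) {
  v_local <- 3  # only visible inside run()
  nested_sketch(df, make_folds(v = v_local))
}

res <- run(data.frame(x = 1:10))
```

Because the quosure remembers run()'s environment, v_local resolves without any explicit parent.frame() bookkeeping.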

DavisVaughan commented 5 years ago

You can also use data = expr(dat) rather than data = dat. This embeds the name dat into the call rather than the entire data frame. It shouldn't make a big difference for this example, though.

library(rlang)
library(rsample)
#> Warning: package 'rsample' was built under R version 3.5.2
#> Loading required package: tidyr
dat <- data.frame(x = c(1, 2))

outside <- rlang::quo(bootstraps(times = 5))
outside
#> <quosure>
#> expr: ^bootstraps(times = 5)
#> env:  global

outside_modified <- rlang::call_modify(outside, data = rlang::expr(dat))
outside_modified
#> <quosure>
#> expr: ^bootstraps(times = 5, data = dat)
#> env:  global

eval_tidy(outside_modified)
#> # Bootstrap sampling 
#> # A tibble: 5 x 2
#>   splits        id        
#>   <list>        <chr>     
#> 1 <split [2/0]> Bootstrap1
#> 2 <split [2/0]> Bootstrap2
#> 3 <split [2/0]> Bootstrap3
#> 4 <split [2/0]> Bootstrap4
#> 5 <split [2/0]> Bootstrap5

Created on 2019-02-11 by the reprex package (v0.2.1.9000)
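
One caveat when embedding a name rather than the value: the name has to be resolvable when the quosure is evaluated. The sketch below illustrates this under stated assumptions; count_rows(), by_value(), and by_name() are hypothetical helpers, and supplying the local data frame through eval_tidy()'s data argument is just one way to make the name visible.

```r
library(rlang)

# Hypothetical stand-in for a resampling function (not part of rsample).
count_rows <- function(data) nrow(data)

# Embed the data frame itself: nothing needs to be looked up later.
by_value <- function(tbl, call_quo) {
  eval_tidy(call_modify(call_quo, data = tbl))
}

# Embed only the name `tbl`. It is local to this function, so it is not
# visible from the quosure's environment; expose it through a data mask.
by_name <- function(tbl, call_quo) {
  modified <- call_modify(call_quo, data = expr(tbl))
  eval_tidy(modified, data = list(tbl = tbl))
}

q <- quo(count_rows())
n_value <- by_value(data.frame(x = 1:3), q)
n_name <- by_name(data.frame(x = 1:3), q)
```

Both calls count the same three rows; the difference is only in what gets stored inside the call before evaluation.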

juliasilge commented 2 years ago

At long last, this is now fixed:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(rsample)

run_experiment <- function(all_dataset) {
  outer_cv <- 5
  inner_cv <- 4
  sampling1 <- nested_cv(all_dataset,
                         outside = group_vfold_cv(v = outer_cv, group = "Rep"),
                         inside = group_vfold_cv(v = inner_cv, group = "Rep"))

  sampling2 <- nested_cv(all_dataset,
                         outside = group_vfold_cv(v = outer_cv, group = "Rep2"),
                         inside = group_vfold_cv(v = inner_cv, group = "Rep2"))

  list(sampling1, sampling2)
}

all_dataset <- matrix(nrow = 50, ncol = 5, 0) %>% as.data.frame()
all_dataset$Rep <- 1:5
all_dataset$Rep2 <- 5:1
run_experiment(tibble(all_dataset))
#> [[1]]
#> # Nested resampling:
#> #  outer: Group 5-fold cross-validation
#> #  inner: Group 4-fold cross-validation
#> # A tibble: 5 × 3
#>   splits          id        inner_resamples         
#>   <list>          <chr>     <list>                  
#> 1 <split [40/10]> Resample1 <group_vfold_cv [4 × 2]>
#> 2 <split [40/10]> Resample2 <group_vfold_cv [4 × 2]>
#> 3 <split [40/10]> Resample3 <group_vfold_cv [4 × 2]>
#> 4 <split [40/10]> Resample4 <group_vfold_cv [4 × 2]>
#> 5 <split [40/10]> Resample5 <group_vfold_cv [4 × 2]>
#> 
#> [[2]]
#> # Nested resampling:
#> #  outer: Group 5-fold cross-validation
#> #  inner: Group 4-fold cross-validation
#> # A tibble: 5 × 3
#>   splits          id        inner_resamples         
#>   <list>          <chr>     <list>                  
#> 1 <split [40/10]> Resample1 <group_vfold_cv [4 × 2]>
#> 2 <split [40/10]> Resample2 <group_vfold_cv [4 × 2]>
#> 3 <split [40/10]> Resample3 <group_vfold_cv [4 × 2]>
#> 4 <split [40/10]> Resample4 <group_vfold_cv [4 × 2]>
#> 5 <split [40/10]> Resample5 <group_vfold_cv [4 × 2]>

Created on 2021-11-18 by the reprex package (v2.0.1)
