tidymodels / dials

Tools for creating tuning parameter values
https://dials.tidymodels.org/

sample_size() doesn't seem to work within a tuning grid #111

Closed UnclAlDeveloper closed 4 years ago

UnclAlDeveloper commented 4 years ago

Using sample_size(c(1L, nrow(Data))) when trying to create a tuning grid for xgboost gives the error "Error: sample_size should be within [0,1]", so presumably the value is expected to be a proportion. However, sample_size(c(0, 1)) only puts the integer values 0 and 1 in the grid as sample sizes, and I'm not sure what a sample_size of 0 means. Trying to avoid the integer conversion by using sample_size(c(0.01, 0.99)) within the tuning grid gives an error that the values must be integers.

EmilHvitfeldt commented 4 years ago

If you want a percentage of the data you can use sample_prop() which works the same as sample_size() but as a proportion of the total sample.
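
As a rough sketch of that suggestion (assuming the `dials` package is installed; the range `c(0.4, 0.9)` and the name `prop_grid` are only illustrative), a grid of proportions can be built directly from `sample_prop()`. The column is renamed explicitly so that it matches the `sample_size` id of a `tune()` placeholder in `boost_tree()`:

```r
library(dials)

# sample_size() describes an integer count of rows;
# sample_prop() describes a proportion of the training set instead.
prop_grid <- grid_regular(sample_prop(c(0.4, 0.9)), levels = 3)

# Name the column after the argument being tuned (an explicit rename,
# so the grid lines up with `boost_tree(sample_size = tune())`).
names(prop_grid) <- "sample_size"

prop_grid
```

The resulting tibble holds doubles between 0.4 and 0.9 rather than integers, which is the scale the xgboost error message asks for.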

UnclAlDeveloper commented 4 years ago

Emil, sample_prop seems to work. Thank you.

Although this issue doesn't seem major, it should remain open as one that needs fixing.

juliasilge commented 4 years ago

Here is a reprex demonstrating the problem:

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.6      ✓ recipes   0.1.12
#> ✓ dials     0.0.6      ✓ rsample   0.0.6 
#> ✓ dplyr     0.8.5      ✓ tibble    3.0.1 
#> ✓ ggplot2   3.3.0      ✓ tune      0.1.0 
#> ✓ infer     0.5.1      ✓ workflows 0.1.1 
#> ✓ parsnip   0.1.0      ✓ yardstick 0.0.6 
#> ✓ purrr     0.3.4
#> ── Conflicts ───────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()  masks scales::discard()
#> x dplyr::filter()   masks stats::filter()
#> x dplyr::lag()      masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step()   masks stats::step()

xgb_spec <- boost_tree(sample_size = tune()) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

car_boot <- bootstraps(mtcars, times = 5)

size_grid <- grid_regular(sample_size(c(1, nrow(mtcars))))
size_grid
#> # A tibble: 3 x 1
#>   sample_size
#>         <int>
#> 1           1
#> 2          16
#> 3          32

xbg_res <- xgb_spec %>%
  tune_grid(mpg ~ .,
            resamples = car_boot,
            grid = size_grid)
#> x Bootstrap1: model 2/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap1: model 3/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap2: model 2/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap2: model 3/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap3: model 2/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap3: model 3/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap4: model 2/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap4: model 3/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap5: model 2/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap5: model 3/3: Error: `sample_size` should be within [0,1].
xbg_res
#> # Bootstrap sampling 
#> # A tibble: 5 x 4
#>   splits          id         .metrics         .notes          
#>   <list>          <chr>      <list>           <list>          
#> 1 <split [32/12]> Bootstrap1 <tibble [2 × 4]> <tibble [2 × 1]>
#> 2 <split [32/14]> Bootstrap2 <tibble [2 × 4]> <tibble [2 × 1]>
#> 3 <split [32/10]> Bootstrap3 <tibble [2 × 4]> <tibble [2 × 1]>
#> 4 <split [32/11]> Bootstrap4 <tibble [2 × 4]> <tibble [2 × 1]>
#> 5 <split [32/12]> Bootstrap5 <tibble [2 × 4]> <tibble [2 × 1]>

Created on 2020-05-01 by the reprex package (v0.3.0)

juliasilge commented 4 years ago

Maybe we just need to change the documentation or clarify how this is used. In the table under Engine Details in the docs at ?boost_tree, sample_size is set up as the same argument as subsample, sample, and subsampling_rate, i.e. a proportion rather than a count.

topepo commented 4 years ago

How does this look?


Note that, for most engines to `boost_tree()`, the `sample_size` argument is in
terms of the number of training set points. `xgboost` parameterizes this as the 
_proportion_ of training set samples. When using the `tune` or `dials` package, 
the `dials` `sample_prop()` function can be used. For example, using a 
parameter set: 

```{r xgb-update, eval = FALSE}
mod <- 
  boost_tree(sample_size = tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

# update the parameters using the `dials` function
mod_param <- 
  mod %>% 
  parameters() %>% 
  update(sample_size = sample_prop(c(0.4, 0.9)))
```

juliasilge commented 4 years ago

Slight edits here:

Note that, for most engines to `boost_tree()`, the `sample_size` argument is in
terms of the _number_ of training set points. The `xgboost` package parameterizes this as the 
_proportion_ of training set samples instead. When using the `tune` or `dials` packages, 
the `dials::sample_prop()` function can be used in that case. For example, using a 
parameter set: 
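
A minimal sketch of that workflow, assuming a recent tidymodels release in which `extract_parameter_set_dials()` is available (on releases contemporary with this issue, `parameters()` played the same role). The names `xgb_spec`, `xgb_param`, and `size_grid` and the range `c(0.4, 0.9)` are illustrative only:

```r
library(tidymodels)

# Model specification with the sampling argument marked for tuning
xgb_spec <-
  boost_tree(sample_size = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Swap the default parameter object for a proportion-based one
xgb_param <-
  xgb_spec %>%
  extract_parameter_set_dials() %>%
  update(sample_size = sample_prop(c(0.4, 0.9)))

# The grid column keeps the `sample_size` id but now holds proportions
size_grid <- grid_regular(xgb_param, levels = 3)
size_grid
```

A grid built this way can then be passed as the `grid` argument of `tune_grid()`, in place of the integer grid from the reprex above.
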

topepo commented 4 years ago

Fixed in tidymodels/parsnip#328, but the pkgdown site is not working right now; the redirect to parsnip.tidymodels.org fails.

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.