UnclAlDeveloper closed this issue 4 years ago
If you want a percentage of the data, you can use `sample_prop()`, which works the same as `sample_size()` but as a proportion of the total sample.
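As a rough illustration of the difference between the two, here is a minimal sketch. It assumes only that the `dials` package is installed; the ranges and the names `count_grid` and `prop_grid` are arbitrary choices for illustration.

```r
library(dials)

# sample_size() describes an integer count of observations
count_grid <- grid_regular(sample_size(c(1L, 32L)), levels = 3)

# sample_prop() describes a proportion of the observations
prop_grid <- grid_regular(sample_prop(c(0.1, 1)), levels = 3)

count_grid
prop_grid
```

The first grid holds integer counts, while the second holds proportions, which is the scale that the xgboost error message in this issue asks for.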
Emil, sample_prop seems to work. Thank you.
Although this issue doesn't seem major, it should remain open as one that needs fixing.
Here is a reprex demonstrating the problem:
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom 0.5.6 ✓ recipes 0.1.12
#> ✓ dials 0.0.6 ✓ rsample 0.0.6
#> ✓ dplyr 0.8.5 ✓ tibble 3.0.1
#> ✓ ggplot2 3.3.0 ✓ tune 0.1.0
#> ✓ infer 0.5.1 ✓ workflows 0.1.1
#> ✓ parsnip 0.1.0 ✓ yardstick 0.0.6
#> ✓ purrr 0.3.4
#> ── Conflicts ───────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step() masks stats::step()
xgb_spec <- boost_tree(sample_size = tune()) %>%
  set_mode("regression") %>%
  set_engine("xgboost")
car_boot <- bootstraps(mtcars, times = 5)
size_grid <- grid_regular(sample_size(c(1, nrow(mtcars))))
size_grid
#> # A tibble: 3 x 1
#>   sample_size
#>         <int>
#> 1           1
#> 2          16
#> 3          32
xbg_res <- xgb_spec %>%
  tune_grid(mpg ~ .,
            resamples = car_boot,
            grid = size_grid)
#> x Bootstrap1: model 2/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap1: model 3/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap2: model 2/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap2: model 3/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap3: model 2/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap3: model 3/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap4: model 2/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap4: model 3/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap5: model 2/3: Error: `sample_size` should be within [0,1].
#> x Bootstrap5: model 3/3: Error: `sample_size` should be within [0,1].
xbg_res
#> # Bootstrap sampling
#> # A tibble: 5 x 4
#>   splits          id         .metrics         .notes
#>   <list>          <chr>      <list>           <list>
#> 1 <split [32/12]> Bootstrap1 <tibble [2 × 4]> <tibble [2 × 1]>
#> 2 <split [32/14]> Bootstrap2 <tibble [2 × 4]> <tibble [2 × 1]>
#> 3 <split [32/10]> Bootstrap3 <tibble [2 × 4]> <tibble [2 × 1]>
#> 4 <split [32/11]> Bootstrap4 <tibble [2 × 4]> <tibble [2 × 1]>
#> 5 <split [32/12]> Bootstrap5 <tibble [2 × 4]> <tibble [2 × 1]>
Created on 2020-05-01 by the reprex package (v0.3.0)
Maybe we just need to change the documentation or clarify how this is used. In the table under Engine Details in the docs at `?boost_tree`, `sample_size` is set up as the same argument as `subsample`, `sample`, and `subsampling_rate`, i.e. a proportion rather than a count.
How does this look?
Note that, for most engines to `boost_tree()`, the `sample_size` argument is in
terms of the number of training set points. `xgboost` parameterizes this as the
_proportion_ of training set samples. When using the `tune` or `dials` package,
the `dials` `sample_prop()` function can be used. For example, using a
parameter set:
```{r xgb-update, eval = FALSE}
mod <-
  boost_tree(sample_size = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# update the parameters using the `dials` function
mod_param <-
  mod %>%
  parameters() %>%
  update(sample_size = sample_prop(c(0.4, 0.9)))
```
Slight edits here:
Note that, for most engines to `boost_tree()`, the `sample_size` argument is in
terms of the _number_ of training set points. The `xgboost` package parameterizes this as the
_proportion_ of training set samples instead. When using the `tune` or `dials` packages,
the `dials::sample_prop()` function can be used in that case. For example, using a
parameter set:
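A sketch of that parameter-set approach carried through to a tuning grid, using the regression setup from the reprex above. It assumes a recent tidymodels release, where `extract_parameter_set_dials()` plays the role that `parameters()` plays in this thread; with the 2020-era versions shown in the reprex, `parameters()` would be used instead. The names `xgb_param` and `prop_grid` are made up for illustration.

```r
library(tidymodels)

xgb_spec <-
  boost_tree(sample_size = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Swap the default parameter object for a proportion-based one.
# Assumption: extract_parameter_set_dials() is available (recent releases);
# older releases would call parameters() here, as in the thread.
xgb_param <-
  xgb_spec %>%
  extract_parameter_set_dials() %>%
  update(sample_size = sample_prop(c(0.4, 0.9)))

# The grid column keeps the `sample_size` id, so it lines up with tune()
prop_grid <- grid_regular(xgb_param, levels = 3)
prop_grid
```

A grid built this way could then be passed as `grid = prop_grid` in the `tune_grid()` call from the reprex, in place of the count-based `size_grid`.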
Fixed in tidymodels/parsnip#328, but the pkgdown site is currently not working: the redirect to parsnip.tidymodels.org fails.
Using `sample_size(c(1L, nrow(Data)))` when trying to create a tuning grid for xgboost gives the error "Error: `sample_size` should be within [0,1]", so presumably it expects a proportion. `sample_size(c(0, 1))` only uses the integer values 0 and 1 as the sample sizes in the grid. I'm not quite sure what a `sample_size` of 0 means. Trying to stop it from converting them to integers by using `sample_size(c(0.01, 0.99))` within the tuning grid gives an error that the values must be integers.