tidymodels / dials

Tools for creating tuning parameter values
https://dials.tidymodels.org/

grid_latin_hypercube() varies even though size is set. #196

Closed nvelden closed 2 years ago

nvelden commented 2 years ago

When I execute the code below multiple times, I get grid sizes ranging from 8 to 10, even though size is set to 10:

dials::grid_latin_hypercube(
  x = mtry(c(1L, 15L), trans = NULL),
  size = 10)

# A tibble: 9 x 1
   mtry
  <int>
1     7
2     5
3     6
4     2
5    14
6    11
7     4
8     9
9    13

# A tibble: 10 x 1
    mtry
   <int>
 1     5
 2    11
 3    14
 4    12
 5     4
 6     6
 7     7
 8     8
 9     1
10    13

# A tibble: 8 x 1
   mtry
  <int>
1     8
2     5
3    10
4    14
5     9
6     2
7     4
8    11
mattwarkentin commented 2 years ago

@nvelden I believe this is the expected behaviour. I think duplicate sets of hyperparameters (i.e. rows) are removed to avoid redundancy. Since you are only using a single parameter with 15 possible values and requesting 10, there is a higher likelihood of duplicates, which are removed before the tibble is returned.

See below: this occurs more often when fewer hyperparameter combinations are available, but it sometimes happens even when there are many possible choices.

library(dials)
purrr::map_int(5:100, ~ nrow(grid_latin_hypercube(mtry(c(1, .x)), size = 10)))
#>  [1]  5  6  6  7  8  8  7 10  8 10  8 10 10 10  9 10  9 10 10 10 10 10 10 10 10
#> [26] 10 10 10  9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10  9 10 10 10
#> [51]  9 10 10 10 10 10  9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10  9
#> [76] 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
nvelden commented 2 years ago

@mattwarkentin Thanks a lot, that explains it. It still doesn't quite make sense to me, though. You would think duplicates would be replaced rather than removed when size is set and the search space allows for it.

mattwarkentin commented 2 years ago

I tend to agree with you, though I don't know quite enough about the mechanics of the Latin hypercube algorithm. But I do agree that it seems sensible that, if the search space supports size unique sets, then this many should always be returned. Maybe @topepo or @hfrick can provide some further insights.

nvelden commented 2 years ago

It might also lead to false conclusions when comparing tuning grids.

mattwarkentin commented 2 years ago

This may be a bit of an edge case, though, where one is tuning over a small number of parameters with a small range of possible values. Collisions are much less likely when tuning over many parameters with lots of values. If you only have 15 possible combinations and need 10, then maybe grid_max_entropy() or grid_regular() (or even expand_grid()) is fine, where the expected number is always returned; a Latin hypercube might be overkill.
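
For example, grid_regular() involves no random numbers, so its size is stable across calls. A quick sketch (levels = 10 is an arbitrary choice; an integer parameter can still yield fewer rows if levels exceeds the number of distinct values in its range):

library(dials)

# A deterministic grid: `levels` sets how many values are taken per parameter,
# so repeated calls return the same tibble
grid_regular(mtry(c(1L, 15L)), levels = 10)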

nvelden commented 2 years ago

grid_max_entropy() also returns grids of varying size when I try:

dials::grid_max_entropy(
  x = mtry(c(1L, 15L), trans = NULL),
  size = 13)

or

dials::grid_max_entropy(
  x = mtry(c(1L, 15L), trans = NULL),
  y = mtry(c(1L, 15L), trans = NULL),
  size = 100)

In my case, I am trying to plot the intermediate results for each trial. It is quite difficult to make a comparison if the grid, and thus the set of trials that are run, differs in size each time.

Just one other question: does grid search only work with whole numbers? How could I expand the search space to include numbers with one decimal place?

topepo commented 2 years ago

The way that these functions work is different from how regular or random grids are created. We translate the range of parameter values to a [0, 1] scale (which is what the DiceDesign package expects) and then let it give us a design back (also in [0, 1] units). For example:

set.seed(1)
DiceDesign::lhsDesign(n = 10, dimension = 1)
#> $n
#> [1] 10
#> 
#> $dimension
#> [1] 1
#> 
#> $design
#>             [,1]
#>  [1,] 0.60972652
#>  [2,] 0.84502707
#>  [3,] 0.25130192
#>  [4,] 0.09017551
#>  [5,] 0.75690209
#>  [6,] 0.33872284
#>  [7,] 0.58371673
#>  [8,] 0.17438072
#>  [9,] 0.99639421
#> [10,] 0.42200070
#> 
#> $randomized
#> [1] TRUE
#> 
#> $seed
#> [1] 1638200604

Created on 2021-11-29 by the reprex package (v2.0.0)

To translate back, we rescale the values to their true range and truncate them for integer or qualitative parameters. That last step is what leads to the loss of design points.
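
A rough sketch of that truncation step (the floor-based mapping below is an assumption for illustration, not the exact dials internals):

set.seed(1)
# 10 Latin hypercube points on the [0, 1] scale
design <- DiceDesign::lhsDesign(n = 10, dimension = 1)$design[, 1]

# map back to the integer range 1..15 and truncate
lower <- 1
upper <- 15
mtry_vals <- floor(design * (upper - lower + 1)) + lower

mtry_vals          # 10 raw values
unique(mtry_vals)  # after duplicates are dropped, fewer than 10 may remain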

The problem is abated when there are multiple parameters (even if they are all integers) and doesn't come up at all for real-valued parameters:

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
tidymodels_prefer()

ex <- parameters(mtry(c(1, 10)), min_n(c(1, 10)))

set.seed(1)
map_int(1:100, ~ nrow(grid_latin_hypercube(ex, size = 10))) %>% 
  table()
#> .
#>  9 10 
#>  6 94

Created on 2021-11-29 by the reprex package (v2.0.0)

We could scale up size and trim the results to get the right number of points. It's not a great idea since it isn't foolproof and the trimming would negatively affect the design criteria.
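
A user-side version of that idea, for anyone who needs a fixed number of rows today (a sketch only, with the same caveats):

library(dials)

set.seed(1)
# Request more points than needed, then keep at most the first 10 rows.
# This is not foolproof and weakens the design criteria, as noted above.
grid <- grid_latin_hypercube(mtry(c(1L, 15L)), size = 20)
head(grid, 10)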

That said, we should improve the documentation to let people know that size is the maximum size that they should expect for Latin hypercube and maximum entropy designs (and that these designs use random numbers).

nvelden commented 2 years ago

I usually work with the Python Optuna package for hyperparameter tuning. It uses several suggest functions to define the search space:

    param = {
        "objective": "binary",
        "metric": "binary_logloss",
        "verbosity": -1,
        "boosting_type": "gbdt",
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }

Maybe I missed it completely, but is there a similar option in mtry() to suggest floats?

topepo commented 2 years ago

All of the tree ensemble methods, apart from xgboost, use integers for mtry. We do have predictor_prop(), which is the fractional parameterization. However, that doesn't solve the issue, since the proportion is eventually converted to an integer. You would get duplicate mtry values, and we use distinct() on the grid to remove duplicate tuning parameter combinations.
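
To illustrate (a sketch; converting the proportion to a count of 15 predictors uses an assumed mapping, not dials code):

library(dials)

set.seed(1)
# The proportion is real-valued, so the Latin hypercube keeps all 10 rows
prop_grid <- grid_latin_hypercube(predictor_prop(c(0, 1)), size = 10)
nrow(prop_grid)

# Converting to integer mtry values for, say, 15 predictors reintroduces
# ties, which distinct() would then remove
mtry_vals <- round(prop_grid$predictor_prop * 14) + 1
unique(mtry_vals)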

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.