hyperopt / hyperopt-sklearn

Hyper-parameter optimization for sklearn
hyperopt.github.io/hyperopt-sklearn

max_depth=None setting overridden in components._trees_hp_space #182

Closed B0Gec closed 2 years ago

B0Gec commented 2 years ago

I would like to specify a random forest search space without a depth limit (i.e. max_depth=None, as in sklearn.ensemble.RandomForestRegressor) via the components.random_forest_regression function. However, when calling random_forest_regression(max_depth=None), this setting is overridden by the search space hp.pchoice(name, [ (0.7, None), (0.1, 2), (0.1, 3), (0.1, 4), ]) because of these lines (in components):

335 def _trees_max_depth(name):
336     return hp.pchoice(name, [
337         (0.7, None),  # most common choice.
338         # Try some shallow trees.
339         (0.1, 2),
340         (0.1, 3),
341         (0.1, 4),
342     ])
... # and these lines (in the `components._trees_hp_space` function):
757        max_depth=(_trees_max_depth(name_func('max_depth'))
758                   if max_depth is None else max_depth),

So I propose a PR with the following change (inside _trees_hp_space):

742        max_depth="unspecified",
...
757        max_depth=(_trees_max_depth(name_func('max_depth'))
758                   if max_depth == "unspecified" else max_depth),
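
As an aside, a unique sentinel object is a common alternative to the "unspecified" string for this kind of default. Below is a minimal, self-contained sketch of that idiom; the function max_depth_space and its usage are hypothetical and not the actual patch in #183:

    from hyperopt import hp

    _UNSPECIFIED = object()  # unique marker that can never collide with None

    def _trees_max_depth(name):
        # Same default search space as in components.py above.
        return hp.pchoice(name, [(0.7, None), (0.1, 2), (0.1, 3), (0.1, 4)])

    def max_depth_space(name, max_depth=_UNSPECIFIED):
        # Fall back to the default search space only when the caller did not
        # pass max_depth at all; an explicit max_depth=None is kept as-is.
        if max_depth is _UNSPECIFIED:
            return _trees_max_depth(name)
        return max_depth

    print(max_depth_space('rfr.max_depth', max_depth=None))  # prints: None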

TL;DR:

Currently, components.random_forest_regression(max_depth=None) yields:

5  max_depth =
6   switch
7     hyperopt_param
8       Literal{RF.rfr_max_depth}
9       categorical
10        pos_args
11           Literal{0.7}
12           Literal{0.1}
13           Literal{0.1}
14           Literal{0.1}
15     Literal{None}
16     Literal{2}
17     Literal{3}
18     Literal{4}

Expected:

5 max_depth =
6   Literal{None}
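
For reference, the graph above comes from printing the returned pyll expression. A minimal sketch to reproduce it follows; the component name 'rfr' is arbitrary and the exact signature may vary between hpsklearn versions:

    from hpsklearn import components

    # Despite passing max_depth=None, the printed pyll graph still contains
    # the pchoice node over (None, 2, 3, 4) shown above.
    space = components.random_forest_regression('rfr', max_depth=None)
    print(space)
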
mandjevant commented 2 years ago

Thank you for your issue! I have implemented this in #183 and will close this when it gets merged.

B0Gec commented 2 years ago

Nice! That was a quick response and nice work with #183.

The situation described above also arises more generally: whenever hpsklearn's sentinel value None coincides with a valid value of the corresponding scikit-learn parameter. E.g. in components._trees_hp_space you have random_state=None, but None is also a possible value for the same argument in sklearn.ensemble.RandomForestRegressor. A second example is in your new rewrite_for_sklearn1.0.0 branch: in components.ensemble.forest._forest_hp_space, the argument max_leaf_nodes=None coincides with the default value of that parameter in scikit-learn's random forest. So there may be more possible improvements of this kind.
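
As a purely illustrative check (not part of hpsklearn), listing the RandomForestRegressor parameters whose scikit-learn default is None shows every place where None is both a legitimate user choice and a candidate for this kind of ambiguity:

    import inspect
    from sklearn.ensemble import RandomForestRegressor

    # Parameters whose scikit-learn default is None; for each of these,
    # passing None to the matching hpsklearn argument can be ambiguous
    # in the way described above.
    none_defaults = [
        name
        for name, p in inspect.signature(RandomForestRegressor).parameters.items()
        if p.default is None
    ]
    print(none_defaults)  # includes 'max_depth', 'random_state', 'max_leaf_nodes'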

mandjevant commented 2 years ago

Fixing the random_state parameter makes a lot of sense in most applications.

  1. It ensures that the generated splits are reproducible.
  2. Most importantly, it ensures that observed improvements are real improvements and not improvements by pure chance (see the small check sketched below). For a more in-depth explanation, I would suggest the Towards Data Science article by Ler Wei Han.
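
A small toy check of the reproducibility point (illustrative only, not hpsklearn code):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=200, n_features=10, random_state=0)

    # With a fixed random_state the score is identical across runs, so any
    # change in score reflects a change in hyper-parameters, not chance.
    scores = [
        cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                        X, y, cv=3).mean()
        for _ in range(2)
    ]
    assert scores[0] == scores[1]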

I will take a look at the ensemble optimizers and implement your solution in the next few days. Thanks again for raising this concern.

mandjevant commented 2 years ago

Now that it's merged, I'll go ahead and close this. Thank you for raising the issue, and feel free to let us know if you find any other bugs or issues.