@betatim I would be more than happy if you could take this forward.
This will involve wrapping sklearn's DecisionTrees to return the standard deviation of each leaf. This is not difficult to do since we keep the training data: it is just a matter of using `DecisionTreeRegressor.apply` and deriving the statistics we need.
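A minimal sketch of what that wrapping could look like (the toy data and the `leaf_stds` mapping are illustrative, not scikit-optimize API):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data standing in for the optimizer's evaluated points.
X_train = np.random.uniform(-2, 2, size=(100, 1))
y_train = np.sin(X_train).ravel()

tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X_train, y_train)

# apply() returns the index of the leaf each sample ends up in, so we can
# group the training targets by leaf and precompute one std per leaf.
train_leaves = tree.apply(X_train)
leaf_stds = {leaf: y_train[train_leaves == leaf].std()
             for leaf in np.unique(train_leaves)}

# At prediction time, look up the std of the leaf each test point reaches.
X_test = np.random.uniform(-2, 2, size=(10, 1))
test_std = np.array([leaf_stds[leaf] for leaf in tree.apply(X_test)])
```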
Indeed. Do you think it is generic enough to be included in sklearn directly?
Maybe, but I would not put the cart before the horse ("la charrue avant les bœufs"), as we say in French :)
Just an FYI, knowledge gained from @glouppe: using `ExtraTreesRegressor` might solve part of the problem, because the splits are not limited to the training data.
This is with regard to having piecewise-constant predictions / smoother predictions that we can optimise with BFGS?
Yes, this helps having a smoother decision surface. But in the finite case, the resulting surface will remain piecewise constant... It is still a bit mysterious to me how to optimize this kind of thing. @MechCoder Have you figured out how they optimize the acquisition function in SMAC?
I just read the paper again and they don't use a gradient-based (or second-order) approach. They do a local search technique similar to a popular one called ParamILS (don't know if you have heard of it before). I'm opening a new issue for that.
The new issue discussing this local-search-based optimisation is #74.
For further reference to myself, see sections 2.2 and 2.3 of https://arxiv.org/pdf/1605.05537v2.pdf for approximating the variance.
One step closer thanks to #89.
What is the current thinking on how to build a RF-based optimiser? My feeling is that the procedure used by SMAC is a bit convoluted and ad hoc. I would start investigating from the weighted mean: potentially using the variance to combine the per-tree predictions (instead of just using the mean), and definitely the normal formula for combining the variances to obtain the overall variance.
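For concreteness, the "normal formula" here would presumably be the law of total variance, assuming each tree t = 1, ..., T returns a mean and a variance at the leaf reached by x:

```latex
\mu(x) = \frac{1}{T} \sum_{t=1}^{T} \mu_t(x),
\qquad
\sigma^2(x) = \frac{1}{T} \sum_{t=1}^{T} \left( \sigma_t^2(x) + \mu_t^2(x) \right) - \mu(x)^2
```

The variance term accounts both for the average within-leaf variance and for the disagreement between the trees.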
For coordination purposes: @glouppe wanted to start working on this. So he is in charge 😀
Yes, I am currently exploring some more how to derive the variance for forests. PR soon to follow (today or later this week).
I did some exploration at https://github.com/glouppe/notebooks/blob/master/Variance%20from%20tree-based%20models.ipynb
I was not convinced that the variance decomposition used in SMAC was correct, but I was wrong. It is indeed equivalent to the variance computed over the training samples gathered at the leaves where the test samples arrive (i.e. `std_v1 == std_v3` in the notebook). Also, using the variance of the predictions (`std_v2`) is clearly wrong, as previously discussed.
At the end of the day, I think optimizing with `ExtraTreesRegressor(n_estimators=100, min_samples_leaf=5)` should yield an accurate estimate of p(y|x). Random forests are really too noisy.
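To make the distinction concrete, here is a rough sketch of the two estimates on toy data (my reading of the notebook, not a verified reproduction; `std_v3` implements the total-variance formula quoted earlier):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

X = np.random.uniform(-2, 2, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * np.random.randn(200)
X_test = np.random.uniform(-2, 2, size=(5, 1))

et = ExtraTreesRegressor(n_estimators=100, min_samples_leaf=5).fit(X, y)

# Per-tree mean/std of the training targets in the leaf each test point reaches.
mus, sigmas = [], []
for tree in et.estimators_:
    train_leaves = tree.apply(X)
    test_leaves = tree.apply(X_test)
    mus.append([y[train_leaves == leaf].mean() for leaf in test_leaves])
    sigmas.append([y[train_leaves == leaf].std() for leaf in test_leaves])
mus, sigmas = np.array(mus), np.array(sigmas)

std_v2 = mus.std(axis=0)  # spread of the tree means only -- drops within-leaf variance
std_v3 = np.sqrt((sigmas ** 2 + mus ** 2).mean(axis=0)
                 - mus.mean(axis=0) ** 2)  # law of total variance
```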
Adding RF-based optimization should now be trivial, except maybe for the optimization of the acquisition. @betatim Want to have a stab at it?
We could at least implement the same random sampling approach as for `gbrt_optimize`.
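A minimal sketch of that random-sampling approach, assuming an acquisition callable `acq` built from the model (names and signature are illustrative, not the `gbrt_optimize` internals):

```python
import numpy as np

def suggest_by_random_sampling(acq, bounds, n_candidates=10000, random_state=0):
    """Evaluate the acquisition on uniform random candidates, return the best.

    acq    : callable mapping an (n, d) array to n acquisition values
             (here assumed: lower is better)
    bounds : list of (low, high) pairs, one per dimension
    """
    rng = np.random.RandomState(random_state)
    lows, highs = map(np.asarray, zip(*bounds))
    candidates = rng.uniform(lows, highs, size=(n_candidates, len(bounds)))
    return candidates[np.argmin(acq(candidates))]
```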
Nods. I think we should even be able to rename `gbrt_optimize` -> `_tree_optimize` and then have `gbrt_opt` and `rf_opt` as public APIs that pass a different `base_estimator`. Or do you think we should keep them separate?
Yes, might be worth factorizing those together, since we now basically support GBRT-, DT-, RF- and ET-based optimization. Definitely not worth duplicating the code 4 times.
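Roughly what that factoring could look like, as a sketch (function names and signatures are illustrative, not the actual API; the shared loop is elided):

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

def _tree_optimize(func, bounds, base_estimator, n_calls=100):
    # Shared loop: fit base_estimator on the evaluated points, optimise the
    # acquisition (e.g. by random sampling as above), evaluate func, repeat.
    ...

def gbrt_optimize(func, bounds, **kwargs):
    return _tree_optimize(func, bounds,
                          base_estimator=GradientBoostingRegressor(), **kwargs)

def rf_optimize(func, bounds, **kwargs):
    return _tree_optimize(func, bounds,
                          base_estimator=RandomForestRegressor(), **kwargs)
```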
Good thing Python now supports unicode variables; the ET-based one needs to be called `👽_optimize()`
😀
Yay!
For future archeologists: #91 is the PR that implemented some of what was missing. Now the tuning starts!
The formula for the variance computed by each RandomForest is given in section 4.3.2 of http://arxiv.org/pdf/1211.0906v2.pdf (this will involve wrapping sklearn's DecisionTrees to return the standard deviation of each leaf).
The expected improvement makes the same assumption about the predictions being Gaussian, except there is a minor modification given in section 5.2 of https://www.cs.ubc.ca/~murphyk/Papers/gecco09.pdf
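For reference, the standard closed-form EI under that Gaussian assumption, for minimisation with incumbent best value y*, model mean mu(x) and std sigma(x) (the section 5.2 modification itself is not reproduced here):

```latex
z(x) = \frac{y^* - \mu(x)}{\sigma(x)},
\qquad
\mathrm{EI}(x) = \sigma(x)\,\bigl( z(x)\,\Phi(z(x)) + \phi(z(x)) \bigr)
```

where Phi and phi are the standard normal CDF and PDF.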
There is also a change from sklearn's RF implementation in how the split point is computed, described in section 4.3.2 of http://arxiv.org/pdf/1211.0906v2.pdf, but we can try without that modification.