@betatim I would be more than happy if you could take this forward.
This will involve wrapping sklearn's DecisionTrees to return the standard deviation of each leaf. This is not difficult to do since we keep the training data: it is just a matter of using `DecisionTreeRegressor.apply` and deriving the statistics we need.
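A minimal sketch of what that wrapping could look like (the toy data and the `leaf_stds` mapping are illustrative, not scikit-optimize API):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data standing in for the optimizer's evaluated points.
X_train = np.random.uniform(-2, 2, size=(100, 1))
y_train = np.sin(X_train).ravel()

tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X_train, y_train)

# apply() returns the index of the leaf each sample ends up in, so we can
# group the training targets by leaf and precompute one std per leaf.
train_leaves = tree.apply(X_train)
leaf_stds = {leaf: y_train[train_leaves == leaf].std()
             for leaf in np.unique(train_leaves)}

# At prediction time, look up the std of the leaf each test point reaches.
X_test = np.random.uniform(-2, 2, size=(10, 1))
test_std = np.array([leaf_stds[leaf] for leaf in tree.apply(X_test)])
```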
Indeed. Do you think it is generic enough to be included in sklearn directly?
Maybe, but I would not put the cart before the horse ("la charrue avant les bœufs"), as we say in French :)
Just an FYI, knowledge gained from @glouppe: using `ExtraTreesRegressor` might solve part of the problem, because the splits are not limited to the training data.
This is with regard to having piecewise-constant predictions / smoother predictions that we can optimise with BFGS?
Yes, this helps having a smoother decision surface. But in the finite case, the resulting surface will remain piecewise constant... It is still a bit mysterious to me how to optimize this kind of thing. @MechCoder Have you figured out how they optimize the acquisition function in SMAC?
I just read the paper again and they don't use a gradient-based (or second-order) approach. They do a local search technique similar to a popular one called ParamILS (don't know if you have heard of it before). I'm opening a new issue for that.
The new issue discussing this local-search-based optimisation is #74.
For further reference to myself, see sections 2.2 and 2.3 of https://arxiv.org/pdf/1605.05537v2.pdf for approximating the variance.
One step closer thanks to #89.
What is the current thinking on how to build a RF-based optimiser? My feeling is that the procedure used by SMAC is a bit convoluted and ad hoc. I would start investigating from the weighted mean: potentially using the variance to combine the per-tree predictions (instead of just using the mean), and definitely the normal formula for combining the variances to obtain the overall variance.
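For concreteness, the "normal formula" here would presumably be the law of total variance, assuming each tree t = 1, ..., T returns a mean and a variance at the leaf reached by x:

```latex
\mu(x) = \frac{1}{T} \sum_{t=1}^{T} \mu_t(x),
\qquad
\sigma^2(x) = \frac{1}{T} \sum_{t=1}^{T} \left( \sigma_t^2(x) + \mu_t^2(x) \right) - \mu(x)^2
```

The variance term accounts both for the average within-leaf variance and for the disagreement between the trees.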
For coordination purposes: @glouppe wanted to start working on this. So he is in charge 😀
Yes, I am currently exploring some more how to derive the variance for forests. PR soon to follow (today or later this week).
I did some exploration at https://github.com/glouppe/notebooks/blob/master/Variance%20from%20tree-based%20models.ipynb
I was not convinced that the variance decomposition used in SMAC was correct, but I was wrong. It is indeed equivalent to the variance computed over the training samples gathered at the leaves where the test samples arrive (i.e. `std_v1 == std_v3` in the notebook). Also, using the variance of the predictions (`std_v2`) is clearly wrong, as previously discussed.
At the end of the day, I think optimizing with `ExtraTreesRegressor(n_estimators=100, min_samples_leaf=5)` should yield an accurate estimate of p(y|x). Random forests are really too noisy.
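To make the distinction concrete, here is a rough sketch of the two estimates on toy data (my reading of the notebook, not a verified reproduction; `std_v3` implements the total-variance formula quoted earlier):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

X = np.random.uniform(-2, 2, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * np.random.randn(200)
X_test = np.random.uniform(-2, 2, size=(5, 1))

et = ExtraTreesRegressor(n_estimators=100, min_samples_leaf=5).fit(X, y)

# Per-tree mean/std of the training targets in the leaf each test point reaches.
mus, sigmas = [], []
for tree in et.estimators_:
    train_leaves = tree.apply(X)
    test_leaves = tree.apply(X_test)
    mus.append([y[train_leaves == leaf].mean() for leaf in test_leaves])
    sigmas.append([y[train_leaves == leaf].std() for leaf in test_leaves])
mus, sigmas = np.array(mus), np.array(sigmas)

std_v2 = mus.std(axis=0)  # spread of the tree means only -- drops within-leaf variance
std_v3 = np.sqrt((sigmas ** 2 + mus ** 2).mean(axis=0)
                 - mus.mean(axis=0) ** 2)  # law of total variance
```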
Adding RF-based optimization should now be trivial, except maybe for the optimization of the acquisition. @betatim Want to have a stab at it?
We could at least implement the same random sampling approach as for `gbrt_optimize`.
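A minimal sketch of that random-sampling approach, assuming an acquisition callable `acq` built from the model (names and signature are illustrative, not the `gbrt_optimize` internals):

```python
import numpy as np

def suggest_by_random_sampling(acq, bounds, n_candidates=10000, random_state=0):
    """Evaluate the acquisition on uniform random candidates, return the best.

    acq    : callable mapping an (n, d) array to n acquisition values
             (here assumed: lower is better)
    bounds : list of (low, high) pairs, one per dimension
    """
    rng = np.random.RandomState(random_state)
    lows, highs = map(np.asarray, zip(*bounds))
    candidates = rng.uniform(lows, highs, size=(n_candidates, len(bounds)))
    return candidates[np.argmin(acq(candidates))]
```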
Nods. I think we should even be able to rename `gbrt_optimize` -> `_tree_optimize` and then have `gbrt_opt` and `rf_opt` as public APIs that pass a different `base_estimator`. Or do you think we should keep them separate?
Yes, might be worth factorizing those together, since we now basically support GBRT-, DT-, RF- and ET-based optimization. Definitely not worth duplicating the code 4 times.
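Roughly what that factoring could look like, as a sketch (function names and signatures are illustrative, not the actual API; the shared loop is elided):

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

def _tree_optimize(func, bounds, base_estimator, n_calls=100):
    # Shared loop: fit base_estimator on the evaluated points, optimise the
    # acquisition (e.g. by random sampling as above), evaluate func, repeat.
    ...

def gbrt_optimize(func, bounds, **kwargs):
    return _tree_optimize(func, bounds,
                          base_estimator=GradientBoostingRegressor(), **kwargs)

def rf_optimize(func, bounds, **kwargs):
    return _tree_optimize(func, bounds,
                          base_estimator=RandomForestRegressor(), **kwargs)
```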
Good thing Python now supports unicode variables; the ET-based one needs to be called `👽_optimize()`
😀
Yay!
For future archeologists: #91 is the PR that implemented some of what was missing. Now the tuning starts!
The formula for the variance computed by each RandomForest is given in section 4.3.2 of http://arxiv.org/pdf/1211.0906v2.pdf (this will involve wrapping sklearn's DecisionTrees to return the standard deviation of each leaf).
The expected improvement makes the same assumption about the predictions being Gaussian, except there is a minor modification given in section 5.2 of https://www.cs.ubc.ca/~murphyk/Papers/gecco09.pdf
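For reference, the standard closed-form EI under that Gaussian assumption, for minimisation with incumbent best value y*, model mean mu(x) and std sigma(x) (the section 5.2 modification itself is not reproduced here):

```latex
z(x) = \frac{y^* - \mu(x)}{\sigma(x)},
\qquad
\mathrm{EI}(x) = \sigma(x)\,\bigl( z(x)\,\Phi(z(x)) + \phi(z(x)) \bigr)
```

where Phi and phi are the standard normal CDF and PDF.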
There is also a change from sklearn's RF implementation in how the split point is computed, described in section 4.3.2 of http://arxiv.org/pdf/1211.0906v2.pdf, but we can try without that modification.