Closed Autodidact24 closed 5 years ago
I have the same problem. It would be nice to have partial dependence plots for random forests or extra trees.
Is there any particular reason this was implemented only for gradient boosted models? Seems like partial dependence plots should be a very general idea.
Not really, I think. We should probably add a more general helper.
I'm running into the same problem. It would be nice to have partial dependence plots for random forests (or any other classifier).
I'll work on this one.
+1, this would be very valuable. Any update?
I expect to be pushing a WIP PR this weekend.
The partial dependence can be computed efficiently for GBMs, but estimating the partial dependence for a generic model (by actually simulating predictions) would be slow without some sampling. Is the suggestion here to use some efficient method to calculate the PDP on random forests and decision trees, or to implement a "naive" method to estimate the PDP for any model?
The latter is slow but has the added advantage of being able to accommodate a pipeline, too.
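For reference, the "naive" method discussed above can be sketched in a few lines. This is a minimal, dependency-free illustration and not code from scikit-learn or any PR: the function name `brute_force_partial_dependence` and the `toy_predict` model are hypothetical, and `predict` stands for any callable that maps a list of rows to a list of numbers (a fitted pipeline's predict method would fit that shape).

```python
from typing import Callable, List, Sequence


def brute_force_partial_dependence(
    predict: Callable[[List[List[float]]], List[float]],
    X: Sequence[Sequence[float]],
    feature: int,
    grid: Sequence[float],
) -> List[float]:
    """Average prediction over X with one feature forced to each grid value."""
    averages = []
    for value in grid:
        # Copy every row, overwriting only the feature of interest.
        modified = [list(row) for row in X]
        for row in modified:
            row[feature] = value
        preds = predict(modified)
        averages.append(sum(preds) / len(preds))
    return averages


# Hypothetical stand-in for a fitted model's predict method.
def toy_predict(rows):
    return [2.0 * r[0] + r[1] ** 2 for r in rows]


X = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
pd_curve = brute_force_partial_dependence(toy_predict, X, feature=0, grid=[0.0, 1.0])
```

The cost is one prediction per (grid value, sample) pair, which is why this gets slow on large datasets or 2D grids unless the rows are subsampled.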
@pauljacksonrodgers Yes, the partial plots on a generic model can take some time on 2D plots, but I should have a WIP PR up tomorrow that helps with an "estimated" option that runs rather quickly on any model. Almost there! A bit more involved than I originally expected, but one week late in open source is essentially really, really, really, ridiculously early :-)
Current %timeit of the implementation on 1D for a GBM on the breast_cancer binary classification dataset (~500 obs):
recursion: 1000 loops, best of 3: 983 µs per loop
exact: 100 loops, best of 3: 4.98 ms per loop
estimated: 1000 loops, best of 3: 567 µs per loop
and on 2D:
recursion: 1000 loops, best of 3: 1.71 ms per loop
exact: 10 loops, best of 3: 24.9 ms per loop
estimated: 1000 loops, best of 3: 1.2 ms per loop
On the boston dataset for 1D regression (similar shape):
recursion: 1000 loops, best of 3: 957 µs per loop
exact: 100 loops, best of 3: 3.23 ms per loop
estimated: 1000 loops, best of 3: 483 µs per loop
and for 2D:
recursion: 1000 loops, best of 3: 1.64 ms per loop
exact: 100 loops, best of 3: 15.5 ms per loop
estimated: 1000 loops, best of 3: 996 µs per loop
where "exact" is the predict_proba call you refer to as potentially slow (it is), "recursion" is the current implementation, and "estimated" is a little magic using dataset means. Stay tuned.
(where 1D/2D refers to the number of variables represented by the pdplot)
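The thread does not spell out what the "estimated" option does beyond "a little magic using dataset means", so the following is only a guess at the general idea, not the PR's code: hold every other feature at its dataset mean and predict once per grid value. The names `mean_based_partial_dependence` and `toy_predict` are hypothetical.

```python
def mean_based_partial_dependence(predict, X, feature, grid):
    """Predict on one synthetic row per grid value, other features at their means."""
    n_features = len(X[0])
    # Column means of the dataset (assumed approach, see lead-in).
    means = [sum(row[j] for row in X) / len(X) for j in range(n_features)]
    rows = []
    for value in grid:
        row = list(means)
        row[feature] = value
        rows.append(row)
    return predict(rows)


# Hypothetical stand-in for a fitted model's predict method.
def toy_predict(rows):
    return [2.0 * r[0] + r[1] ** 2 for r in rows]


X = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
estimate = mean_based_partial_dependence(toy_predict, X, feature=0, grid=[0.0, 1.0])
```

Under this assumption the model sees only as many rows as there are grid values, which would be consistent with "estimated" being much faster than "exact" in the timings. The trade-off is that, for a model that is nonlinear in the other features, the result is an approximation of the averaged predictions rather than the same quantity.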
Any status updates on this?
I think a partial dependency plot should be a general function available for every class with a predict function.
Yes please, partial dependence plots for random forests would be much appreciated
@brityboy your comments may be welcome at #5653. @trevorstephens has put in a great effort there, and it's probably something you can play with now, but it's going to take more work in code, documentation and review.
Is there any update on this? I just ran into the same issue of "gbrt has to be an instance of BaseGradientBoosting" while trying to use xgboost. I might be missing something, but why would it matter what classifier is being used?
Same issue here.
@lucyhuo @darthdeus the current implementation in sklearn only works for sklearn estimators. We are working on a general implementation at #5653.
Hi @amueller, I just want to clarify the current state. I am using 0.19.1-2 in Python 3.5. I hit the check "if not isinstance(gbrt, BaseGradientBoosting):" on line 123 of sklearn\ensemble\partial_dependence.py when using BaggingClassifier, which is definitely a sklearn estimator. So the current implementation seems to only work for BaseGradientBoosting, not all sklearn estimators. Does #5653 intend to expand this to all estimators or all sklearn estimators? Is there a release plan?
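To illustrate the difference between a type check and the "anything with a predict function" idea raised earlier in the thread, here is a small duck-typing sketch. It is not how scikit-learn or #5653 validates estimators: `get_prediction_function` and the toy classes are hypothetical, and preferring predict_proba over predict is an assumption made for the example.

```python
def get_prediction_function(estimator):
    """Return a prediction method based on what the object offers, not its class."""
    # Assumed preference order: probabilities first, plain predictions second.
    for name in ("predict_proba", "predict"):
        method = getattr(estimator, name, None)
        if callable(method):
            return method
    raise TypeError(
        f"{type(estimator).__name__} has neither predict_proba nor predict"
    )


class ToyRegressor:
    """Hypothetical model exposing only predict."""

    def predict(self, rows):
        return [sum(r) for r in rows]


class ToyClassifier:
    """Hypothetical model exposing both predict and predict_proba."""

    def predict(self, rows):
        return [int(sum(r) > 0) for r in rows]

    def predict_proba(self, rows):
        return [[0.5, 0.5] for _ in rows]


regressor_fn = get_prediction_function(ToyRegressor())
classifier_fn = get_prediction_function(ToyClassifier())
```

A check like this would accept a BaggingClassifier, an xgboost wrapper, or a pipeline alike, as long as the object exposes one of the expected methods.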
Hi Keith, I think it would be great if we could get #5653 merged. I think mostly what we need there is a big reviewing effort.
Does scikit-learn have any capacity for partial dependence plots and associated data arrays for random forest analyses? I can find the plot for GradientBoostingRegressor here http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html.
Doing the same for RF outputs: