scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Partial Dependence Plots for Random Forests. #4405

Closed Autodidact24 closed 5 years ago

Autodidact24 commented 9 years ago

Does scikit-learn have any capacity for partial dependence plots and associated data arrays for random forest analyses? I can find the plot for GradientBoostingRegressor here http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html.

Doing the same with a random forest raises:

File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/partial_dependence.py", line 239, in plot_partial_dependence
    raise ValueError('gbrt has to be an instance of BaseGradientBoosting')
ValueError: gbrt has to be an instance of BaseGradientBoosting

DonBeo commented 9 years ago

I have the same problem. It would be nice to have partial dependence plots for random forests or extra trees.

dchudz commented 9 years ago

Is there any particular reason this was implemented only for gradient boosted models? Seems like partial dependence plots should be a very general idea.

amueller commented 9 years ago

Not really, I think. We should probably add a more general helper.

olivermueller commented 8 years ago

I'm running into the same problem. It would be nice to have partial dependence plots for random forests (or any other classifier).

trevorstephens commented 8 years ago

I'll work on this one.

sniemi commented 8 years ago

+1, this would be very valuable. Any update?

trevorstephens commented 8 years ago

I expect to be pushing a WIP PR this weekend.

pauljacksonrodgers commented 8 years ago

The partial dependence can be computed efficiently for GBMs, but estimating it for a generic model (by actually simulating predictions) would be slow without some sampling. Is the suggestion here to use a specialized method to calculate the PDP efficiently on random forests and decision trees, or to implement a "naive" method to estimate the PDP for any model?

The latter is slow but has the added advantage of being able to accommodate a pipeline, too.
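For readers unfamiliar with the "naive" approach, here is a minimal sketch of what it usually means: force the feature of interest to each grid value in every row, predict, and average. The function name `brute_force_partial_dependence` and the toy model are hypothetical illustrations, not scikit-learn API. Because it only needs a `predict`-style callable, the same idea would apply to a pipeline.

```python
from statistics import mean


def brute_force_partial_dependence(predict, rows, feature, grid):
    """Average prediction over the data with column `feature` forced to each grid value.

    `predict` is any callable mapping a list of rows to a list of numbers
    (hypothetical stand-in for a fitted model's predict method).
    """
    curve = []
    for value in grid:
        # Copy every row, overwriting only the feature of interest.
        modified = [row[:feature] + [value] + row[feature + 1:] for row in rows]
        # One prediction per row, averaged: cost is len(grid) * len(rows) predictions.
        curve.append(mean(predict(modified)))
    return curve


def toy_predict(rows):
    # Toy model used only for illustration: y = 2 * x0 + x1.
    return [2.0 * r[0] + r[1] for r in rows]


rows = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
grid = [0.0, 1.0, 2.0]
pd_curve = brute_force_partial_dependence(toy_predict, rows, 0, grid)
```

The number of predictions grows with the grid size times the dataset size, which is the slowness described above; subsampling `rows` is one way to trade accuracy for speed.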

trevorstephens commented 8 years ago

@pauljacksonrodgers Yes, partial dependence plots on a generic model can take some time for 2D plots, but I should have a WIP PR up tomorrow with an "estimated" option that runs rather quickly on any model. Almost there! It's a bit more involved than I originally expected, but one week late in open source is essentially really, really, really ridiculously early :-)

Current %timeit results for 1D with a GBM on the breast_cancer binary classification dataset (~500 obs):

recursion
1000 loops, best of 3: 983 µs per loop
exact
100 loops, best of 3: 4.98 ms per loop
estimated
1000 loops, best of 3: 567 µs per loop

and on 2D:

recursion
1000 loops, best of 3: 1.71 ms per loop
exact
10 loops, best of 3: 24.9 ms per loop
estimated
1000 loops, best of 3: 1.2 ms per loop

On the boston dataset for 1D regression (similar shape):

recursion
1000 loops, best of 3: 957 µs per loop
exact
100 loops, best of 3: 3.23 ms per loop
estimated
1000 loops, best of 3: 483 µs per loop

and for 2D:

recursion
1000 loops, best of 3: 1.64 ms per loop
exact
100 loops, best of 3: 15.5 ms per loop
estimated
1000 loops, best of 3: 996 µs per loop

where "exact" is the predict_proba call you refer to as potentially slow (it is), "recursion" is the current implementation, and "estimated" is a little magic using dataset means. Stay tuned.

(where 1D/2D refers to the number of variables represented by the pdplot)
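The comment does not spell out how the "estimated" option works beyond "using dataset means". One plausible reading, shown here purely as an assumption, is to predict on a single synthetic row per grid value with every other feature fixed at its column mean, instead of averaging predictions over all rows. The names `exact_pd` and `estimated_pd` are hypothetical and do not come from the PR.

```python
from statistics import mean


def exact_pd(predict, rows, feature, grid):
    # One prediction per (grid value, row) pair, averaged over rows.
    return [
        mean(predict([r[:feature] + [v] + r[feature + 1:] for r in rows]))
        for v in grid
    ]


def estimated_pd(predict, rows, feature, grid):
    # Assumed shortcut: one synthetic row per grid value, with every other
    # feature fixed at its column mean, so only len(grid) predictions are needed.
    means = [mean(col) for col in zip(*rows)]
    synthetic = [means[:feature] + [v] + means[feature + 1:] for v in grid]
    return predict(synthetic)


def nonlinear_predict(rows):
    # Toy model with an interaction term, used only for illustration.
    return [r[0] * r[1] ** 2 for r in rows]


rows = [[1.0, 0.0], [1.0, 2.0]]
grid = [1.0, 2.0]
exact = exact_pd(nonlinear_predict, rows, 0, grid)
estimated = estimated_pd(nonlinear_predict, rows, 0, grid)
```

Under this reading the shortcut matches the exact result for models that are linear in the other features, but it can diverge when features interact, as the toy model here shows (`exact` and `estimated` differ).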

thommiano commented 7 years ago

Any status updates on this?

DonBeo commented 7 years ago

I think partial dependence plots should be a general function available for every class with a predict function.

brityboy commented 7 years ago

Yes please, partial dependence plots for random forests would be much appreciated.

jnothman commented 7 years ago

@brityboy your comments may be welcome at #5653. @trevorstephens has put in a great effort there, and it's probably something you can play with now, but it's going to take more work in code, documentation and review.

darthdeus commented 6 years ago

Is there any update on this? I just ran into the same issue of

gbrt has to be an instance of BaseGradientBoosting

while trying to use xgboost. I might be missing something, but why would it matter what classifier is being used?

yanghe-huo commented 6 years ago

Same issue here.

amueller commented 6 years ago

@lucyhuo @darthdeus The current implementation in sklearn only works for sklearn estimators. We are working on a general implementation at #5653.

Gitman-code commented 6 years ago

Hi @amueller, I just want to clarify the current state. I am using 0.19.1-2 in Python 3.5. I hit the check

if not isinstance(gbrt, BaseGradientBoosting):

on line 123 of sklearn\ensemble\partial_dependence.py when using BaggingClassifier, which is definitely an sklearn estimator. So the current implementation seems to work only for BaseGradientBoosting, not all sklearn estimators. Does #5653 intend to expand this to all estimators, or only to all sklearn estimators? Is there a release plan?

jnothman commented 6 years ago

Hi Keith, I think it would be great if we could get #5653 merged. I think mostly what we need there is a big reviewing effort.