scikit-learn-contrib / py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
http://contrib.scikit-learn.org/py-earth/
BSD 3-Clause "New" or "Revised" License

Support feature importance #127

Closed mehdidc closed 8 years ago

mehdidc commented 8 years ago

http://i.imgur.com/AJ1wlo8.png

So what do you think of this API? `feature_importance_type` can be either a string or a list of strings to specify several criteria. After the importances are calculated, they are available in `feature_importances_`, which is an array (as in scikit-learn) if a single criterion is requested; otherwise it is a dict whose keys are criterion names and whose values are arrays.
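To make the proposed convention concrete, here is a minimal sketch in plain Python/NumPy (the helper `package_importances` is invented for illustration, not py-earth code; the criterion names `'rss'`, `'gcv'`, and `'nb_subsets'` are taken from the example above):

```python
import numpy as np

def package_importances(importances_by_criterion, requested):
    """Mimic the proposed convention: return a bare array when a single
    criterion (a string) is requested, and a dict of arrays when a list
    of criteria is requested. Hypothetical helper for illustration."""
    if isinstance(requested, str):
        return importances_by_criterion[requested]
    return {name: importances_by_criterion[name] for name in requested}

# Toy importances for three features under three criteria
raw = {
    'rss': np.array([0.7, 0.2, 0.1]),
    'gcv': np.array([0.6, 0.3, 0.1]),
    'nb_subsets': np.array([0.5, 0.5, 0.0]),
}

single = package_importances(raw, 'gcv')            # plain array, sklearn-style
several = package_importances(raw, ['rss', 'gcv'])  # dict of arrays
```

The single-criterion case stays compatible with sklearn tools that expect an array-valued `feature_importances_`; only the multi-criterion case departs from that convention.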

Also, if pruning is disabled it raises an exception, because the computation of importances is done in the pruning pass, but I think we could untie them. What do you think?

jcrudy commented 8 years ago

@mehdidc I think this looks fantastic! I tried changing your example to use smoothing and missing data and there were no problems.

Regarding the API, I think it makes a lot of sense given the conventions of sklearn which we are trying to conform to as well as possible. Have you talked to any other sklearn people about the use of dictionaries to store the different importance vectors? I think it's fine and probably they have no precedent for it, but I wonder if there is some meta-estimator in sklearn that will break on this. Even if so, I think it's okay for now and users can just choose which one they want or do a little hacking to make it work.

As far as having feature importance without pruning, this could be done by modifying the pruning pass to select the zeroth iteration (the one with all terms still in the model). It's a bit hackish, but I think this would be accomplished by choosing penalty=0 (assuming no numerical issues). In any case, I think it's totally reasonable to require pruning to have feature importance for now and for 0.1, unless you are motivated to untie them before 0.1.
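To illustrate why penalty=0 biases the pruning pass toward keeping the full model, here is a sketch of the generalized cross-validation criterion that pruning minimizes. The effective-parameter formula `c = n_terms + penalty * (n_terms - 1) / 2` follows Friedman's MARS formulation; treat the exact bookkeeping as an assumption, since py-earth's internals may differ slightly:

```python
def gcv(rss, n_samples, n_terms, penalty):
    """Generalized cross-validation score used to rank pruning subsets.
    The effective-parameter count below is an assumption based on
    Friedman's MARS formulation, not py-earth's exact implementation."""
    c = n_terms + penalty * (n_terms - 1) / 2.0
    return (rss / n_samples) / (1.0 - c / n_samples) ** 2

# With a typical penalty (e.g. 3), each extra term is charged additional
# effective parameters, so a smaller subset can win despite a higher RSS.
# With penalty=0 a term only costs its own degree of freedom, so the full
# model (iteration zero of the pruning pass) is much more likely to win.
full_model = gcv(rss=10.0, n_samples=100, n_terms=10, penalty=0.0)
pruned_model = gcv(rss=12.0, n_samples=100, n_terms=5, penalty=0.0)
```

On these toy numbers the full model has the lower GCV at penalty=0, while at penalty=3 the ranking flips in favor of the pruned model, which is the behavior the trick relies on.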

I like your example. It's encouraging that the results for rss and gcv match the random forest results so closely. It's a little surprising that the nb_subsets results are so different, but it makes sense that they would be if you think about how they're calculated. I wonder what would be the motivation for using nb_subsets over gcv or rss? Perhaps if you don't believe the scale of your response variable is meaningful?
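For reference, nb_subsets can be read as a pure count: for each feature, how many of the candidate subsets considered during the pruning pass contain a term involving that feature. Because it never touches the response values, it is scale-free, which matches the motivation suggested above. A sketch of that calculation (the subset representation here is invented for illustration; py-earth tracks this internally during pruning):

```python
from collections import Counter

def nb_subsets_importance(subsets, n_features):
    """Count, per feature, how many pruning-pass subsets use it.
    `subsets` is a list of sets of feature indices, one per pruning
    iteration -- a made-up representation for this sketch."""
    counts = Counter()
    for subset in subsets:
        for feature in subset:
            counts[feature] += 1
    return [counts.get(f, 0) for f in range(n_features)]

# Three pruning iterations over a 3-feature problem: feature 0 survives
# every subset, feature 2 only appears in the full model.
subsets = [{0, 1, 2}, {0, 1}, {0}]
importance = nb_subsets_importance(subsets, 3)  # [3, 2, 1]
```

Unlike rss or gcv, this says nothing about how much error a feature removes, only how long its terms survive pruning, so large disagreements with the other criteria are expected.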

I think this is a very useful feature. I actually would have liked to have had this for a project I did last week, and I'll probably go back and redo it with this. I'm going to merge it into master now, and if you find you want to change anything later we can still do so.