Classification task: how to extract feature importance per class ?

gse-cc-git commented 6 years ago

Hello, I know how to get an idea of feature importance for a global classification task, but I would like to know how to get the same kind of result that R::randomForest gives: an importance ranking per class. Thank you.

suiji commented 6 years ago

We don't build the importance matrix right now, but there is no reason it couldn't be done. The matrix has the importance scores, broken down by category in the case of classification.

Are you asking how to derive the matrix yourself using the current release? It can be done.

gse-cc-git commented 6 years ago

@suiji Hello. Yes, I am trying to understand how I could get that matrix (using version 0.1-9). Thanks.

suiji commented 6 years ago

At the risk of repeating what you may already know, this style of importance uses a procedure almost identical to verification. That is, out-of-bag prediction is performed over the training observations. The difference is that, for each predictor, a separate such prediction is performed in which that predictor's observations are permuted. The predictor's importance in this case is measured by the reduction in predictive accuracy incurred by shuffling its entries: permuting the observations of an influential predictor should be highly deleterious for predictive accuracy. A single value, such as MSE, typically gauges predictive accuracy for regression, yielding a vector of per-predictor importances. For classification, each class presents its own misclassification rate, whence the matrix. You should be able to compute this using the Arborist, but would need to gather several pieces of the trained forest, such as the "bag". It might be more helpful if the Verification() utility were changed to accept the single-column-permuted matrix as an argument, without your having to wire all these things up yourself.
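To make the procedure concrete, here is a minimal sketch in plain Python rather than R. It is not the Arborist's implementation: `classwise_error`, `permutation_importance_by_class` and `toy_predict` are hypothetical names introduced for illustration, and `predict` stands in for whatever out-of-bag prediction routine is available. It builds the per-predictor, per-class matrix as the increase in each class's misclassification rate after one column is shuffled.

```python
import random
from collections import defaultdict


def classwise_error(y_true, y_pred):
    """Misclassification rate computed separately for each true class."""
    total = defaultdict(int)
    wrong = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t != p:
            wrong[t] += 1
    return {c: wrong[c] / total[c] for c in total}


def permutation_importance_by_class(predict, X, y, seed=0):
    """Return {predictor index: {class: increase in misclassification rate}}.

    `predict` is assumed to map a list of rows to a list of class labels.
    """
    rng = random.Random(seed)
    baseline = classwise_error(y, predict(X))  # reference scores, no permutation
    importance = {}
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)  # permute only predictor j
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        permuted = classwise_error(y, predict(X_perm))
        importance[j] = {c: permuted[c] - baseline[c] for c in baseline}
    return importance


# Toy stand-in for a trained forest: only predictor 0 influences the label.
def toy_predict(X):
    return ["a" if row[0] < 0.5 else "b" for row in X]


X = [[i / 20, ((i * 7) % 20) / 20] for i in range(20)]
y = toy_predict(X)
imp = permutation_importance_by_class(toy_predict, X, y)
```

In this toy setup the row for predictor 1 is zero for every class, because the stand-in model ignores it, while shuffling predictor 0 can only raise the per-class error rates.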

boro2013 commented 6 years ago

I'm interested in how to get the permutation importance (Mean Decrease in Accuracy) for all the predictors used to train the Rborist model. I've looked through the package but didn't find any function for retrieving the importance matrix. Do you have any idea how to get it?

Thanks in advance

suiji commented 6 years ago

Apologies for the delay; it has been a hectic week.

The permutation importance is not yet offered. It can be constructed in the manner outlined by the response to @gse-cc-git that precedes your question. Feel free to reply if the outline is not sufficiently clear.

Given recent interest, perhaps the time has come to include the feature. Someone requested it a couple of years ago, but there was no response when I requested a commitment to assist in testing.

gse-cc-git commented 6 years ago

Hi, am I on the right track if I intend to use the Rborist::Validate function to get the Mean Decrease in Accuracy like this:

1. Train a forest (without validation).
2. Generate one dataset in which feature i is randomized.
3. Get the classwise misprediction rate from the output object of the Validation procedure using the forest.
4. Repeat for each other feature.

?

Thanks

suiji commented 6 years ago

That's basically it, but with a few caveats:

You probably want to train with validation in step 1. That way you will have reference scores against which to gauge each predictor's score under permutation.
In each iteration of the loop embodied by steps 2-4 you will want to call a function very similar to Verification(), but not precisely this function. The problem is that Validate() is hard-coded to employ the training set. What you want is a function, say, OOBPredict(), which accepts any conforming set of observations as an argument. Thus, for example, Validate() would ultimately just become a wrapper invoking OOBPredict() on the training observations.

I fully intend to provide such an OOBPredict() function. It should go hand-in-hand with the cleanup of the front-end bridge currently underway. This is a badly-needed refactoring that should make it much easier to achieve some of the following goals:

Extension to additional front ends, such as NumPy.
Distributed training and prediction on compute clusters.
Support for other decision-tree utilities.
Long-term maintainability.

I can assist with an extemporaneous version of OOBPredict() in the meantime, but would prefer not to branch a temporary, ad-hoc version of the package.

gse-cc-git commented 6 years ago

Thanks! I'll have to work on it. One thing, though: what is the difference between Validation and what you call Verification()?

suiji commented 6 years ago

The earlier reply has been changed to say "validate". Thank you for catching that.

suiji commented 6 years ago

Rather than introduce a new command, Predict() now accepts an additional argument, OOB. When set to TRUE (the default is FALSE), prediction is done with respect to the bagged row information created during training. You should now be able to assess the effects of permuting columns according to all metrics availed by prediction. This has not been extensively tested. In particular, there is not yet a guard to ensure that "newdata" has the same number of rows as the bag.
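For readers unfamiliar with what predicting "with respect to the bag" involves, the following is an illustrative sketch in plain Python, not the package's code. The names `oob_predict`, `trees` and `bag` are assumptions made for the example: each row is scored only by trees that did not sample it during training, and the row-count guard mentioned above is included.

```python
from collections import Counter


def oob_predict(trees, bag, newdata):
    """Majority vote per row, using only trees whose bag excludes that row.

    trees:   list of callables mapping a row to a class label (hypothetical)
    bag:     bag[t][i] is True if row i was sampled when training tree t
    newdata: list of rows, in the same order and number as the training rows
    """
    n_rows = len(bag[0]) if bag else 0
    if len(newdata) != n_rows:
        # Guard: the bag is only meaningful for a conforming set of rows.
        raise ValueError(f"newdata has {len(newdata)} rows; bag covers {n_rows}")
    predictions = []
    for i, row in enumerate(newdata):
        votes = Counter(tree(row) for t, tree in enumerate(trees) if not bag[t][i])
        # A row bagged by every tree has no out-of-bag vote.
        predictions.append(votes.most_common(1)[0][0] if votes else None)
    return predictions


# Three toy "trees" and a bag over two training rows.
trees = [
    lambda r: "a",
    lambda r: "a" if r[0] < 1 else "b",
    lambda r: "a" if r[0] < 1 else "b",
]
bag = [
    [False, True],   # tree 0 sampled row 1
    [True, False],   # tree 1 sampled row 0
    [False, False],  # tree 2 sampled neither row
]
result = oob_predict(trees, bag, [[0], [2]])
```

Passing a column-permuted copy of the training rows as `newdata` keeps the bag valid, since permuting values within a column leaves the number and order of rows unchanged.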

gse-cc-git commented 6 years ago

Thank you ! I'll be testing that soon !

suiji commented 5 years ago

Closing, as the original issue appears to be addressed.

Please feel free to reopen if needed.

flippercy commented 4 years ago

Does the package have any function to generate the variable importance now? Thanks!

suiji commented 2 years ago

Yes, the impPermute option provides this functionality.

suiji / Arborist

Classification task: how to extract feature importance per class ? #40