suiji / Arborist

Scalable decision tree training and inference.

rank features by their importance #28

Closed mjuarezm closed 7 years ago

mjuarezm commented 7 years ago

Is there a way to rank the features by their importance after testing the random forest?

I'm using the Python wrapper. If this is possible in the R or C++ implementations, how difficult would it be to port it to pyborist?

suiji commented 7 years ago

The feature weights are tracked by the Core, so they should be available to the Cython caller. They are returned by the Train entry point.

mjuarezm commented 7 years ago

Hi @suiji Thanks for the quick reply. I think I found a way to access them using pyborist:

clf = PyboristClassifier(n_estimators=1000)
[...]
print(clf.estimators_['predInfo'])

Is this correct? Thank you!

suiji commented 7 years ago

Yes, "predInfo" is the name of the field. Be careful with the predictor ordering, though:

Predictors are blocked according to their front-end type. At the moment there are only two types, numeric and categorical. In any case, the 'ith' element of 'predInfo' is not necessarily the 'ith' predictor. There is a vector, predMap[], that can be used to map back to the original ordering.

There have been several refactorings of the Core and BridgeR subprojects since Pyborist was last modified. It is possible that 'predMap[]' may need to be "reactivated".
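
As a rough illustration of the mapping described above, the sketch below reorders a blocked importance vector into the caller's column order. It assumes that pred_map[i] holds the original zero-based column index of the i-th internal predictor; that reading is an assumption and has not been checked against the current Core. The helper to_original_order is hypothetical and not part of Pyborist, and the values are made up.

```python
def to_original_order(pred_info, pred_map):
    """Return importances indexed by original column position.

    Assumption: pred_map[i] is the original zero-based column index of
    the i-th internal (blocked) predictor.
    """
    if len(pred_info) != len(pred_map):
        raise ValueError("pred_info and pred_map must have the same length")
    if sorted(pred_map) != list(range(len(pred_map))):
        raise ValueError("pred_map must be a permutation of 0..n-1")

    original = [0.0] * len(pred_info)
    for internal_pos, original_pos in enumerate(pred_map):
        # Move the value from its blocked slot to its original slot.
        original[original_pos] = pred_info[internal_pos]
    return original


# Made-up example: columns 0 and 2 are numeric, column 1 is categorical,
# so the assumed internal (blocked) order is [0, 2, 1].
blocked_info = [0.5, 0.2, 0.3]
pred_map = [0, 2, 1]
print(to_original_order(blocked_info, pred_map))
```

If the actual convention turns out to be the inverse (original index to internal position), the assignment inside the loop would need to be flipped.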

mjuarezm commented 7 years ago

Oh, thanks for the heads up. That's very relevant.

suiji commented 7 years ago

Let me know how I can help further. The easier it is for people to exploit and contribute to the project, the more successful it may become.

mjuarezm commented 7 years ago

Thank you, and congrats on this project! Regarding the categorical and numeric types and the ordering of predInfo: we're encoding the categorical predictors as dummy binary predictors. Is there any guarantee that the order will be preserved if all the predictors are numeric?

suiji commented 7 years ago

Yes, the block-wise suborderings respect the original predictor order. Provided that the one-hot encodings are passed to Pyborist() on entry, the internal and external orderings should be identical.
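
A small sketch of why this holds, under stated assumptions: predictors are grouped into a numeric block followed by a categorical block (the order of the two blocks is assumed here for illustration), and each block keeps the original relative order. The function blocked_order and the type labels are hypothetical, not part of the Core or Pyborist.

```python
def blocked_order(kinds):
    """Return original column indices in an assumed internal order.

    Assumption: the numeric block comes first, then the categorical
    block, and each block preserves the original relative order.
    """
    numeric = [i for i, kind in enumerate(kinds) if kind == "numeric"]
    categorical = [i for i, kind in enumerate(kinds) if kind == "categorical"]
    if len(numeric) + len(categorical) != len(kinds):
        raise ValueError("kinds must contain only 'numeric' or 'categorical'")
    return numeric + categorical


# Mixed types: the internal order differs from the original one.
print(blocked_order(["numeric", "categorical", "numeric", "categorical"]))

# All numeric (e.g. one-hot encoded up front): the order is the identity.
print(blocked_order(["numeric"] * 4))
```

With a single block there is nothing to regroup, so the internal ordering coincides with the caller's.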

mjuarezm commented 7 years ago

Great, thanks. I may take a stab at exposing predMap in the Python wrapper. I actually sent some minor pull requests to @fyears, but they're still pending.

suiji commented 7 years ago

@fyears may be inactive on the project. If so, I may be able to assist with a merge to the "main" branch. I am pedal-to-the-metal on GPU support right now so, for the short term, do not have a lot of cycles.

In any case, the respective organizations of the R and Python bridges should be quite similar. From what @fyears has said, the main difference appears to be that the Rcpp glue "language" exposes some features of R that have no direct counterparts in Python/Cython. For these, it becomes necessary to include Pandas and NumPy in order to introduce comparable functionality.

mjuarezm commented 7 years ago

Okay. I'll send you the PRs.

BTW, I merged the latest changes in Core, and it seems that pyborist is quite far behind and would require a major refactoring. I don't have much experience with Cython and have little time to learn it, so I think somebody with more experience in Python wrappers would be a better candidate for this.

But no worries. As long as all numeric predictors keep the order in predInfo, we are good. Thanks a lot for your help!

suiji commented 7 years ago

The changes have been merged into the Python subproject. Thank you for submitting them.

It is unfortunate that the Core and bridge code have diverged. In addition to bug fixes and scalability enhancements, the latest versions of the Core feature the ability to compress sparse predictors. For one-hot codings, in particular, this could result in a much lower memory footprint.

gse-cc-git commented 6 years ago

Hello, Sorry to dig out that post.

I the R version, do we also have to "re-wire" the predictor label to its position using signature$predMap (like stated earlier in this thread) when using training$info to get a clue of the predictor's importance in the model ?

More explicitly, when I read

rf$signature$predMap
 [1]  0  1  2  3  4  5  6  7  8 10 11 14 16 17 18 19 21 25 26 27 28 30 31 32 33 34 35 36 37 38  9 12 13 15 20 22 23 24 29

does it mean that the first information value in the vector rf$training$info maps to the first column (index 0) of the training dataframe? And similarly, does the last one map to the 30th column (index 29)?

Thanks for your help and for your work.

suiji commented 6 years ago

Sorry to dig out that post.

Not at all. Always enjoy continuing the conversation.

I [sic] the R version, do we also have to "re-wire" the predictor label to its position using signature$predMap (like stated earlier in this thread) when using training$info to get a clue of the predictor's importance in the model ?

No. The information in the training data structure should reflect the zero-based predictor position with respect to the user's data.

The predMap field of the signature data structure indicates the internal predictor ordering employed for training and prediction. It should only be of interest to maintainers of the package and, possibly, developers of tools employing the Arborist's internal representation.
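
If the importance values are indeed reported in the user's column order, ranking reduces to pairing each value with its column name and sorting. A minimal sketch in Python, with made-up names and values; rank_predictors is a hypothetical helper and not part of either bridge.

```python
def rank_predictors(names, info):
    """Pair each predictor name with its importance, most important first.

    Assumption: info[i] is the importance of the predictor in column i,
    i.e. the values are already in the user's column order.
    """
    if len(names) != len(info):
        raise ValueError("names and info must have the same length")
    return sorted(zip(names, info), key=lambda pair: pair[1], reverse=True)


# Made-up column names and importance values, for illustration only.
names = ["age", "height", "colour"]
info = [0.12, 0.70, 0.18]
for name, score in rank_predictors(names, info):
    print(name, score)
```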