kLabUM / rrcf

🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams
https://klabum.github.io/rrcf/
MIT License
495 stars, 112 forks

QUESTION: Feature importance #90

Open stianvale opened 3 years ago

stianvale commented 3 years ago

Hi, and thanks for building this great repo!

I have a general question: what is the proper way to compute feature importance for RRCF? Basically, I want to know which features contribute the most to the collusive displacement (codisp) value.

mdbartos commented 3 years ago

To clarify, do you mean: for a set of multidimensional points, which dimension contributes the most to the total codisp over all points in the dataset?

These three pages of the docs may be useful:
Tree construction: https://klabum.github.io/rrcf/tree-construction.html
Anomaly scoring: https://klabum.github.io/rrcf/anomaly-scoring.html
Caveats: https://klabum.github.io/rrcf/caveats.html

Perhaps it would be helpful to specify a (mathematical) definition of feature importance for your problem of interest. Or perhaps you can describe the particular problem you are trying to solve.

stianvale commented 3 years ago

Thanks for your reply, @mdbartos!

Yes, what I'm asking is: for a given multidimensional point, which dimensions contribute the most to that point's codisp?

I have a draft approach that simply compares the point's value in each dimension with the mean value of that dimension over all points. This shows which dimensions differ the most from 'normal' behavior, but it is only a temporary proxy for feature importance.
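A minimal sketch of that proxy, for illustration only. It uses only the Python standard library; the function name `deviation_per_dimension` and the sample data are hypothetical. Dividing by the per-dimension standard deviation is an added assumption (so that dimensions on different scales are comparable) and is not part of the description above.

```python
from statistics import mean, pstdev

def deviation_per_dimension(point, data):
    """Return the absolute standardized deviation of `point` from the
    per-dimension mean of `data` (a sequence of equal-length tuples)."""
    columns = list(zip(*data))  # one tuple of values per dimension
    scores = []
    for value, column in zip(point, columns):
        mu = mean(column)
        sigma = pstdev(column)
        # A constant dimension cannot separate points; report 0 instead of dividing by zero.
        scores.append(abs(value - mu) / sigma if sigma > 0 else 0.0)
    return scores

# Hypothetical data: the point is ordinary in dimension 0, extreme in dimension 1.
data = [(0.0, 10.0), (1.0, 10.5), (2.0, 9.5), (1.0, 10.0)]
point = (1.0, 14.0)

scores = deviation_per_dimension(point, data)
# Dimension indices ordered from most to least deviating.
ranking = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
```

Note that this ranks dimensions by distance from the mean, independently of the forest, so it can disagree with codisp for points that are anomalous only in combination of dimensions.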

So what I'm asking is whether there is some way to derive the feature importances of a point from the formula for codisp.

Does that make sense?
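One possible heuristic, sketched here under stated assumptions and not a method confirmed by the maintainers: codisp is the maximum, over the ancestors of a leaf, of the ratio between the sibling subtree size and the size of the subtree containing the leaf, and each ancestor cut is made on a single dimension. The ratio at each step can therefore be credited to the dimension of that cut. The sketch assumes nodes expose `u` (parent), `l` and `r` (children), `n` (number of leaves below) and `q` (cut dimension), which is how I understand rrcf's `Branch` and `Leaf` classes to be laid out; please check this against your installed version. The `_node` and `_link` helpers are hypothetical stand-ins so the example runs without the library.

```python
from types import SimpleNamespace

def codisp_by_dimension(leaf):
    """Walk from `leaf` up to the root. For each cut dimension, keep the
    largest (sibling size / own subtree size) ratio seen at a cut on that
    dimension. The maximum over all dimensions equals the usual codisp."""
    best = {}
    node = leaf
    while node.u is not None:
        parent = node.u
        sibling = parent.r if node is parent.l else parent.l
        ratio = sibling.n / node.n
        best[parent.q] = max(best.get(parent.q, 0.0), ratio)
        node = parent
    return best

# Hypothetical stand-in nodes mimicking the assumed attribute names.
def _node(n, q=None):
    return SimpleNamespace(n=n, q=q, u=None, l=None, r=None)

def _link(parent, left, right):
    parent.l, parent.r = left, right
    left.u = right.u = parent

# Root cuts on dimension 0; the inner branch cuts on dimension 1.
root, inner = _node(5, q=0), _node(2, q=1)
x, y, rest = _node(1), _node(1), _node(3)
_link(root, inner, rest)
_link(inner, x, y)

contrib = codisp_by_dimension(x)
```

With a real forest, the same function could presumably be applied to the leaf for a given index in each tree and the per-dimension values averaged across trees, since any single tree's cuts are random.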

stianvale commented 3 years ago

Hi again! @mdbartos, have you ever experimented with computing the feature importance of a particular point? I think this would be a great addition to the current library in terms of improving the explainability of the anomalies.