kLabUM / rrcf

🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams
https://klabum.github.io/rrcf/
MIT License

Added a method to calculate feature importance #100

Closed JavierVeraOlmos closed 1 year ago

JavierVeraOlmos commented 1 year ago

I'm working on a project to detect anomalies and this model works pretty well; however, I need to determine which features the model relies on most when deciding that a point is an anomaly. I found an issue, #90, that also requested a way to do this, so maybe this will be useful.

I calculate feature importance in a very simple way:

While calculating the CoDisp for each leaf, we can store the dimension used in the cut that produced the maximum Disp(C)/|C|. Instead of using a vector (or a Series, as in the example), we use a matrix with the data points as columns and the possible dimensions as rows, and accumulate into it the amount of CoDisp found at each dimension. To score anomalies, we sum each point's column over all dimensions and take the mean over the forest, just like the batch example. To find the most important feature behind the CoDisp, we first flag the anomalies using a threshold, and then take the mean over the rows restricted to the anomalous columns: this tells us which dimension contributed the most to the CoDisp of the anomalies.
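The aggregation described above can be sketched with plain NumPy. The per-dimension CoDisp matrix here is synthetic (the real method would fill it while traversing the forest), so the names and values are only stand-ins for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_dims, n_points = 3, 100
# Hypothetical per-dimension CoDisp matrix: rows are cut dimensions,
# columns are data points. Entry [d, i] stands in for the CoDisp mass
# that point i accumulated from cuts along dimension d.
codisp_matrix = rng.random((n_dims, n_points))
# Make the last 5 points anomalous, driven mostly by dimension 2.
codisp_matrix[:, 95:] += 1.0
codisp_matrix[2, 95:] += 5.0

# Anomaly score per point: sum the point's column over all dimensions.
scores = codisp_matrix.sum(axis=0)

# Flag anomalies with a simple threshold (here: mean + 2 std).
threshold = scores.mean() + 2 * scores.std()
anomalies = scores > threshold

# Feature importance: mean per-dimension CoDisp over the anomalous columns.
importance = codisp_matrix[:, anomalies].mean(axis=1)
top_feature = int(np.argmax(importance))
print(top_feature)  # dimension 2 dominates by construction
```

The same column-sum/row-mean scheme carries over directly when the matrix is filled from real trees instead of random values.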

I have implemented the method `_codisp_with_cutdimension`, along with an example using it in the README.

mdbartos commented 1 year ago

Very cool! Thanks for sharing. Can you add a unit test of the new functionality to test/test_rrcf.py?

JavierVeraOlmos commented 1 year ago

I added a unit test in the same fashion as the existing one for CoDisp.

mdbartos commented 1 year ago

Looks good. Just add your example to the docs in addition to the readme: https://github.com/kLabUM/rrcf/tree/master/docs

Either under index.md or anomaly-scoring.md, or both.

JavierVeraOlmos commented 1 year ago

I think it's OK not to write an explanation in anomaly-scoring.md, because I don't have any document or paper that explicitly backs up my feature importance calculation. However, while looking for something to support it, I found a paper in which the authors try to calculate feature importance for Isolation Forest. On my first read, they seem to end up doing something similar to the CoDisp approach, but combining it with the depth of the node and dividing the points into inliers and outliers using the model (by applying a threshold to the anomaly score). So they are doing something similar to my code, but in a more principled way and applied to IF. It may be worth checking it out and implementing something more sophisticated based on their algorithm.

In the meantime, my pull request does something similar.
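For what it's worth, the idea described above could be roughly sketched as follows. This is a loose interpretation with synthetic stand-in arrays, not the paper's actual algorithm: per-dimension contributions are weighted by inverse node depth, and importance is the gap between the mean weighted contribution over outliers and over inliers (all names and data here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n_dims, n_points = 3, 200

# Synthetic stand-ins for quantities a forest would produce:
# per-dimension score contributions, and the depth of the node where
# each contribution was recorded (both made up for illustration).
contrib = rng.random((n_dims, n_points))
depth = rng.integers(1, 5, size=(n_dims, n_points))
contrib[2, 190:] += 8.0  # make dimension 2 drive the last 10 points

# Split points into inliers and outliers with a score threshold.
scores = contrib.sum(axis=0)
outliers = scores > np.quantile(scores, 0.95)

# Weight contributions by inverse depth: shallower cuts count more.
weighted = contrib / depth

# Importance: how much more each dimension contributes for outliers
# than for inliers, on average.
importance = (weighted[:, outliers].mean(axis=1)
              - weighted[:, ~outliers].mean(axis=1))
print(int(np.argmax(importance)))  # dimension 2 by construction
```

The inlier/outlier split is what distinguishes this from the plain row-mean over anomalies: a dimension that contributes equally everywhere gets an importance near zero.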

mdbartos commented 1 year ago

Great, let me know if it's ready and I will pull it in. Consider creating an issue to explore the idea of other feature importance detection methods if you are interested.

JavierVeraOlmos commented 1 year ago

I think it is ready. 👍