kLabUM / rrcf

🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams
https://klabum.github.io/rrcf/
MIT License

inference on new data samples #70

Open sophark opened 4 years ago

sophark commented 4 years ago

Hi, thanks for implementing this. I have a use case that needs to train the rrcf on a given dataset and then predict on unseen data samples that weren't seen during the training period.

Can we use batch mode to achieve that? One simple solution I can come up with is to first insert the point into the forest, calculate the codisp, and then delete it. I am wondering whether there is any smarter way to save inference time.

Thanks.

mdbartos commented 4 years ago

That's probably the most flexible way to do it (create forest from point set S using batch mode -> insert new point x into each tree -> compute codisp -> delete point x). But yes, it will probably be slow. Parallelizing can help though.
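Sketched out, that workflow looks something like the following (the tree count, sample sizes, data, and the `"query"` label are illustrative, using the library's `RCTree`, `insert_point`, `codisp`, and `forget_point`):

```python
import numpy as np
import rrcf

# Illustrative training data: 1000 points in 3 dimensions
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))

num_trees = 40
tree_size = 256

# Batch mode: build each tree from a random subsample of the training set
forest = []
for _ in range(num_trees):
    ix = rng.choice(len(X_train), size=tree_size, replace=False)
    forest.append(rrcf.RCTree(X_train[ix], index_labels=ix))

def score_point(forest, x, index="query"):
    """Insert x into every tree, average its codisp, then remove it again."""
    avg_codisp = 0.0
    for tree in forest:
        tree.insert_point(x, index=index)
        avg_codisp += tree.codisp(index) / len(forest)
        tree.forget_point(index)
    return avg_codisp

# Score an unseen sample without permanently changing the forest
x_new = rng.normal(size=3)
print(score_point(forest, x_new))
```

Since each tree is scored independently, the loop over trees is the natural place to parallelize (e.g. with multiprocessing).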

I'm not sure if this is helpful, but note that the insert_point algorithm is guaranteed to produce a tree drawn from RRCF(S ∪ {x}), where S is a point set and x is an additional point.

In other words, the following two trees are statistically indistinguishable, and the codisp of x will be the same in expectation:
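Something like the following sketch, with illustrative data and indices (the batch-built tree and the insert-built tree differ individually, but over many trees the codisp of x has the same expectation):

```python
import numpy as np
import rrcf

rng = np.random.default_rng(0)
S = rng.normal(size=(100, 2))   # existing point set
x = rng.normal(size=2)          # additional point

# Tree 1: built in batch mode from S union {x}
tree_batch = rrcf.RCTree(np.vstack([S, x]))

# Tree 2: built in batch mode from S, then x inserted dynamically
tree_insert = rrcf.RCTree(S)
tree_insert.insert_point(x, index=len(S))

# Both trees are draws from RRCF(S union {x})
print(tree_batch.codisp(len(S)), tree_insert.codisp(len(S)))
```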

sophark commented 4 years ago

> That's probably the most flexible way to do it (create forest from point set S using batch mode -> insert new point x into each tree -> compute codisp -> delete point x). But yes, it will probably be slow. Parallelizing can help though.

Thanks for your hints. Yes, it is indeed a little slow without parallelizing. Do you know which of the steps above consumes most of the time, and what its time complexity is? I would guess the insert_point step?

mdbartos commented 4 years ago

Yeah, I would say insert_point is the slowest step. I have time breakdowns here: https://github.com/kLabUM/rrcf/issues/28
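If you want a rough profile on your own data, a minimal sketch (single run, illustrative sizes) that times each step of the insert -> codisp -> forget cycle separately:

```python
import time
import numpy as np
import rrcf

rng = np.random.default_rng(0)
tree = rrcf.RCTree(rng.normal(size=(1000, 2)))
x = rng.normal(size=2)

# Time each step of the insert -> codisp -> forget cycle
for name, step in [
    ("insert_point", lambda: tree.insert_point(x, index=1000)),
    ("codisp", lambda: tree.codisp(1000)),
    ("forget_point", lambda: tree.forget_point(1000)),
]:
    t0 = time.perf_counter()
    step()
    print(f"{name}: {time.perf_counter() - t0:.6f} s")
```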