kLabUM / rrcf

🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams
https://klabum.github.io/rrcf/
MIT License
488 stars, 111 forks

QUESTION: Simulating sampling of points in streaming detection #91

Open stianvale opened 3 years ago

stianvale commented 3 years ago

Hi! I've tested both your implementation of 'streaming detection' and 'batch detection'. So far, I'm getting the best results with the 'batch detection'. However, I want to use the streaming approach to dynamically update the model according to a continuous stream of data.

My current understanding is that 'batch detection' performs better because of the random sampling of points. With 'streaming detection', all trees contain the same points. Therefore, I tested an approach where some points are randomly deleted from each tree after calculating the codisp. That way, the trees contain different points, which in a way simulates random sampling of points. My results so far suggest that this works well.
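To make the idea concrete, here is a minimal toy sketch of the bookkeeping, not the code I actually ran. It assumes each tree can be modelled as just the list of point indices it currently holds; the function name and the `drop_prob` parameter are made up for illustration. Keeping per-tree membership also means the sliding-window eviction always removes an index that the tree still contains.

```python
import random

def stream_with_random_drops(n_points, num_trees, tree_size, drop_prob, seed=0):
    """Toy model: each 'tree' is only the list of point indices it holds."""
    rng = random.Random(seed)
    trees = [[] for _ in range(num_trees)]
    for index in range(n_points):
        for members in trees:
            # Sliding-window step: evict the oldest remaining index when full.
            if len(members) >= tree_size:
                members.pop(0)
            members.append(index)
            # (In a real forest, codisp for `index` would be computed here.)
            # Extra step: sometimes delete one random older point from this
            # tree only, so that tree memberships drift apart over time.
            if len(members) > 1 and rng.random() < drop_prob:
                victim = rng.randrange(len(members) - 1)  # never the newest point
                members.pop(victim)
    return trees

trees = stream_with_random_drops(n_points=200, num_trees=4, tree_size=32, drop_prob=0.3)
for members in trees:
    print(len(members), members[:5])
```

In a real forest, each `pop` would be paired with a call that removes that index from the corresponding tree, and each `append` with an insert.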

Does this sound like a valid alternative to the standard 'streaming detection', or are there some traps I'm missing here?

mdbartos commented 3 years ago

Greetings,

The method for sampling included in the README was chosen for demonstration purposes---the implementation is short and easy to read. It's definitely not the only way to do sampling, and different sampling methods are encouraged.

The original RRCF paper proposes 'reservoir sampling', which would correspond to uniform sampling in time for the batch mode case. (See: https://en.wikipedia.org/wiki/Reservoir_sampling)
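As a rough illustration only (this is not code from the repository), a minimal sketch of classic reservoir sampling, "Algorithm R" from the Wikipedia article above, could look like the following. The `Reservoir` class and its `offer` method are hypothetical names; the idea is to keep one reservoir per tree so that each tree retains its own uniform sample of the stream.

```python
import random

class Reservoir:
    """Algorithm R: keep a uniform random sample of `size` indices from a stream."""

    def __init__(self, size, seed=None):
        self.size = size
        self.seen = 0          # number of indices offered so far
        self.members = []      # indices currently in the sample
        self.rng = random.Random(seed)

    def offer(self, index):
        """Offer a new index; return (accepted, evicted_index_or_None)."""
        self.seen += 1
        if len(self.members) < self.size:
            # Reservoir not full yet: always accept.
            self.members.append(index)
            return True, None
        # Accept with probability size / seen, replacing a uniformly chosen slot.
        slot = self.rng.randrange(self.seen)
        if slot < self.size:
            evicted = self.members[slot]
            self.members[slot] = index
            return True, evicted
        return False, None

# Hypothetical wiring with a forest (one reservoir per tree):
#   accepted, evicted = reservoir.offer(index)
#   if evicted is not None: tree.forget_point(evicted)
#   if accepted: tree.insert_point(point, index=index)
reservoir = Reservoir(size=5, seed=0)
for i in range(100):
    reservoir.offer(i)
print(sorted(reservoir.members))
```

Note that a point rejected by a reservoir is never stored in that tree, so scoring it would presumably require inserting it temporarily, computing the codisp, and then forgetting it again.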

Ultimately, the choice of sampling method will depend on the user's needs, namely how far back in time you want the algorithm to 'remember'.

MDB

stianvale commented 3 years ago

Thanks for your response, @mdbartos !

Cool, yeah, I see that's the default sampling technique of SageMaker's RRCF as well. I'll test out reservoir sampling then. Have you implemented it for this repo before? If so, maybe you could share the code?