kLabUM / rrcf

🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams
https://klabum.github.io/rrcf/
MIT License

The problem of RRCF training data to get the model #78

Open Zhoulinfeng0510 opened 4 years ago

Zhoulinfeng0510 commented 4 years ago

Can RRCF obtain a model from the training set data, and then use this model to detect anomalies in the new data stream?

mdbartos commented 4 years ago

Yes. In this case you would:

You can also use a similar approach for classification: https://klabum.github.io/rrcf/classification.html

Zhoulinfeng0510 commented 4 years ago

Yep! I'd like to know more about how to obtain such a model. My current understanding is that I should use the to_dict function in the API. Is this correct? If so, could you please share a concrete code example? Thank you very much for your reply.

mdbartos commented 4 years ago

This should work:

Train model (same example as in README)

import numpy as np
import pandas as pd
import rrcf

# Set parameters
np.random.seed(0)
n = 2010
d = 3
num_trees = 10
tree_size = 10

# Generate data
X = np.zeros((n, d))
X[:1000,0] = 5
X[1000:2000,0] = -5
X += 0.01*np.random.randn(*X.shape)

# Construct forest
forest = []
while len(forest) < num_trees:
    # Select random subsets of points uniformly from point set
    ixs = np.random.choice(n, size=(n // tree_size, tree_size),
                           replace=False)
    # Add sampled trees to forest
    trees = [rrcf.RCTree(X[ix], index_labels=ix) for ix in ixs]
    forest.extend(trees)

Save forest to json file

# Write learned model to json file
import json

# Convert forest to list of dictionaries
out_json = [tree.to_dict() for tree in forest]

# Write forest to file
with open('forest.json', 'w') as outfile:
    json.dump(out_json, outfile)

Read forest from json file

# Read json file into new forest
with open('forest.json', 'r') as infile:
    forest_obj = json.load(infile)

new_forest = []
for tree_obj in forest_obj:
    tree = rrcf.RCTree.from_dict(tree_obj)
    new_forest.append(tree)

Compare:

>>> forest[0]
─+
 ├───+
 │   ├──(6)
 │   └───+
 │       ├───+
 │       │   ├──(1)
 │       │   └──(4)
 │       └──(8)
 └───+
     ├───+
     │   ├──(0)
     │   └───+
     │       ├───+
     │       │   ├──(9)
     │       │   └──(5)
     │       └──(2)
     └───+
         ├──(3)
         └──(7)

>>> new_forest[0]
─+
 ├───+
 │   ├──(6)
 │   └───+
 │       ├───+
 │       │   ├──(1)
 │       │   └──(4)
 │       └──(8)
 └───+
     ├───+
     │   ├──(0)
     │   └───+
     │       ├───+
     │       │   ├──(9)
     │       │   └──(5)
     │       └──(2)
     └───+
         ├──(3)
         └──(7)
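
Once the forest has been reloaded, one possible way to score a new point against it is to insert the point into each tree, read its collusive displacement (codisp), and remove it again so the stored model stays unchanged. The sketch below is only an illustration under that assumption: `score_point` is a hypothetical helper, not part of rrcf. It relies only on the `insert_point`, `codisp` and `forget_point` methods of `rrcf.RCTree`, and it assumes `index` is a label that is not already used in any tree.

```python
def score_point(forest, point, index):
    """Average codisp of `point` over all trees, leaving the trees unchanged.

    Hypothetical helper (not part of rrcf). `forest` is a list of
    rrcf.RCTree objects; `index` must not already exist in any tree.
    """
    scores = []
    for tree in forest:
        # Temporarily add the new point as a leaf under the given label
        tree.insert_point(point, index=index)
        # Collusive displacement of that leaf in this tree
        scores.append(tree.codisp(index))
        # Remove the point again so the trained tree is not modified
        tree.forget_point(index)
    # Average over trees, as in the README's batch example
    return sum(scores) / len(scores)
```

With the forest loaded above, a call could look like `score_point(new_forest, np.array([5.0, 0.0, 0.0]), index='new')`. A string label is used here because the training points are already labelled with the integers 0 to 2009. If the model should instead keep adapting to the stream, the `forget_point` call would be dropped and the oldest point forgotten, as in the README's streaming example.
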
Zhoulinfeng0510 commented 4 years ago

Okay, I think I now understand how RRCF works in this case. Thank you very much! :) After further research, I found another problem: for multi-dimensional streaming data, calculating codisp is a problem. I used shingle to create a sliding window, which gives data of shape m x n, but the insert_point function only accepts a single point of shape 1 x d. Does rrcf have a better way to calculate anomaly scores for multidimensional, sliding-window data?

mdbartos commented 4 years ago

If you want to use shingles, each point inserted into the tree should be of the form:

[x_1(t_1), x_1(t_2), ... x_1(t_n), x_2(t_1), x_2(t_2), ... x_2(t_n), ... x_m(t_1), x_m(t_2), ... x_m(t_n)]
...
[x_1(t_2), x_1(t_3), ... x_1(t_n+1), x_2(t_2), x_2(t_3), ... x_2(t_n+1), ... x_m(t_2), x_m(t_3), ... x_m(t_n+1)]
...

And so on.

Each point will be of dimension (1 x nm) where n is the shingle size and m is the number of variables.
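
As a rough illustration of the layout described above, the following sketch builds such points with NumPy from a series stored as a (T x m) array (T time steps, m variables). `shingle_multivariate` is a hypothetical helper written for this example, not part of rrcf, and the variable-by-variable ordering is an assumption taken from the form shown above.

```python
import numpy as np

def shingle_multivariate(X, n):
    """Turn a (T x m) series into (T - n + 1) points of length n * m.

    Hypothetical helper (not part of rrcf). Each output row is
    [x_1(t_i..t_i+n-1), x_2(t_i..t_i+n-1), ..., x_m(t_i..t_i+n-1)].
    """
    X = np.asarray(X, dtype=float)
    T, m = X.shape
    if n < 1 or n > T:
        raise ValueError("shingle size n must satisfy 1 <= n <= T")
    # X[i:i + n] has shape (n, m); transposing gives (m, n), so ravel()
    # lists all n time steps of variable 1, then variable 2, and so on.
    return np.stack([X[i:i + n].T.ravel() for i in range(T - n + 1)])

# Example: 6 time steps of 2 variables, shingle size 3
X_demo = np.arange(12, dtype=float).reshape(6, 2)
points = shingle_multivariate(X_demo, 3)
print(points.shape)  # (4, 6)
```

Each row of `points` is then a single 1 x nm point and can be passed to `tree.insert_point(points[i], index=i)`.
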

Zhoulinfeng0510 commented 3 years ago

Thank you very much for your reply; I have solved the problem above. However, I have run into another problem when using RRCF. In Figure 1, a segment in the middle of the data (orange line) is obviously anomalous. In Figure 2, however, the highest anomaly score within that segment is only 0.25, and later segments that show little anomaly also occasionally reach 0.25. This confuses me. Figure_1 Figure_2

yasirroni commented 3 years ago

If you want to use shingles, each point inserted into the tree should be of the form:

[x_1(t_1), x_1(t_2), ... x_1(t_n), x_2(t_1), x_2(t_2), ... x_2(t_n), ... x_m(t_1), x_m(t_2), ... x_m(t_n)]
...
[x_1(t_2), x_1(t_3), ... x_1(t_n+1), x_2(t_2), x_2(t_3), ... x_2(t_n+1), ... x_m(t_2), x_m(t_3), ... x_m(t_n+1)]
...

And so on.

Each point will be of dimension (1 x nm) where n is the shingle size and m is the number of variables.

This should be added to the doc example (I didn't see it there; either I missed it or it is not documented).