Open Zhoulinfeng0510 opened 4 years ago
Yes. In this case you would:
You can also use a similar approach for classification: https://klabum.github.io/rrcf/classification.html
Yes! I would like to know more about how to obtain such a model. My current understanding is that I should use the to_dict function in the API. Is this correct? If so, could you please give me a specific code example? Thank you very much for your reply.
This should work:
import numpy as np
import pandas as pd
import rrcf
# Set parameters
np.random.seed(0)
n = 2010
d = 3
num_trees = 10
tree_size = 10
# Generate data
X = np.zeros((n, d))
X[:1000,0] = 5
X[1000:2000,0] = -5
X += 0.01*np.random.randn(*X.shape)
# Construct forest
forest = []
while len(forest) < num_trees:
    # Select random subsets of points uniformly from point set
    ixs = np.random.choice(n, size=(n // tree_size, tree_size),
                           replace=False)
    # Add sampled trees to forest
    trees = [rrcf.RCTree(X[ix], index_labels=ix) for ix in ixs]
    forest.extend(trees)
# Write learned model to json file
import json
# Convert forest to list of dictionaries
out_json = [tree.to_dict() for tree in forest]
# Write forest to file
with open('forest.json', 'w') as outfile:
    json.dump(out_json, outfile)
# Read json file into new forest
with open('forest.json', 'r') as infile:
    forest_obj = json.load(infile)
new_forest = []
for tree_obj in forest_obj:
    tree = rrcf.RCTree.from_dict(tree_obj)
    new_forest.append(tree)
>>> forest[0]
>>>
─+
 ├───+
 │   ├──(6)
 │   └───+
 │       ├───+
 │       │   ├──(1)
 │       │   └──(4)
 │       └──(8)
 └───+
     ├───+
     │   ├──(0)
     │   └───+
     │       ├───+
     │       │   ├──(9)
     │       │   └──(5)
     │       └──(2)
     └───+
         ├──(3)
         └──(7)
>>> new_forest[0]
>>>
─+
 ├───+
 │   ├──(6)
 │   └───+
 │       ├───+
 │       │   ├──(1)
 │       │   └──(4)
 │       └──(8)
 └───+
     ├───+
     │   ├──(0)
     │   └───+
     │       ├───+
     │       │   ├──(9)
     │       │   └──(5)
     │       └──(2)
     └───+
         ├──(3)
         └──(7)
Okay, I think I now understand how RRCF works here. Thank you very much! :) After further research, I found another problem: for multi-dimensional streaming data, calculating codisp is difficult. I used shingles to create a sliding window, which gives data of shape m x n, but the insert_point function only processes 1 x d data. Does rrcf have a better way to calculate anomaly scores for multidimensional, sliding-window data?
If you want to use shingles, each point inserted into the tree should be of the form:
[x_1(t_1), x_1(t_2), ... x_1(t_n), x_2(t_1), x_2(t_2), ... x_2(t_n), ... x_m(t_1), x_m(t_2), ... x_m(t_n)]
...
[x_1(t_2), x_1(t_3), ... x_1(t_n+1), x_2(t_2), x_2(t_3), ... x_2(t_n+1), ... x_m(t_2), x_m(t_3), ... x_m(t_n+1)]
...
And so on.
Each point will be of dimension (1 x nm) where n is the shingle size and m is the number of variables.
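The layout above can be sketched in plain Python. This is only an illustration under stated assumptions: flat_shingles is a hypothetical helper name (not part of rrcf), and the toy stream uses m = 2 variables with a shingle size of n = 3.

```python
from collections import deque
from typing import Iterable, Iterator, List, Sequence


def flat_shingles(rows: Iterable[Sequence[float]], size: int) -> Iterator[List[float]]:
    """Yield one flattened shingle per time step once `size` rows have arrived.

    Hypothetical helper, not part of rrcf. Each input row holds the m
    variables observed at one time step. Each output is grouped by variable:
    [x_1(t_1..t_n), x_2(t_1..t_n), ..., x_m(t_1..t_n)].
    """
    window = deque(maxlen=size)  # keeps only the most recent `size` rows
    for row in rows:
        window.append(tuple(row))
        if len(window) == size:
            # zip(*window) transposes the window from time-major to variable-major
            yield [value for series in zip(*window) for value in series]


# Toy stream: m = 2 variables observed at 4 time steps
stream = [(1, 10), (2, 20), (3, 30), (4, 40)]
points = list(flat_shingles(stream, size=3))
print(points)  # [[1, 2, 3, 10, 20, 30], [2, 3, 4, 20, 30, 40]]
```

Each yielded list has length n * m, so it could be passed as a single point to insert_point together with an index of your choosing.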
Thank you very much for your sincere reply, I have solved the above problem perfectly. However, I have the following problems when using RRCF. In Figure 1, it can be seen that there is a segment in the middle of the data (orange line) with obvious abnormalities. However, in the second picture, the highest anomaly score of the anomaly segment is only 0.25, and the anomaly score of the later segments with little anomaly is occasionally 0.25. This makes me very confused.
The shingle format described above (each inserted point flattened to dimension 1 x nm) should be added to the doc example. I didn't see it there, so either I missed it or it is not documented.
Can RRCF obtain a model from the training set data, and then use this model to detect anomalies in the new data stream?