kLabUM / rrcf

🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams
https://klabum.github.io/rrcf/
MIT License
495 stars 112 forks source link

Streaming code #62

Open ashujainml opened 5 years ago

ashujainml commented 5 years ago

Hi,

I tested the streaming code( https://klabum.github.io/rrcf/streaming.html) and found that all trees are similar to each other compared to the batch processing code (https://klabum.github.io/rrcf/batch.html). I spent some time understanding the issue and noticed that if we always remove the oldest point from all trees and insert new point in all trees, all trees will be identical in nature.

If tree is above permitted size...

    if len(tree.leaves) > tree_size:
        # Drop the oldest point (FIFO)
        tree.forget_point(index - tree_size)
    # Insert the new point into the tree
    tree.insert_point(point, index=index)

Am I missing something?

Chmavo commented 4 years ago

I have the same question as @ashujainml. Shouldn't the points inserted in to each tree in the forest be different, similar to what is accomplished in the batch examples with uniform random sampling?

Chmavo commented 4 years ago

@mdbartos if you have time could you please comment on this issue? I am using the same code as the sine wave streaming example, and when I use a seed value for the random number generator, all of the trees are similar. I get the same results with 1 tree as I do with 40 trees, and I see the same behavior with my own sample data.

mdbartos commented 4 years ago

Greetings,

Thanks for the question. Can you tell me what you mean by similar? Do you mean the trees are all exactly the same?

This should not happen unless there's a bug. Even if the trees incorporate the exact same points, the structure of the trees should be different because the partitioning algorithm is randomized.

See, for example:

import numpy as np
import rrcf
np.random.seed(0)

X = np.random.randn(8, 3)

tree_0 = rrcf.RCTree()
tree_1 = rrcf.RCTree()

for index, point in enumerate(X):
    tree_0.insert_point(point, index=index)
    tree_1.insert_point(point, index=index)
print(tree_0)

─+
 ├───+
 │   ├───+
 │   │   ├──(6)
 │   │   └───+
 │   │       ├───+
 │   │       │   ├───+
 │   │       │   │   ├───+
 │   │       │   │   │   ├──(4)
 │   │       │   │   │   └──(2)
 │   │       │   │   └──(0)
 │   │       │   └──(7)
 │   │       └──(5)
 │   └──(1)
 └──(3)
print(tree_1)

─+
 ├───+
 │   ├──(6)
 │   └──(2)
 └───+
     ├───+
     │   ├───+
     │   │   ├───+
     │   │   │   ├──(7)
     │   │   │   └───+
     │   │   │       ├──(3)
     │   │   │       └──(4)
     │   │   └──(5)
     │   └──(0)
     └──(1)

Thus, even with the exact same point set you should be able to build an ensemble. Let me know if this helps.

Thanks, MDB

Chmavo commented 4 years ago

@mdbartos thank you very much for the reply. After reviewing your example and replicating it with my data, I believe the error was in my understanding of how random_state worked when constructing the random-cut trees. If i pass an integer value as random_state to RCTree(), I get the same trees every time when I use the same points. For example, the code below yields identical trees:

tree_0 = rrcf.RCTree(random_state=123)
tree_1 = rrcf.RCTree(random_state=123)
X = np.random.randn(8, 3)
for index, point in enumerate(X):
    tree_0.insert_point(point, index=index)
    tree_1.insert_point(point, index=index)
mdbartos commented 4 years ago

Ah now I see. Yes, this is the reason that random_state was added--it gives you a way of replicating exactly the same tree (for testing purposes, replicating results, etc.). Otherwise you'd get a different tree each time.