Open ashujainml opened 5 years ago
I have the same question as @ashujainml. Shouldn't the points inserted in to each tree in the forest be different, similar to what is accomplished in the batch examples with uniform random sampling?
@mdbartos if you have time could you please comment on this issue? I am using the same code as the sine wave streaming example, and when I use a seed value for the random number generator, all of the trees are similar. I get the same results with 1 tree as I do with 40 trees, and I see the same behavior with my own sample data.
Greetings,
Thanks for the question. Can you tell me what you mean by similar? Do you mean the trees are all exactly the same?
This should not happen unless there's a bug. Even if the trees incorporate the exact same points, the structure of the trees should be different because the partitioning algorithm is randomized.
See, for example:
import numpy as np
import rrcf
np.random.seed(0)
X = np.random.randn(8, 3)
tree_0 = rrcf.RCTree()
tree_1 = rrcf.RCTree()
for index, point in enumerate(X):
tree_0.insert_point(point, index=index)
tree_1.insert_point(point, index=index)
print(tree_0)
─+
├───+
│ ├───+
│ │ ├──(6)
│ │ └───+
│ │ ├───+
│ │ │ ├───+
│ │ │ │ ├───+
│ │ │ │ │ ├──(4)
│ │ │ │ │ └──(2)
│ │ │ │ └──(0)
│ │ │ └──(7)
│ │ └──(5)
│ └──(1)
└──(3)
print(tree_1)
─+
├───+
│ ├──(6)
│ └──(2)
└───+
├───+
│ ├───+
│ │ ├───+
│ │ │ ├──(7)
│ │ │ └───+
│ │ │ ├──(3)
│ │ │ └──(4)
│ │ └──(5)
│ └──(0)
└──(1)
Thus, even with the exact same point set you should be able to build an ensemble. Let me know if this helps.
Thanks, MDB
@mdbartos thank you very much for the reply. After reviewing your example and replicating it with my data, I believe the error was in my understanding of how random_state worked when constructing the random-cut trees. If i pass an integer value as random_state to RCTree(), I get the same trees every time when I use the same points. For example, the code below yields identical trees:
tree_0 = rrcf.RCTree(random_state=123)
tree_1 = rrcf.RCTree(random_state=123)
X = np.random.randn(8, 3)
for index, point in enumerate(X):
tree_0.insert_point(point, index=index)
tree_1.insert_point(point, index=index)
Ah now I see. Yes, this is the reason that random_state was added--it gives you a way of replicating exactly the same tree (for testing purposes, replicating results, etc.). Otherwise you'd get a different tree each time.
Hi,
I tested the streaming code( https://klabum.github.io/rrcf/streaming.html) and found that all trees are similar to each other compared to the batch processing code (https://klabum.github.io/rrcf/batch.html). I spent some time understanding the issue and noticed that if we always remove the oldest point from all trees and insert new point in all trees, all trees will be identical in nature.
If tree is above permitted size...
Am I missing something?