kLabUM / rrcf

🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams
https://klabum.github.io/rrcf/
MIT License
495 stars 112 forks source link

RCTree cannot handle when the data consists of only one unique value #88

Open kongwilson opened 3 years ago

kongwilson commented 3 years ago

I ran into issues when a subset of my sample data points only contain ONE unique value. How should we handle such an exception?

The error message basically suggests a NaN value for probability (caused by division by zero). I tried to turn this into a uniform distribution, but it caused subsequent issue after a cut the right side contains no values. I think this violates the principle of the RRCF algo. Do we have better way of resolving such cases?

File "<ipython-input-2-b3a957a401e5>", line 139, in <listcomp>
    rrcf.RCTree(x[ix], index_labels=ix) for ix in ixs]
  File "C:\ProgramData\Anaconda3\lib\site-packages\rrcf-0.4.3-py3.8.egg\rrcf\rrcf.py", line 106, in __init__
    self._mktree(X, S, N, I, parent=self)
  File "C:\ProgramData\Anaconda3\lib\site-packages\rrcf-0.4.3-py3.8.egg\rrcf\rrcf.py", line 177, in _mktree
    S1, S2, branch = self._cut(X, S, parent=parent, side=side)
  File "C:\ProgramData\Anaconda3\lib\site-packages\rrcf-0.4.3-py3.8.egg\rrcf\rrcf.py", line 159, in _cut
    q = self.rng.choice(self.ndim, p=l)
  File "mtrand.pyx", line 928, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
kongwilson commented 3 years ago

I have a walkaround for this scenario. If the dataset contains only one unique value, the tree can be build by creating an empty tree and insert 'new' data point into it iteratively.

kongwilson commented 3 years ago

The idea of the walkaround is if you reach a single value, instead of creating a tree with all the observations, you insert a point to a empty tree. An example code snippet would be like:

forest = []
while len(forest) < num_trees:
    # Select random subsets of points uniformly from point set
    ixs = np.random.choice(n, size=(n // tree_size, tree_size), replace=False)
    # Add sampled trees to forest
    trees = []
    for ix in ixs:
        if len(np.unique(x[ix])) > 1:
            tree = rrcf.RCTree(x[ix], index_labels=ix)
        else:  # If a subset contains only one value, RCTree will bug
            tree = rrcf.RCTree()
            for point, label in zip(x[ix], ix):
                tree.insert_point(point=point, index=label)
        trees.append(tree)
    forest.extend(trees)
kongwilson commented 3 years ago

Hey buddy,

I have posted the code for the walkaround. Feel free to check it out.

Cheers W

在 2021年5月5日,10:31,titaii2 @.***> 写道:

 I have a walkaround for this scenario. If the dataset contains only one unique value, the tree can be build by creating an empty tree and insert 'new' data point into it iteratively.

Hello :) Can you post your solution code here ? Because I have the same problem when my data is populated by X1 at 98% and X2 at 2% which makes the sample (created with "ixs") only populated by X1 which cause the same error "probabilities contain NaN" while doing "l /= l.sum()"

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.