kLabUM / rrcf

🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams
https://klabum.github.io/rrcf/
MIT License

'branch' variable can be referenced before assignment #56

Closed · sapols closed this issue 5 years ago

sapols commented 5 years ago

I was essentially replicating the streaming example in this repo, except with my own dataset, and the code broke with the error shown in the screenshot below.

(screenshot: traceback ending in the 'branch' referenced-before-assignment error)

My full code is here.

The bug is at line 460 of rrcf.py: local variable 'branch' referenced before assignment. What's going wrong is beyond me. Will someone please look into this? I work at LASP, a laboratory in Boulder, CO, and we're considering using this code in production, but we can't while this bug exists.
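
For reference, here is a minimal sketch of the kind of loop I'm running, modeled on the repo's streaming example (the tree count, tree size, and data below are placeholders, not my actual setup):

    import numpy as np
    import rrcf

    num_trees, tree_size = 40, 256
    forest = [rrcf.RCTree() for _ in range(num_trees)]
    stream = np.random.randn(10_000, 18)  # stand-in for my dataset

    avg_codisp = {}
    for index, point in enumerate(stream):
        for tree in forest:
            # Once a tree is full, forget the oldest point (FIFO)
            if len(tree.leaves) > tree_size:
                tree.forget_point(index - tree_size)
            # This insert is where the 'branch' error is raised
            tree.insert_point(point, index=index)
            avg_codisp[index] = avg_codisp.get(index, 0) + tree.codisp(index) / num_trees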

mdbartos commented 5 years ago

Hi @sapols

Taking a look at this now.

I suspect that it's a bug with how duplicate points are handled.

Thanks, MDB

sapols commented 5 years ago

Thank you for responding so quickly, MDB. Let me know if I can provide any more info or help in any way.

mdbartos commented 5 years ago

Ok. I believe I've found the reason the bug is being triggered.

It's happening when you try to insert a point that is a near-duplicate, but doesn't meet the tolerance criteria for a duplicate in at least one dimension. So, the point that it failed on for me was:

array([24.11841722, 24.11841722, 24.11841722, 24.11841722, 24.11841722,
       24.11841722, 24.11841722, 24.11841722, 24.11841722, 24.11841722,
       24.11841722, 24.11841722, 24.11841722, 24.11841722, 24.11841722,
       24.11841722, 24.11841722, 24.09805235])

The difference between this point and the nearest duplicate was:

array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        , -0.02036487])

So it wasn't identified as a duplicate, but it may be too close to the nearest point for the algorithm to find a cut; I'll have to look into why the cut is failing. Another issue is that it was being inserted into a single-point tree (all duplicates), which might be triggering some kind of edge case.
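
To illustrate (this uses numpy.isclose as a stand-in; the actual tolerance test in rrcf.py may differ):

    import numpy as np

    existing = np.full(18, 24.11841722)   # the all-duplicates leaf
    new = existing.copy()
    new[-1] = 24.09805235                 # differs only in the last dimension

    # Not treated as a duplicate: the tolerance check fails in one dimension...
    print(np.isclose(new, existing).all())   # False

    # ...but the point is extremely close to the existing one, which is the
    # situation where insert_point was failing to find a cut
    print(np.abs(new - existing).max())      # ~0.0204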

Edit: I also just realized that the leaf depth is -1 for all the duplicate leaves, which suggests that this attribute was decremented incorrectly.

Edit 2: The error seems to occur shortly after the first negative depth is encountered, so I believe this is the main problem. I'm still trying to figure out what causes it; sometimes it takes 700,000+ insertions before it occurs.

mdbartos commented 5 years ago

I think I may have found the issue. In forget_point:

        # If parent is the root...
        if parent is self.root:
            # Delete parent
            del parent
            # Set sibling as new root
            sibling.u = None
            if isinstance(sibling, Leaf):
                sibling.d = 0
            self.root = sibling
            # Update depths
            self.map_leaves(sibling, op=self._increment_depth, inc=-1)
            return self.leaves.pop(index)

Should be something like:

        # If parent is the root...
        if parent is self.root:
            # Delete parent
            del parent
            # Set sibling as new root
            sibling.u = None
            self.root = sibling
            # Update depths
            if isinstance(sibling, Leaf):
                sibling.d = 0
            else:
                self.map_leaves(sibling, op=self._increment_depth, inc=-1)
            return self.leaves.pop(index)

Still need to test.
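
In other words, when the promoted sibling is a single Leaf its depth should just be reset to 0, and the blanket depth decrement should only be mapped over an internal subtree. A quick way to exercise the edge case might be something like this (a sketch only, not the final regression test; it assumes duplicate inserts collapse into a single leaf as described above):

    import numpy as np
    import rrcf

    tree = rrcf.RCTree()
    point = np.full(18, 24.11841722)
    # All of these should collapse into one duplicate leaf
    for i in range(10):
        tree.insert_point(point, index=i)

    # The near-duplicate point from above
    near_dup = point.copy()
    near_dup[-1] = 24.09805235
    tree.insert_point(near_dup, index=10)

    # Forget the duplicates; no leaf depth should ever go negative
    for i in range(10):
        tree.forget_point(i)
        assert all(leaf.d >= 0 for leaf in tree.leaves.values())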

mdbartos commented 5 years ago

Hi @sapols,

I believe I've fixed the issue and merged the modified code into master. I still need to upload the new version to PyPI.

Let me know if you are still running into this issue.

A couple notes:

Also, instead of FIFO sampling, I'd probably recommend using something like reservoir sampling: https://en.wikipedia.org/wiki/Reservoir_sampling
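
Roughly what I have in mind (standard Algorithm R, not code from this repo):

    import random

    def reservoir_sample(stream, k, seed=0):
        """Keep a uniform random sample of k items from a stream (Algorithm R)."""
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                # Keep item i with probability k / (i + 1), evicting a random slot
                j = rng.randint(0, i)
                if j < k:
                    reservoir[j] = item
        return reservoir

Applied to the trees, that would mean forgetting the randomly evicted index rather than always the oldest point.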

Thanks, MDB

sapols commented 5 years ago

@mdbartos Thank you for the fix! My dataset doesn't break the algorithm anymore. And I appreciate the notes. I'm a satisfied customer.