Closed sapols closed 5 years ago
Hi @sapols
Taking a look at this now.
I suspect that it's a bug with how duplicate points are handled.
Thanks, MDB
Thank you for responding so quickly, MDB. Let me know if I can provide any more info or help in any way.
Ok. I believe I've found the reason the bug is being triggered.
It's happening when you try to insert a point that is a near-duplicate, but doesn't meet the tolerance criteria for a duplicate in at least one dimension. So, the point that it failed on for me was:
array([24.11841722, 24.11841722, 24.11841722, 24.11841722, 24.11841722,
24.11841722, 24.11841722, 24.11841722, 24.11841722, 24.11841722,
24.11841722, 24.11841722, 24.11841722, 24.11841722, 24.11841722,
24.11841722, 24.11841722, 24.09805235])
The difference between this point and the nearest duplicate was:
array([ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , -0.02036487])
So it wasn't identified as a duplicate, but was possibly too close to the nearest point to successfully find a cut. I will have to look into why the algorithm wasn't able to find a cut. Another issue is that it was trying to be inserted into a single-point tree (all duplicates), which might be triggering some kind of edge case.
Edit: I also just realized that the leaf depth is -1 for all the duplicate leaves, which suggests that this attribute was decremented incorrectly.
Edit 2: The error seems to occur shortly after the first time the first negative depth is encountered, so I believe this is the main problem. Still trying to figure out what causes this to happen. Sometimes it can take 700,000+ insertions for it to occur.
I think I may have found the issue. In forget_point
:
# If parent is the root...
if parent is self.root:
# Delete parent
del parent
# Set sibling as new root
sibling.u = None
if isinstance(sibling, Leaf):
sibling.d = 0
self.root = sibling
# Update depths
self.map_leaves(sibling, op=self._increment_depth, inc=-1)
return self.leaves.pop(index)
Should be something like:
# If parent is the root...
if parent is self.root:
# Delete parent
del parent
# Set sibling as new root
sibling.u = None
self.root = sibling
# Update depths
if isinstance(sibling, Leaf):
sibling.d = 0
else:
self.map_leaves(sibling, op=self._increment_depth, inc=-1)
return self.leaves.pop(index)
Still need to test.
Hi @sapols,
I believe I fixed the issue. I've pulled the modified code into master. Still need to upload to pypi.
Let me know if you are still running into this issue.
A couple notes:
Also, instead of FIFO sampling, I'd probably recommend using something like reservoir sampling: https://en.wikipedia.org/wiki/Reservoir_sampling
Thanks, MDB
@mdbartos Thank you for the fix! My dataset doesn't break the algorithm anymore. And I appreciate the notes. I'm a satisfied customer.
I was essentially replicating the streaming example in this repo except with my own dataset and the code broke with the error shown in the screenshot.
My full code is here.
The bug is at line 460 of
rrcf.py
:local variable 'branch' referenced before assignment
. What's going wrong is beyond me. Will someone please look into this? I work at LASP, a laboratory in Boulder, CO, and we're considering using this code in production, but we can't while this bug exists.