Closed jakobhansen-blai closed 1 year ago
Ah, this appears to be closely related to https://github.com/lmcinnes/umap/issues/99. (Didn't think to check the UMAP issues before, but found a reference in the tests.) The check for `abs(margin) < EPS` looks like it was added in response to that issue. My example is a set of nonidentical data points that nevertheless sneaks by this check.
My proposal is to check for trivial splits and do a random assignment in that case, and also add a depth limit for trees as a failsafe. I'll submit a PR shortly.
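A minimal sketch of that proposal, in plain NumPy rather than the numba code in pynndescent. The names `build_tree`, `split_fn`, `leaf_size`, `max_depth` and `leaves` are hypothetical and chosen for illustration; they are not the library's API. The random fallback here is a balanced random split, which is one way to guarantee that both sides are non-empty.

```python
import numpy as np


def build_tree(indices, split_fn, rng, leaf_size=30, max_depth=100, depth=0):
    """Recursively partition `indices` into a nested (left, right) tuple tree.

    `split_fn(indices, rng)` is a hypothetical callback that returns a boolean
    mask, True meaning "goes to the left child".
    """
    # Stop at small leaves, or at the depth limit (failsafe against runaway recursion).
    if len(indices) <= leaf_size or depth >= max_depth:
        return indices

    goes_left = np.asarray(split_fn(indices, rng), dtype=bool)
    n_left = int(goes_left.sum())

    # Trivial split: every point landed on the same side, so recursing would
    # make no progress. Fall back to a balanced random assignment instead.
    if n_left == 0 or n_left == len(indices):
        goes_left = np.zeros(len(indices), dtype=bool)
        goes_left[rng.permutation(len(indices))[: len(indices) // 2]] = True

    return (
        build_tree(indices[goes_left], split_fn, rng, leaf_size, max_depth, depth + 1),
        build_tree(indices[~goes_left], split_fn, rng, leaf_size, max_depth, depth + 1),
    )


def leaves(tree):
    """Flatten the nested tuple tree into a list of leaf index arrays."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]
```

Even with a split function that always puts everything on one side, the fallback roughly halves each node, so the recursion terminates well before the depth limit would be needed. The depth limit only comes into play if something else goes wrong.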
Minimal(ish) code to reproduce (I have tested this particular example with Python 3.9.16, numba 0.56.4, pynndescent 0.5.8 on ARM macOS, but I don't think the environment should matter much):
What seems to be happening is that floating point rounding errors can lead `angular_random_projection_split()` to always assign all points to one side of the split. If this happens before the maximum leaf size is reached, the tree function recurses infinitely and eventually overflows the stack. I haven't checked if the same is true for `euclidean_random_projection_split()`, but it's at least conceivable that it could happen there as well.

This is a pretty bad dataset, but something like it might reasonably show up as a subset of something larger, and it's not unlikely that all the problematic points would eventually end up in the same leaf. It's probably not reasonable to expect high-quality neighbors on this data, but it shouldn't cause a crash.
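To make the mechanism concrete, here is a simplified NumPy sketch of an angular hyperplane split. It is an illustration only, not pynndescent's numba implementation: the function name `angular_split_sketch`, its signature, and the default tolerance are assumptions, and it assumes no row of `data` is the zero vector.

```python
import numpy as np


def angular_split_sketch(data, left, right, rng, eps=1e-8):
    """Return a boolean mask, True where a row falls on the `left` side.

    `left` and `right` are row indices of the two anchor points. `eps` is an
    illustrative tolerance (assumed value), playing the role of EPS.
    """
    # Normalize the two anchors so that only their directions matter.
    l = data[left] / np.linalg.norm(data[left])
    r = data[right] / np.linalg.norm(data[right])

    # Normal of the hyperplane through the origin that separates the anchors.
    hyperplane = l - r

    # Signed margin of every point with respect to that hyperplane.
    margins = data @ hyperplane
    side = margins > 0

    # Points too close to the hyperplane get a random side.
    near = np.abs(margins) < eps
    side[near] = rng.random(int(near.sum())) < 0.5
    return side


rng = np.random.default_rng(0)
data = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.1], [0.1, 3.0]])
side = angular_split_sketch(data, left=0, right=1, rng=rng)
```

On well-separated data like this, the anchors land on opposite sides, as exact arithmetic says they should. When `left` and `right` point in almost the same direction, `hyperplane` is the difference of two nearly equal unit vectors, so cancellation can leave the margins dominated by rounding noise. That is one plausible way for every margin to end up with the same sign while still exceeding the tolerance.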
A few possibilities for fixing this:
- Increase `EPS` to `1e-5`. However, it seems that even with fairly large values of epsilon, it is still possible that the split function assigns all points to one side, which would not happen with exact arithmetic (since at least the `left` and `right` vectors should be assigned to opposite sides).
- [...] `hyperplane_vector`. With exact arithmetic this wouldn't change the split assignments, but I'm not sure if this makes the procedure more or less numerically stable in general.
- [...] `left` and `right` to the appropriate sides).

I can put together a PR for one or more of these if you let me know what solutions are preferred. Thanks!