lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License
899 stars 105 forks source link

`make_dense_tree()` with `angular=True` can segfault on poorly-behaved datasets #209

Closed jakobhansen-blai closed 1 year ago

jakobhansen-blai commented 1 year ago

Minimal(ish) code to reproduce (I have tested this particular example with Python 3.9.16, numba 0.56.4, pynndescent 0.5.8 on ARM macOS, but I don't think the environment should matter much):

import numpy as np
from pynndescent.rp_trees import make_dense_tree

X1 = np.array(
    [
        100.01764, 100.01883, 100.021, 100.02009, 100.01863, 100.01916, 100.01936, 100.0154,
        100.0147, 100.01754, 100.0152, 100.0174, 100.017204, 100.02037, 100.02076, 100.01631,
        100.016014, 100.02034, 100.01806, 100.01909, 100.015465, 100.01851, 100.02122, 100.020515,
        100.016495, 100.01948, 100.01732, 100.02095, 100.01737, 100.01559, 100.01755, 100.016914,
    ],
    dtype=np.float32,
).reshape((-1, 1))
X = np.concatenate([X1, np.full((32, 1), 100, dtype=np.float32)], axis=1)

rng_state = np.array([2483244587, 2980075795, 3249322652], dtype=np.int64)
make_dense_tree(X, rng_state, angular=True)

What seems to be happening is that floating point rounding errors can lead angular_random_projection_split() to always assign all points to one side of the split. If this happens before the maximum leaf size is reached, the tree function recurses infinitely and eventually overflows the stack. I haven't checked if the same is true for euclidean_random_projection_split(), but it's at least conceivable that it could happen there as well.

This is a pretty bad dataset, but something like it might reasonably show up as a subset of something larger, and it's not unlikely that all the problematic points would eventually all end up in the same leaf. It's probably not reasonable to expect high-quality neighbors on this data, but it shouldn't cause a crash.

A few possibilities for fixing this:

I can put together a PR for one or more of these if you let me know what solutions are preferred. Thanks!

jakobhansen-blai commented 1 year ago

Ah, this appears to be closely related to https://github.com/lmcinnes/umap/issues/99. (Didn't think to check the UMAP issues before, but found a reference in the tests.) The check for abs(margin) < EPS looks like it was added in response to that issue. My example is a set of nonidentical data points that nevertheless sneaks by this check.

My proposal is to check for trivial splits and do a random assignment in that case, and also add a depth limit for trees as a failsafe. I'll submit a PR shortly.